Do you often write scripts to parse a codebase and get familiar with it?
from j4k3@lemmy.world to programming@programming.dev on 09 Sep 2024 05:28
https://lemmy.world/post/19561248

Playing around with the FOSS game Cataclysm DDA, I felt compelled to parse and connect the CPP and JSON to see relationships and complexity. It’s the first time I’ve really felt motivated to do so. I’m just trying to wrap my head around how some features are implemented like z-levels, mining tools and various actions; simple stuff really. I find it challenging to parse something quite this large, so I started scripting a way to track down objects across the code base to see what is defined in JSON and what is hard coded. Normal? Obvious? FOSS alternatives to do this? I’m basically chaining a bunch of grep commands to print pretty trees with bat.
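For anyone wanting to try the same idea without a grep chain, here is a minimal Python sketch of the cross-referencing step. It is not the script described above, only an illustration under stated assumptions: JSON objects carry a string "id" field, the C++ side refers to them as quoted string literals, and the function names (collect_json_ids, find_hardcoded_ids, print_tree) are made up for this example.

```python
import json
import re
from pathlib import Path

# Assumption: ids show up in C++ as plain quoted literals such as "pickaxe".
STRING_LITERAL = re.compile(r'"([A-Za-z0-9_]+)"')


def collect_json_ids(json_root):
    """Map each string "id" found in the JSON files to the files defining it."""
    ids = {}
    for path in sorted(Path(json_root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that do not parse as JSON
        entries = data if isinstance(data, list) else [data]
        for entry in entries:
            # Only top-level objects with a string "id" are handled here.
            if isinstance(entry, dict) and isinstance(entry.get("id"), str):
                ids.setdefault(entry["id"], []).append(path)
    return ids


def find_hardcoded_ids(src_root, ids):
    """Return {id: [(file, line_no), ...]} for ids used as C++ string literals."""
    hits = {}
    for path in sorted(Path(src_root).rglob("*")):
        if not path.is_file() or path.suffix not in {".cpp", ".h"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for literal in STRING_LITERAL.findall(line):
                if literal in ids:
                    hits.setdefault(literal, []).append((path, line_no))
    return hits


def print_tree(ids, hits):
    """Print each id with where it is defined and where the code mentions it."""
    for item_id in sorted(ids):
        print(item_id)
        for path in ids[item_id]:
            print(f"  defined in {path}")
        for path, line_no in hits.get(item_id, []):
            print(f"  referenced in {path}:{line_no}")
```

Calling collect_json_ids on the data directory, passing the result to find_hardcoded_ids along with the source directory, and handing both to print_tree gives a rough "defined in JSON, also named in code" listing. Ids that never appear in the second mapping are purely data driven as far as this crude literal match can tell.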

#programming

FizzyOrange@programming.dev on 09 Sep 2024 06:26

No but I think this is probably a great use case for AI. Haven’t tried it though.

31337@sh.itjust.works on 09 Sep 2024 06:52

Nah, LLMs have severe context window limitations. They start to get wackier after ~1000 LOC.

FizzyOrange@programming.dev on 09 Sep 2024 07:15

Gemini has a 1 million token limit. Also instead of just giving it the entire source you can give it a list of files and the ability to query them (e.g. to read an entire file, or search for usages/definitions of terms etc.).
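A rough sketch of what "a list of files and the ability to query them" could look like on the tooling side, in Python. It deliberately stops short of any model API: the CodebaseTools class and its method names are hypothetical, and how they would be registered as callable tools depends on whichever LLM framework is used. It assumes Python 3.9+ for Path.is_relative_to.

```python
from pathlib import Path


class CodebaseTools:
    """Read-only helpers a model could call instead of receiving all the source."""

    def __init__(self, root, suffixes=(".cpp", ".h", ".json")):
        self.root = Path(root).resolve()
        self.suffixes = suffixes

    def list_files(self):
        """Return project-relative paths of the files worth showing."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file() and p.suffix in self.suffixes
        )

    def read_file(self, rel_path, max_chars=20000):
        """Return the start of one file, capped so it cannot flood the context."""
        path = (self.root / rel_path).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError("path escapes the project root")
        return path.read_text(encoding="utf-8", errors="replace")[:max_chars]

    def search(self, term, max_hits=50):
        """Plain substring search returning (file, line number, line) tuples."""
        hits = []
        for rel in self.list_files():
            text = (self.root / rel).read_text(encoding="utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if term in line:
                    hits.append((rel, line_no, line.strip()))
                    if len(hits) >= max_hits:
                        return hits
        return hits
```

The caps on characters and hits are there because each tool result still lands in the context window, so the tools only help if their output stays small.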

astrsk@fedia.io on 09 Sep 2024 07:41

In my experience, token limits mean nothing on larger context windows. 1 million tokens can easily be taken up by a very small number of complex files. It also doesn’t do great traversing a tree to selectively find context, which seems to be the most limiting factor I’ve run up against when trying to incorporate LLMs into complex and unknown (to me) projects. By the time I’ve sufficiently hunted down and provided the context, I’ve read enough of the codebase to answer most questions I was going to ask.

FizzyOrange@programming.dev on 09 Sep 2024 20:30

Right but presumably you can let the AI do that hunting.

31337@sh.itjust.works on 09 Sep 2024 07:44

Haven’t tried Gemini; may work. But in my experience with other LLMs, even if the text doesn’t exceed the token limit, they make more mistakes and behave strangely more often as the context grows.

j4k3@lemmy.world on 09 Sep 2024 07:42

Yeah, this has been my experience too. LLMs don’t handle project-specific code styles too well either, or cases where there are several ways of doing things.

Actually, earlier today I was asking Mixtral 8x7B about some bash ideas. I kept getting suggestions to use find and sed commands, which I find unreadable and inflexible for my evolving scripts. They are fine for specific tasks, but I’ll move to Python before I want to fuss with either.

Anyways, I changed the starting prompt to something like ‘Common sense questions and answers with Richard Stallman’s AI assistant.’ The results were remarkable and interesting on many levels. From the way the answers always terminated without continuing with another question/answer, to a short footnote about the static nature of LLM learning and capabilities, along with much better quality responses in general, the LLM knew how to respond on a much higher level than normal in this specific context. I think the combination of Stallman’s AI background and bash scripting is a powerful momentum builder here. I tried it on a whim, but it paid dividends and is a keeper of a prompting strategy.

Overall, the way my scripts are collecting relationships in the source code would probably result in a productive chunking strategy for a RAG agent. I don’t think an AI would be good at what I’m doing at this stage, but it could use that info. It might even be possible to integrate the scripts as a pseudo database in the LLM model loader code for further prompting.

degen@midwest.social on 09 Sep 2024 06:44

To grep is to grok.

I have a grepconf alias for a find-grep loop on my NixOS config that comes in handy. Tree-sitter can be a godsend too.

31337@sh.itjust.works on 09 Sep 2024 06:54

I usually just use VS Code to do full-text searches, and write down notes in a note taking app. That, and browse the documentation.

LainTrain@lemmy.dbzer0.com on 09 Sep 2024 07:01

This is a really neat idea. I’m frequently put off by large highly distributed (among files and dependencies) codebases with no obvious entry point. I wanted to make some changes to GNU’s mailutils and the code felt genuinely incomprehensible (BSD’s implementation of mail was a bit easier).

Perhaps another approach is to parse ptrace.

MoogleMaestro@lemmy.zip on 09 Sep 2024 07:31

Even better: look at the git history of certain files to get a broad sense of how they came about and understand their evolution.

I highly advise this practice for familiarizing yourself with parts of a codebase you may otherwise not know anything about. You should git show the interesting commits.

Though combining this with scripting would also be interesting. 🤔
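One way the combination might look, as a hedged Python sketch: wrap git log for a single file and filter commit subjects by keyword. The helper names (parse_log, file_history, commits_mentioning) are invented for this example. The git options used (-C, --follow, --date=short, --format with %h, %ad, %s and %x09 for a tab) are standard, but the script assumes git is on the PATH and that it is pointed at a real repository.

```python
import subprocess

# Short hash, author date, subject, separated by tabs (%x09).
LOG_FORMAT = "%h%x09%ad%x09%s"


def parse_log(text):
    """Turn tab-separated git log output into a list of dicts."""
    commits = []
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) == 3:
            commits.append(
                {"hash": parts[0], "date": parts[1], "subject": parts[2]}
            )
    return commits


def file_history(path, repo="."):
    """Return the commits touching one file, newest first, following renames."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--follow", "--date=short",
         f"--format={LOG_FORMAT}", "--", path],
        capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)


def commits_mentioning(commits, keyword):
    """Keep only commits whose subject contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [c for c in commits if keyword in c["subject"].lower()]
```

The hashes that come back can then be passed to git show one at a time, which keeps the reading focused on commits that actually mention the feature in question.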

0x0@programming.dev on 09 Sep 2024 10:09

The code is my bible, the grep is my friend.

That and breakpoints.

grrgyle@slrpnk.net on 09 Sep 2024 21:51

this but ack