from ChubakPDP11@programming.dev to programming_languages@programming.dev on 19 Apr 2024 06:30
https://programming.dev/post/12971027
Hey. I have made some DSLs before, most of them in C, and for all of them I used Flex and Bison. But this time I wanna use Scheme, Cyclone Scheme to be exact. I could potentially use Flex/Bison this time too, because Cyclone has a strong FFI to C.
But I’d rather innovate. I have been writing down the language’s grammar in EBNF:
gist.github.com/…/bd54df78fe1f71f46cb262ba990a209…
And my thinking is, why not turn this into a parser? Not something like BNFC, which translates BNF (not EBNF) into several parser and lexer specifications plus an AST; I want it to translate EBNF directly into Scheme.
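To make the goal concrete, here is a rough sketch of what "EBNF translated into Scheme" could produce. The rule and all names below are hypothetical, not taken from the gist; it just shows the standard recursive-descent shape such a translation would emit.

```scheme
;; Hypothetical sketch: one EBNF rule,
;;   list = "(" , { atom } , ")" ;
;; compiled into a recursive-descent procedure.
;; Tokens live in a one-slot mutable stream.

(define (make-stream toks) (list toks))
(define (peek s) (if (null? (car s)) 'eof (caar s)))
(define (next s)
  (let ((t (peek s)))
    (set-car! s (cdr (car s)))
    t))
(define (expect s want)
  (let ((t (next s)))
    (if (eq? t want) t (error "expected" want t))))

(define (parse-atom s) (next s))   ; any token counts as an atom here

(define (parse-list s)             ; list = "(" { atom } ")"
  (expect s 'lparen)
  (let loop ((acc '()))
    (if (eq? (peek s) 'rparen)
        (begin (next s) (reverse acc))   ; AST node as an S-expression
        (loop (cons (parse-atom s) acc)))))

;; (parse-list (make-stream '(lparen a b c rparen)))  ; => (a b c)
```

The mapping is mechanical: EBNF repetition { } becomes a loop, concatenation becomes sequenced calls, and alternation would become a cond dispatching on peek, which is why generating this from a grammar is plausible.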
But if you read the grammar, you will realize that there are some places where it's not very precise or machine-friendly. For that reason, I think an LLM can help.
Now, I need your help. I am mostly active in systems programming: Assembly, C, that sort of stuff. I don't know much about LLMs and this whole AI revolution. I did some work as an 'ML-engineer guy' (not an ML engineer, an ML-engineer guy, there's a difference!), so I know roughly how this whole thing works. I have also read MIT's standard book on mathematical optimization.
But I definitely need to use a pretrained model here. My knowledge of mathematical optimization is useless when you need like 28 million to train a model that would aid you with this.
I don’t want to use an API. I wanna own my software. I do use ChatGPT as a search engine, but that’s about it, I never owned Google anyways!
I know about HuggingFace. What model there do you think would help me?
Also, how do these weights work? If I bind one DNN framework to Cyclone, will the weights trained by another DNN framework work in it too? Do people use frameworks not written in C, so I would have to like triple-bind it? I know both Google's and Facebook's are in C. However, they are in 'garbage C'. Well, let's deal with that later.
Anyways, thanks for your help.
TL;DR:
I need an LLM to use in an EBNF -> Scheme parser generator.
Now my compiler can hallucinate, what a time to be alive.
Ok.
I find writing the parser by hand (recursive descent) to be easiest. Sometimes I use a lexer generator, or if there isn’t a good one (idk for Scheme), write the lexer by hand as well. Define a few helper functions and macros to remove most of the boilerplate (you really benefit from Scheme here), and you almost end up writing the rules out directly.
Yes, you need to manually implement choice and figure out what/when to look ahead. Yes, the final parser won't be as readable as a BNF specification. But I find that hand-rolling a parser for most grammars, even with the manual choice and lookahead, is surprisingly easy and fast.
The problem with parser generators is that when they work, they work well, but when they don't (they fail to generate, the generated parser parses the wrong node, the generated parser is very inefficient), it can be really hard to figure out why. A hand-rolled parser is much easier to debug, so when your grammar inevitably has problems, going from spec to a working hand-rolled parser ends up taking less time in total than going from spec to a working generated one.
The hand-rolled rules may look something like this (with user-defined macros and functions define-parse, parse, peek, and next, and some simple rules like con-id and name-id as individual tokens). Since you're using Scheme, you can almost certainly optimize that to reduce even more boilerplate.
Regarding LLMs: if you write the parser with the EBNF as a comment above each rule, you can paste the EBNF in and Copilot will generate the rules for you. Alternatively, you can feed a couple of EBNF/code examples to ChatGPT and ask it to generate more.
In both cases the AI will probably make mistakes on tricky cases, but that's practically inevitable. An LLM embedded in an error-proof code-synthesis engine would be a breakthrough; and while there are techniques like fine-tuning, I suspect they wouldn't improve the accuracy much and would certainly amount to more effort in total (in fact, most LLM 'applications' are just a custom prompt on top of plain ChatGPT or another baseline model).
You cannot believe how thankful I am for this. Thanks man!
I’m not sure I understand your reasoning. Here are the highlights for me.
Why? Scheme is a fine choice in general, but you should only tie yourself to a specific flavor of Scheme if you need specific features…
That’s a good reason! Keep in mind that there are usually multiple possible Schemes. In this case, I’d be aware of CHICKEN Scheme as well.
Scheme what? What sorts of programs? Presumably a lexer, a parser, and an AST-manipulation kit? In a certain sense, that's all one can do with grammars. Don't get me wrong – if you can parse ANTLR grammars, then you have hundreds of popular languages at your fingertips. But I think you need to clarify your three languages: your source, your target, and your implementation. I like to use tombstone diagrams for this.
No, I’m afraid not. Even if all the popular claims about them were true, LLMs wouldn’t help you understand compilers; at best, they might drop some phrases that can be looked up on Wikipedia.
Wants are not needs. The parser generator could be handwritten. This was a painful lesson for me when I was younger; I faffed around with a parser generator for a couple weeks, and then one of the old hands wrote two thousand lines over a week, lexer and parser and test cases.
Yes, I realized as much: a hand-rolled parser is best. Especially after I read Design Concepts in Programming Languages [2008, MIT] and learned how to represent ASTs as S-expressions. Thanks.