LLVM, MLIR (or any ready-made IR) are not a good fit for learners. Roll your own backend pls if you wanna learn (same shit with LP generators!)

from ChubakPDP11@programming.dev to programming_languages@programming.dev on 03 May 2024 22:56
https://programming.dev/post/13611939

These toolchains are created for experts to create industrial-level compilers. Even if you think you’ve got a syntactic design concept that is such hot shit that you can’t wait to get it bootstrapped, even if, hell, it proves Rice’s theorem wrong, please write a simple interpreter for it to prove your syntax works. In fact, I think one way you can test your language’s design is to have it mooch off an established VM like the JVM, CPython’s VM or the CLR.

But if you wanna ‘learn’ compiler design, I beg you to roll your own backend. You don’t need SSA or any of that shit. You don’t need to super-optimize the output at first try. Just make a tree-rewrite optimizer and that’s that.
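
For a sense of scale, here is a minimal sketch of what a tree-rewrite optimizer can look like. It is written in Python rather than OCaml purely for brevity, and the node types (`Num`, `Var`, `BinOp`) and the `rewrite` function are hypothetical names invented for this illustration, not anything from Cephyr. It assumes integer expressions with no side effects, which is what makes the `x * 0` rule safe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str          # "+" or "*" in this sketch
    left: object
    right: object


def rewrite(node):
    """Bottom-up rewrite: simplify the children first, then the node itself."""
    if not isinstance(node, BinOp):
        return node
    left, right = rewrite(node.left), rewrite(node.right)

    # Constant folding: both operands are known at compile time.
    if isinstance(left, Num) and isinstance(right, Num):
        if node.op == "+":
            return Num(left.value + right.value)
        if node.op == "*":
            return Num(left.value * right.value)

    # Algebraic identities (valid because expressions here have no side effects).
    if node.op == "+" and right == Num(0):
        return left
    if node.op == "+" and left == Num(0):
        return right
    if node.op == "*" and right == Num(1):
        return left
    if node.op == "*" and left == Num(1):
        return right
    if node.op == "*" and (left == Num(0) or right == Num(0)):
        return Num(0)

    return BinOp(node.op, left, right)


# (x * 1) + (2 * 3) is rewritten to x + 6
tree = BinOp("+", BinOp("*", Var("x"), Num(1)), BinOp("*", Num(2), Num(3)))
optimized = rewrite(tree)
```

Every rule is a local pattern on one node, so adding an optimization means adding one more `if`. No SSA or dataflow analysis is involved.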

Same is true with LP (lexer/parser) generators. From Yacc to ANTLR, they just make the experience harder and less rewarding. I know hand-rolling a lexer and parser is hard in a language like C, in which case, don’t fucking use it lol. There’s honestly no inherent worth in using C in 2024 for compiler design.

But there’s still use for C in being the subject of your compiler. It’s a very simple, straightforward and, more importantly, standardized language. You don’t need to write a runtime for it, because on both UNIX and Windows the runtime is the OS itself! Just make sure you add a syscall interface and then C runtimes like glibc and the CRT can be easily strapped on.

I’m going to do exactly this. I named my C compiler ‘Cephyr’. I have started over several times now. I am using OCaml.

I know my point about using LP generators is preaching to the choir and most people despise them — but I just don’t understand why people love to use IRs/ILs when doing so teaches you shit.

I recommend beginning to design your language with the IR – your IR.

I don’t just wanna focus on Cephyr. There is other stuff I wanna do, like Nock, a PostScript interpreter in Rust (because GhostScript has made me hard-reset 4-5 times. GhostScript is just insecure, leaky garbage).

Anyways, tell me what you think about my ‘take’ on this. Of course I am not saying you are ‘less knowledgeable’ for using LLVM or MLIR, I’m just saying they don’t teach you stuff.

Still, some people just use LLVM and MLIR as a final ‘portable’ interface, having done the optimization on graphs and trees. I think there should be some sort of ‘retargetable assembly’ language. Like something with a fixed number of registers which… oh wait, that’s just a VM!
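
To make the ‘fixed number of registers’ idea concrete, here is a rough sketch of such a VM in Python. The instruction set (`li`, `add`, `mul`, `jnz`, `halt`), the tuple encoding and the `run` function are all made up for this example and do not correspond to any real VM.

```python
def run(program, num_regs=4):
    """Interpret a list of (opcode, *operands) tuples on a fixed register file."""
    regs = [0] * num_regs
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "li":        # li rd, imm: load an immediate into rd
            regs[args[0]] = args[1]
        elif op == "add":     # add rd, ra, rb: rd = ra + rb
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "mul":     # mul rd, ra, rb: rd = ra * rb
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        elif op == "jnz":     # jnz ra, target: jump to target if ra != 0
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# Factorial of 5: r0 counts down, r1 accumulates, r2 holds the constant -1.
factorial_5 = [
    ("li", 0, 5),
    ("li", 1, 1),
    ("li", 2, -1),
    ("mul", 1, 1, 0),   # index 3: loop body starts here
    ("add", 0, 0, 2),
    ("jnz", 0, 3),
    ("halt",),
]
result = run(factorial_5)[1]
```

A compiler front end would emit a list like `factorial_5` instead of native instructions, and only this interpreter loop would need porting to a new machine.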

Also you don’t need to necessarily translate to a super super low-level language. Just target C. GNU C to be exact. Cyclone does that, in fact, I am planning to bootstrap my functional language, which I named ‘Xeph’, based on Zephyr ASDL, into GNU C as a test. Or I might use JVM. I dunno. JVM languages are big these days.
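
As a rough illustration of how little it takes to target C, the sketch below turns a tiny expression tree into C source text. The tree encoding (ints, variable names, and `(op, left, right)` tuples) and the names `emit_expr` and `emit_function` are assumptions made up for this example; it uses `long` for every value and only plain standard C, no GNU extensions.

```python
def emit_expr(node):
    """Translate an expression tree into a fully parenthesised C expression."""
    if isinstance(node, int):
        return str(node)
    if isinstance(node, str):          # a variable reference
        return node
    op, left, right = node             # a binary operation
    return f"({emit_expr(left)} {op} {emit_expr(right)})"


def emit_function(name, params, body):
    """Wrap an expression in a C function that returns its value."""
    args = ", ".join(f"long {p}" for p in params) or "void"
    return f"long {name}({args}) {{\n    return {emit_expr(body)};\n}}\n"


c_source = emit_function("f", ["x", "y"], ("+", "x", ("*", "y", 2)))
```

Writing `c_source` to a file and handing it to any C compiler gets you optimization and native code generation for free. Parenthesising every subexpression means the emitter never has to reason about C operator precedence.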

PS: Do you guys know any cool VMs I can target beside CPython and JVM? Something with AoT perhaps?

Thanks.

#programming_languages


ericjmorey@programming.dev on 04 May 2024 02:37

Do you guys know any cool VMs I can target beside CPython and JVM?

Erlang, Elixir, and Gleam target the BEAM VM.

ChubakPDP11@programming.dev on 04 May 2024 06:14

Noko

I just remembered Noko! Also, there’s the one which Perl 6 uses. Forgot its name. Noko seems better imho. Need to look into BEAM.

Thanks!

ananas@sopuli.xyz on 05 May 2024 12:10

Good opener!

These toolchains are created for experts to create industrial-level compilers. Even if you think you’ve got a syntactic design concept that is such hot shit that you can’t wait to get it bootstrapped, even if, hell, it proves Rice’s theorem wrong, please write a simple interpreter for it to prove your syntax works. In fact, I think one way you can test your language’s design is to have it mooch off an established VM like the JVM, CPython’s VM or the CLR.

I agree in principle, but mooching off established VMs will affect your overall language design and significantly push you towards the language “grammar” those VMs are built to deal with. Syntax is pretty irrelevant early on and should be made easy to change anyway.

But if you wanna ‘learn’ compiler design, I beg you to roll your own backend. You don’t need SSA or any of that shit. You don’t need to super-optimize the output at first try. Just make a tree-rewrite optimizer and that’s that.

I don’t think I really agree with this. Of course, if your goal is just to learn how everything works, rolling your own is the best option. But if you want to do more than that, I think not taking advantage of the tools available (such as LLVM) is suboptimal at best. It might be fine if you are unemployed or can slack off enough to spend copious amounts of time on your language, but you will spend your time rewriting tiny details over and over. That might be fun, but it won’t help with getting your language into a usable state. There are plenty of optimisations you can (and should) do even before you pass anything on to LLVM, if the goal is to think about those.

Same is true with LP (lexer/parser) generators. From Yacc to ANTLR, they just make the experience harder and less rewarding. I know hand-rolling a lexer and parser is hard in a language like C, in which case, don’t fucking use it lol. There’s honestly no inherent worth in using C in 2024 for compiler design.

I’m not sure I got your point correctly here, but if I did, I heavily disagree. Like it or not, unless you plan to write everything from the hardware up from scratch in your language, you need to adhere to a lot of C stuff. Whatever your system C ABI is, that is the layer your language needs to be able to talk to. And that, for better or worse, requires taking C heavily into account.

But there’s still use for C in being the subject of your compiler. It’s a very simple, straightforward and, more importantly, standardized language. You don’t need to write a runtime for it, because on both UNIX and Windows the runtime is the OS itself! Just make sure you add a syscall interface and then C runtimes like glibc and the CRT can be easily strapped on.

Heavily disagree with C being either simple or straightforward, but totally agree with the important part that it is standardised.

I know my point about using LP generators is preaching to the choir and most people despise them — but I just don’t understand why people love to use IRs/ILs when doing so teaches you shit.

I recommend beginning to design your language with the IR – your IR.

Anyways, tell me what you think about my ‘take’ on this. Of course I am not saying you are ‘less knowledgeable’ for using LLVM or MLIR, I’m just saying they don’t teach you stuff.

Still, some people just use LLVM and MLIR as a final ‘portable’ interface, having done the optimization on graphs and trees. I think there should be some sort of ‘retargetable assembly’ language. Like something with a fixed number of registers which… oh wait, that’s just a VM!

I mean, you answer your own question here. People love to use IRs/ILs because things like LLVM IR are the closest thing we have to portable assembly, and they make it simpler to get into a running state. Most compilers have their own internal IR (maybe even more than one) anyway, regardless of whether they use something like LLVM. Understanding how LLVM/MLIR/whatever IR is designed is pretty important before you start rolling your own if you want it to be useful.

Also you don’t need to necessarily translate to a super super low-level language. Just target C. GNU C to be exact. Cyclone does that, in fact, I am planning to bootstrap my functional language, which I named ‘Xeph’, based on Zephyr ASDL, into GNU C as a test. Or I might use JVM. I dunno. JVM languages are big these days.

Targeting C is completely valid. Though I see no reason to use GNU C – just use the actual standardised versions. And I think what you target should be part of your language design and suit what your language’s purpose is, not stuff to pick at random wheneve

ChubakPDP11@programming.dev on 05 May 2024 23:46

I actually started work on a tool similar to Forth’s VMGEN. It will generate stack VMs in several languages, leveraging m4 just like Bison does. The difference between it and VMGEN would be that it actually adds GC and threading. It’s based on Xiao-Feng Li’s book.

ananas@sopuli.xyz on 06 May 2024 08:58

That sounds interesting, I probably want to take a look at the book too.

porgamrer@programming.dev on 05 May 2024 16:22

My criticism is almost the opposite. I think LLVM is pretty easy and quite a good way of learning about low-level stuff in an incremental way. I just think it sucks to have it as a critical dependency in a mature product.

When I see a new language that is built around llvm I know the build time is going to be terrible.

As an aside, if I were teaching a compilers class I’d look quite seriously at wasm as a target. It’s pretty much the only target that is open-ended, high performance, and portable.

I’ve never used it though, so I don’t know how easy it would be for beginners to make sense of.

ChubakPDP11@programming.dev on 05 May 2024 23:41

Ironically, I just made a thread on Gamedev instance about something like WASM, but for vidya. Except instead of WASM’s S-Expression I recommend a PostScript-like syntax. I gotta learn WASM.

Thanks.

ananas@sopuli.xyz on 06 May 2024 06:12

It’s pretty much the only target that is open-ended, high performance, and portable.

I wouldn’t call wasm really portable. And webassembly doesn’t call itself portable either.

Ruling out everything that isn’t little-endian, and requiring IEEE 754 floating point and forward-progress guarantees, is not really portable anymore.

porgamrer@programming.dev on 06 May 2024 19:37

Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.

This is front and centre on the homepage…

It makes assumptions about the native architecture for the sake of performance, but it’s still portable because you can implement all that behaviour in a VM if necessary. The important thing is that the behaviour is well defined.

It’s not perfect but I don’t think the situation is any worse than in Java, C#, Lua, etc. If your hardware has non-standard floats you’re going to have a bad time with any VM.

ananas@sopuli.xyz on 07 May 2024 04:06

This is front and centre on the homepage…

I stand corrected.

It makes assumptions about the native architecture for the sake of performance, but it’s still portable because you can implement all that behaviour in a VM if necessary. The important thing is that the behaviour is well defined.

It doesn’t have to be native architecture, just the execution environment (i.e. you can emulate it).

But I don’t see why, by that logic, any Turing-complete language wouldn’t be in the group “portable”, including any hardware-specific assembly. Because you can always implement a translator that has well-defined behaviour? E.g. is 6502 assembly portable now that C64 and NES emulators are commonplace? (And there is even a transpiler from x86 asm to 6502 asm.) I do not think many of us see this as a good definition of portability, since it makes the concept meaningless except in cases where the translation layer is not feasible due to available processing power or memory.

It’s not perfect but I don’t think the situation is any worse than in Java, C#, Lua, etc. If your hardware has non-standard floats you’re going to have a bad time with any VM.

I mostly agree here. But I think portable language is not the same thing as a portable VM, and that portability of a language target is different from VM portability.

porgamrer@programming.dev on 07 May 2024 18:02

But I don’t see why, by that logic, any Turing-complete language wouldn’t be in the group “portable”, including any hardware-specific assembly. Because you can always implement a translator that has well-defined behaviour?

What matters is the practical reality. Generally, languages are not portable when they don’t have well-defined behaviour, and when this causes their implementations to differ.

And thanks to this low standard for portability, a lot of VMs and high level languages are portable until you get to the FFI.

E.g. is 6502 assembly portable now that C64 and NES emulators are commonplace?

I would say yes! It’s just that portability is not the only thing required to make a VM spec useful.

But if you lacked other options, you could theoretically build gcc for 6502 assembly once, and then use the same binary to bootstrap gcc on lots of different platforms, specifically thanks to the proliferation of NES emulators.

This would also only work if there is a standard NES API available in all the emulators that is rich enough to back a portable libc implementation. I have no idea about that part.