Performing procedural tasks using a statistical model of our language will never be reliable. There's a reason we use logical, prescriptive syntax when we want deterministic outcomes.
I expect what we'll see are tools where the human manages the high-level implementation, and agents are used to implement specific functionality that can be easily tested and verified. I can see something along the lines of a scene graph, where you focus on the flow of the code and farm off the implementation details of each step to a tool. As the article notes, these tools can already reach over 90% accuracy in these scenarios.
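To make that concrete, here's a rough sketch in Python of what a flow-plus-verified-steps setup could look like. It's entirely hypothetical; generate_step is just a stub standing in for an agent call. The human lays out the pipeline and a check for each step, and a generated implementation is only accepted if it passes its check.

```python
# Hypothetical sketch: the human defines the flow and the verification,
# a tool fills in each step's implementation.

from typing import Callable

def generate_step(description: str) -> Callable:
    """Stub standing in for an agent that would write the step's code."""
    # For illustration only: hand-written implementations keyed by description.
    impls = {
        "parse csv line": lambda s: s.strip().split(","),
        "sum numeric fields": lambda fields: sum(float(x) for x in fields),
    }
    return impls[description]

def build_pipeline(steps: list[tuple[str, Callable]]) -> Callable:
    """steps is a list of (description, check); each generated step must pass its check."""
    impls = []
    for description, check in steps:
        impl = generate_step(description)
        assert check(impl), f"generated step failed verification: {description}"
        impls.append(impl)

    def run(data):
        for impl in impls:
            data = impl(data)
        return data

    return run

pipeline = build_pipeline([
    ("parse csv line", lambda f: f("1,2,3") == ["1", "2", "3"]),
    ("sum numeric fields", lambda f: f(["1", "2", "3"]) == 6.0),
])

print(pipeline("4,5,6"))  # 15.0
```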
I agree that this could be helpful for finally getting to that natural language programming paradigm people have been hoping for. But there's going to have to be something capable of logically implementing the low-level code, and that has to be more than just a statistical model of how people write code, trained on a big collection of random repositories. (I'm open to being proved wrong about that!)
The 90% accuracy could just arise from the fact that the tests are against trivial or commonly solved tasks, so that the exact solutions to them exist in the training set. Anything novel will exist outside the training set and outside of the model.
I think it’s going to be humans that implement actually interesting code while LLMs handle common and tedious stuff. That’s the approach I’ve been using at work. When I need to crap out a UI based on some JSON payload, or make an HTTP endpoint, I let the LLM do it. When I have some actual business logic that’s domain specific, I write that myself. This allows me to focus on writing code that’s actually interesting, while the LLM does all the tedious work.
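For a rough picture of the split I mean, here's a small sketch (Flask is just an example framework, and the route and field names are made up). The JSON-parsing and response plumbing is the kind of boilerplate I'd hand to the LLM, while the pricing rule stands in for the domain-specific part I'd write myself.

```python
# Illustrative only: boilerplate endpoint vs. hand-written business logic.

from flask import Flask, jsonify, request

app = Flask(__name__)

def quote_price(weight_kg: float, distance_km: float) -> float:
    """Domain-specific business logic, written by hand (made-up rule)."""
    base = 4.0 + 0.5 * weight_kg
    return round(base + 0.1 * distance_km, 2)

# Boilerplate endpoint: parse the JSON payload, validate, call the logic, respond.
@app.route("/quote", methods=["POST"])
def quote():
    payload = request.get_json(silent=True) or {}
    try:
        weight = float(payload["weight_kg"])
        distance = float(payload["distance_km"])
    except (KeyError, TypeError, ValueError):
        return jsonify({"error": "weight_kg and distance_km are required numbers"}), 400
    return jsonify({"price": quote_price(weight, distance)})

if __name__ == "__main__":
    app.run()
```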
But doesn’t the LLM sometimes churn out tedious garbage that you have to fix, thus not actually saving time?
That's where the rate of success becomes important. LLMs mostly produce decent code when applied to common cases like the examples I gave above. My experience is that the vast majority of the time it's as good as what you'd write, occasionally needing minor tweaks. However, there's nothing forcing you to use the code they produce either. If the LLM stumbles, you can always fall back to writing the code by hand, which leaves you no worse off than you would've been otherwise. It's all about learning how the tool works and when to use it.
You have to check it every single time, though, erasing any time savings. You’re saving effort, maybe, but not time.
You're absolutely saving time; checking that the code works is far less time-consuming than writing it, especially for stuff like UIs or service endpoints. I literally work with this stuff on a daily basis, and I would never go back. There's also another aspect to it: I personally find it makes my workflow more enjoyable. It lets me focus on things I actually want to work on, while automating a lot of boilerplate I previously had to write by hand. Even if it weren't saving me much time, there's a quality of life improvement here.
METR measured the speed of 16 developers working on complex software projects, both with and without AI assistance. After finishing their tasks, the developers estimated that access to AI had accelerated their work by 20% on average. In fact, the measurements showed that AI had slowed them down by about 20%.
Yes, I've seen this as well. First of all, 16 devs is a tiny sample; a far bigger study would be needed to get any meaningful results here. Second, it really depends on how experienced people are with these tools. It took me a while to identify patterns that actually work repeatably and to develop intuition for the cases where the model is most likely to produce good results.
Well said. I understood next to none of it, but well said.