All those things are true, but there are a few things to add to it.
They are much, much better than any junior developer at writing plausible-looking code or providing plausible-sounding answers that look correct but aren't. They are much better than any junior developer at writing subtle bugs that won't show up for days and will take hours to debug.
You can’t mentor them into being a senior developer.
Yeah, but anyone who is using an LLM to write code and isn't also using it to write tests is nuts: they're just going to get garbage out. An LLM can absolutely write correct code, just not without some kind of feedback loop, even if that loop is a form of oracle testing (where the LLM also writes the tests, and the tests themselves can be wrong). LLMs are also great at enumerating edge cases, which feeds nicely into test coverage. 80% of your prompting work is getting the LLM to write tests with good coverage; maybe 20% is getting it to write code.
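To make the feedback-loop idea concrete, here's a minimal sketch of one form of oracle testing: the candidate implementation (the kind of code an LLM might produce) is checked against a brute-force reference that is obviously correct. The function names and the max-subarray problem are just an illustration, not anything from the thread.

```python
import random

def fast_max_subarray(xs):
    # Kadane's algorithm: the "clever" implementation under test,
    # i.e. the kind of code an LLM might hand you.
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def oracle_max_subarray(xs):
    # Brute force over all subarrays: slow but obviously correct,
    # used only as the oracle in the feedback loop.
    return max(sum(xs[i:j])
               for i in range(len(xs))
               for j in range(i + 1, len(xs) + 1))

# The feedback loop: random inputs, compare candidate vs oracle.
random.seed(0)
for _ in range(200):
    xs = [random.randint(-10, 10) for _ in range(random.randint(1, 8))]
    assert fast_max_subarray(xs) == oracle_max_subarray(xs), xs
```

Note the caveat from the comment above applies here too: if the oracle itself is wrong, the loop happily confirms matching garbage, so the oracle needs to be simple enough to eyeball.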
Junior developers (human) can kind of get by without testing, at least in the short term. The LLM can never get by without testing unless it’s some simple one off logic.
It's definitely true that LLMs require tests, but tests aren't an antidote to what I'm talking about, because LLMs are also good at writing plausible-looking tests that are actually terrible.
They’re also very bad at understanding what edge cases are useful to test and what edge cases can be ignored.
I had a situation just the other day where the LLM produced tests for scores of edge cases, but managed to leave out a specific sequence that caused a very difficult-to-diagnose bug. I'm assuming it decided that sequence wasn't possible.
But it did include dozens of redundant cases that no sane human ever would have.
It's basically oracle testing (because the tests can be wrong). You can also have your tests focus on edge cases to be more effective, and even home in on and specify near pass/fail edges (though I guess that's problem-specific). LLMs are good at listing those, and this one simple step prevents them from just testing random stuff with no concept of coverage. Really it's up to the person writing the prompts to micromanage the LLM; if they aren't very experienced, they won't get very good results with it. They basically need to lay out the test-case selection strategy, and then maybe even annotate afterward which cases they think are critical and must be covered versus which are just nice to cover.
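For what "near pass/fail edges" means in practice: pin test cases one step on either side of each boundary, instead of sampling random inputs. The validator and its limits below are a made-up illustration.

```python
# Hypothetical example: a validator with hard boundaries, and tests
# pinned to the near-pass/near-fail edges of each rule.
MAX_LEN = 64

def is_valid_username(name):
    # Rule 1: length between 1 and MAX_LEN inclusive.
    # Rule 2: alphanumeric characters only.
    return 1 <= len(name) <= MAX_LEN and name.isalnum()

# Each case sits one step from a pass/fail boundary.
edge_cases = [
    ("",                  False),  # just below the lower length bound
    ("a",                 True),   # exactly at the lower bound
    ("x" * MAX_LEN,       True),   # exactly at the upper bound
    ("x" * (MAX_LEN + 1), False),  # just above the upper bound
    ("ab1",               True),   # alphanumeric passes
    ("ab!",               False),  # one non-alphanumeric char fails
]

for name, expected in edge_cases:
    assert is_valid_username(name) == expected, repr(name)
```

Asking the LLM to enumerate cases in exactly this boundary-paired form is the "one simple step" above: every rule gets a near-pass and a near-fail, so coverage is tied to the spec rather than to whatever inputs the model felt like generating.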
I just went through this a couple of weeks ago. My test cases were OK but missing some basic edge cases. I asked it to enumerate the edge cases, but that didn't work well either. I then asked it to list near-positive and near-negative edge cases, and that worked well for my problem. For bugs I found in the code that weren't covered, I asked the LLM to add them and/or mark them critical to cover (the case might have been there already, but I only let it generate 3 tests with a limited-size data set, so not all cases would be covered). That worked well, although I'm sure I could do better (probably by generating more tests, but my current approach is slow because each artifact is generated in a context-isolated agent, so I have to speed that up first).