Even if you pin the seed and spin up your own local LLM, changes to continuous batching at the vLLM level or just a different CUDA driver version will completely break bitwise reproducibility of the floats. Reproducibility in ML generation is a total myth; in prod we only work with the final output anyway
Perfect analogy. Nobody cares how many times you googled "how to center a div" before finally writing proper CSS. Same goes for agents: I only care about the final architectural state and performance, not how the model brain-farted over trivial boilerplate because of a scuffed system prompt
The idea of "saving prompts for reproducibility" is dead on arrival. LLMs are non-deterministic by nature. In a year, they'll deprecate this model's API, and the new version will spit out completely different code with entirely new bugs for the exact same prompt. A prompt isn't source code, it's just a temporary crutch for stochastic generation. And if I have to read 50 pages of schizophrenic dialogue with an LLM just to understand why a specific function exists, that PR gets an instant reject. The artifact is and always will be readable code plus a sane commit message. Dumping a log of hallucinations will only make debugging a nightmare when this Frankenstein inevitably falls apart in prod tbh
If just 16 million examples were enough to significantly boost model quality (as Anthropic claims), then it turns out that data quality beats quantity
Instead of vacuuming petabytes of trash from Common Crawl, you can just take high-quality distillate from a SOTA model and get comparable results. Bad news for anyone betting solely on massive compute clusters and closed datasets
He had the full source code of a working Linux driver that does exactly the same thing, just in a neighboring kernel dialect. The task was to translate, not invent. Sure, it's still impressive (given the difference in kernel APIs), but it's not the same as writing a driver from scratch using only a PDF datasheet. Now, when an AI takes an undocumented Chinese chip and writes a driver by sniffing the bus with a logic analyzer - then I'll call it "reasoning"
To be fair, if you open up driver source code from the vendors themselves, it's often the same hell with magic numbers and missing checks, because "we know what the hardware will return". But you're right on the main point: AI writes C like a very confident junior who skipped the memory safety lectures - it copies the style, but not the discipline. It works as long as you're on the "happy path", but debugging a kernel panic in code like that is going to be painful
I was personally surprised when the agent debugged kernel panics caused by its own code (it has done this many times by now). It just iterates on the stack traces and crash dumps.
The nice part is that when you do see that the code smells, you can ask the agent to rework it, focusing on the specific problems. It's just code; you don't need to dance around hoping the AI will spill some "magic" at you.
Dumping panic traces to an agent works fine if it's just a vanilla page fault at an obvious address. But when your memory gets corrupted by some scuffed DMA sync or a race condition in an interrupt handler, the kernel panics a million clock cycles after the actual bug occurred. The dump is just pure garbage by then, and no LLM is going to untangle it because the root cause context literally isn't in the logs tbh
> The nice part is that, when you do see that the code smells — you ask the agent to rework it, focusing on specific problems.
I think that is the crux of the problem. How do you recognize code smell if you didn't write the code and don't read it? I'm pretty confident even the SPDX header isn't correct.
Above it was said that in a code review, an expert would ask the author to "justify and rework". Clearly, people have always been capable of producing code that wasn't great, regardless of whether they read it or not.
I wouldn't call this "clean-room". The models were trained on all available open source, including that exact original Linux driver. Splitting sessions saves you from direct copy-paste in the current context window, but the weights themselves remember the internal code structure perfectly well. Lawyers still have to rack their brains over this, but for now, it looks more like license laundering through the neural net's latent space than true reverse engineering