Hacker News | veunes's comments

Even if you pin the seed and spin up your own local LLM, changes to continuous batching at the vLLM level or just a different CUDA driver version will completely break bitwise reproducibility of the floats. Reproducibility in ML generation is a total myth; in prod we only work with the final output anyway

Perfect analogy. Nobody cares how many times you googled "how to center a div" before finally writing proper CSS. Same goes for agents: I only care about the final architectural state and performance, not how the model brain-farted over trivial boilerplate because of a scuffed system prompt

The idea of "saving prompts for reproducibility" is dead on arrival. LLMs are non-deterministic by nature. In a year, they'll deprecate this model's API, and the new version will spit out completely different code with entirely new bugs for the exact same prompt. A prompt isn't source code, it's just a temporary crutch for stochastic generation. And if I have to read 50 pages of schizophrenic dialogue with an LLM just to understand why a specific function exists, that PR gets an instant reject. The artifact is and always will be readable code plus a sane commit message. Dumping a log of hallucinations will only make debugging a nightmare when this Frankenstein inevitably falls apart in prod tbh

This is something that should be possible in principle, since the machines underneath are deterministic; it's just a limitation of the implementation.

I think that's mostly right, but Google is still part of the problem because it normalized the idea that the tradeoff should be invisible

A lot of "this service is terrible" turns out to be "I've accumulated ten years of bad habits around this service"

The hardest part of leaving big platforms usually isn't technical, it's psychological

If just 16 million examples are enough to significantly boost model quality (as Anthropic claims), then data quality really does beat quantity

Instead of vacuuming petabytes of trash from Common Crawl, you can just take high-quality distillate from a SOTA model and get comparable results. Bad news for anyone betting solely on massive compute clusters and closed datasets


He had the full source code of a working Linux driver that does exactly the same thing, just in a neighboring kernel dialect. The task was to translate, not invent. Sure, it's still impressive (given the difference in kernel APIs), but it's not the same as writing a driver from scratch using only a PDF datasheet. Now, when an AI takes an undocumented Chinese chip and writes a driver by sniffing the bus with a logic analyzer - then I'll call it "reasoning"

To be fair, if you open up driver source code from the vendors themselves, it's often the same hell with magic numbers and lack of checks because "we know what the hardware will return". But you're right on the main point: AI writes C like a very confident junior who skipped memory safety lectures - it copies the style, but not the discipline. It works as long as you're on the "happy path", but debugging a kernel panic in code like that is going to be painful

I was personally surprised when the agent debugged kernel panics caused by its own code (many times by now). It just iterates from the stack traces and crash dumps. The nice part is that when you do see that the code smells, you can ask the agent to rework it, focusing on specific problems. This is just code, and you don't need to dance around hoping the AI will spill some "magic" at you.

Dumping panic traces to an agent works fine if it's just a vanilla page fault at an obvious address. But when your memory gets corrupted by some scuffed DMA sync or a race condition in an interrupt handler, the kernel panics a million clock cycles after the actual bug occurred. The dump is just pure garbage by then, and no LLM is going to untangle it because the root cause context literally isn't in the logs tbh

> The nice part is that when you do see that the code smells, you can ask the agent to rework it, focusing on specific problems.

I think that is the crux of the problem. How do you know code smell if you don't write it and you don't read it? I'm pretty confident even the SPDX header isn't correct.


Above it was said that in a code review, an expert would ask the author to "justify and rework". Clearly, people have always been capable of producing code that wasn't great, regardless of whether they read it or not.

I wouldn't call this "clean-room". The models were trained on all available open source, including that exact original Linux driver. Splitting sessions saves you from direct copy-paste in the current context window, but the weights themselves remember the internal code structure perfectly well. Lawyers still have to rack their brains over this, but for now, it looks more like license laundering through the neural net's latent space than true reverse engineering

