I think the parent comment is saying "why did the agent produce this bug, and why wasn't it caught?", which is a separate problem from what granular commits solve: finding the bug in the first place.
There is no "why." It will give reasons, but they are bullshit too. Even with the same prompt you may not get it to produce the bug more than once.
If you sell a coding agent, it makes sense to capture all that stuff because you (hopefully) have test harnesses where you can statistically tease out what prompt changes caused bugs. Most projects won't have those, and anyway you don't control the whole context if you are using one of the popular CLIs.
If I have a session history or histories, I can (and have!) mine them to pinpoint where an agent did not implement what it was supposed to, or to understand who asked for a certain feature and why, etc. It complements commits: sessions are more like a court transcript of what was said / claimed, and then you can compare that to what was actually done (the commits).
No, you look at the session to understand what the context was for the code change -- what did you _ask_ the LLM to do? Did it do it? Where did a certain piece of logic go wrong? Session history has been immensely useful to me, and it serves as important documentation of the entire flow of the project. I don't think people should look at session histories at all unless they need to.
I'm not quite sure I understand the logic of this, or how people don't see that claims of "well now everyone is going to be dumber because they don't learn" have been a refrain literally every time a major technological / industrial revolution happens. Computers? The internet? Calculators?
The skills we needed before are just no longer as relevant. It doesn't mean the world will get dumber; it will adapt to the new tooling and paradigm we're in. There are always people who don't like the big paradigm change, who are convinced it's the end of the "right" way to do things, but they always age terribly.
I find I learn an incredible amount from using AI + coding agents. It's a _different_ experience, and I would argue a much more efficient one for understanding your craft.
100%. I have been learning so much faster as the models get better at both understanding the world and explaining it to me at whatever level I am ready for.
Using AI as just a generator is really missing out on a lot.
Well, Opus and Gemini are probably running on multiple H200 equivalents, maybe hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest-generation Nvidia inference hardware, and local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really means "45 minutes per response while my laptop sounds like it's going to take off at any moment") are further behind still. Like they said, you're getting Sonnet 4.5 performance, which is two generations ago; speaking from experience, Opus 4.6 is night and day compared to Sonnet 4.5.
> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.
But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground that may not give you the absolute best performance but will be found broadly acceptable and still be quite viable for a home lab.
Batching helps with efficiency, but you can't fit Opus into anything less than hundreds of thousands of dollars of equipment.
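As a back-of-envelope sketch (the parameter count here is a pure assumption, since frontier labs don't publish model sizes; the ~$30k GPU price is also a rough assumption), the weights alone push you into multi-GPU territory:

```python
import math

# Assumed frontier-scale model size -- NOT a published figure for Opus or
# any other model, just a commonly cited ballpark for illustration.
params = 1e12          # assumed ~1T parameters
bytes_per_param = 1    # fp8 weights

weight_gb = params * bytes_per_param / 1e9   # ~1000 GB of weights alone
h200_hbm_gb = 141                            # HBM capacity of one NVIDIA H200

# Minimum GPUs just to hold the weights, before KV cache and activations.
gpus_needed = math.ceil(weight_gb / h200_hbm_gb)

print(gpus_needed)  # 8 GPUs; at roughly $30k+ each, well into six figures
```

Under these assumptions you need at least an 8-GPU node before serving a single token, which is why the home-lab middle ground means smaller models, not frontier ones.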
Local models are more than a useful middle ground; they are essential and will never go away. I was just addressing the OP's question about why they observed the difference they did: one is an API call to the world's most advanced compute infrastructure, and the other is running on a $500 CPU.
Lots of uses for small, medium, and large models; they all have important places!
I tried this today with this username and other usernames, on this and other platforms, with Claude Code:
- First it told me it couldn't do this, that this was doxxing
- I said: it's for me, I want to see if I can be deanonymized
- Claude said: oh ok sure, and proceeded to do it
It analyzed my profile contents and concluded that there were likely only 5-10 people in the world who would match this profile (it pulled out every identifying piece of information extremely accurately). Basically saying: I don't have access to LinkedIn, but if I did I could find you in like 5 seconds.
Anyway, like others have said: this type of capability has always been around for nation state actors (it's just now frighteningly more effective), but e.g. for your stalker? For a fraudster or con artist? Everyone has a tremendous unprecedented amount of power at their fingertips with very little effort needed.
World models are not a new idea; they come from the "model-free" and "model-based" reinforcement learning paradigms that have been around forever.
Model-free methods (which is what we do now, without world models) do not actually model what _happens_ when you take an action; they simply model how good or bad an action is. This is highly data inefficient but asymptotically performs much better than model-based RL because you don't have modeling biases.
Model-based RL, which is where world models come in, models the transition function T(s, a, s'): I'm in state s and I take action a, so what is my belief about my new state? By doing this you can do long-term planning, so it's not just useful for robotics and video generation but for reasoning and planning more broadly. It's also highly data efficient, and right now, for robotics, that is absolutely the name of the game.
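To make the data-efficiency point concrete: once you have T, you can plan entirely "in imagination", without collecting any new experience. A minimal tabular sketch (the 3-state MDP here is made up for illustration) using value iteration over a known transition model:

```python
import numpy as np

# Toy tabular MDP: 3 states, 2 actions. T[s, a, s'] is the probability of
# landing in state s' after taking action a in state s -- this is the
# "world model". All numbers are made up for illustration.
T = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],   # transitions from state 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]],   # transitions from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([0.0, 0.0, 1.0])  # reward for arriving in each state
gamma = 0.95                    # discount factor

# Planning with the model (value iteration): because we know T, we can
# evaluate the consequences of actions without ever taking them.
V = np.zeros(3)
for _ in range(200):
    # Q[s, a] = expected reward on arrival + discounted value of successor
    Q = (T * (R + gamma * V)).sum(axis=2)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy plan derived purely from the model
```

A model-free learner would need many real interactions to discover the same policy; here action 1 (which steers toward the rewarding absorbing state) falls out of pure computation on T.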
What you will see is: approximately zero robots, then approximately one crappy robot (once you get performance + reliability to jusssst cross the boundary where you can market it, even at a loss! and people will buy it and put it in their homes). Once that happens you get the magic: data flywheel for robotics, and things start _rapidly_ improving.
Robotics is where it is because it lacks the volume of data we have on the internet. For robotics today it's not only e.g. egocentric video that we need, but also _sensor-specific_ and _robot-specific_ data (robot A has a different build + components than robot B).
Yes perfect advice -- negotiate from a position of leverage and always ask for that little bump at the end.
In my experience, not talking about salary early sets everyone up to waste their time. One time it ended with a full interview process that went very well, for a job I thought would be perfect in an industry that _should_ have outstanding pay, and the resulting offer was lower than my current role: paid hourly, without benefits, with a vague promise to later become an FTE. Not only did we all waste our time, I was pretty upset about it. When I sent an email to the hiring manager, they said "well, you never told us your expectations." Now, that was dumb: he _knew_ I had a good job already, the comp he was offering was well below industry standard, and he knew my background. But nevertheless, that is where a lot of hiring folks' heads are at, and it makes total sense: they want to get a good deal, just like you do.
Asking for the salary band is good, especially earlier in your career, but to me it's now kind of irrelevant: for the same reason you will go high, they will try to go low. I have a price I will be happy at; I say a higher number at the beginning, but add that depending on how everything goes there may end up being flexibility, and that I look at the entire package holistically. This is just to assess "is it worth us continuing to engage?" Not doing this would have wasted a colossal amount of time.
I'm now in a position where I know where salaries generally are in different parts of the industry, so I can set a price based on what I expect and what my current role pays, and I explain my reasoning. But yes, it depends so much on the situation. Are the benefits good? Is the growth potential at the startup good? Do I believe in the mission, and that the founder won't abandon it for an acquihire and tank my equity? Etc.
It's shocking to me that people make this claim as if humans, especially in some legacy accounting system, would somehow be much better at (1) recognizing their mistakes, and (2) even when they don't, not fat-fingering their implementation. The criticisms of agents are valid, but the incredulity that they will ever be used in production or high-risk systems is, to me, just as incredible. Of course they will -- where is Opus 4.6 compared to Sonnet 4? We've hit an inflection point where replacing hand coding with an agent and interacting only via prompt is not only doable, highly skilled people are already routinely doing it. Companies are already _requiring_ that people do it. We will then hit a point, some time soon, where the incredulity at using agents even in the highest-stakes applications will age really, really poorly. Let's see!
Your point is the speculative one, though. We know humans can and have built incredibly complex and reliable systems. We do not have the same level of proof for LLMs.
Claims like yours should wait at least 2-3 years, if not 5.
That is also speculative. Well, let's just wait and see :) but the writing is on the wall. If your criticism is about where we're at _now_, and whether or not _today_ you should be vibe coding in highly complex systems, I would say: why not, as long as you hold that code to the same standard as human-written code? If you say "well, reviews don't catch everything," OK, but the same is true for humans. Yes, large teams of people (and maybe smaller teams of highly skilled people) have built wonderfully complex systems far out of reach of today's coding agents. But your median programmer is not going to be able to do that either.
I just cannot fathom how people can say something like this today; agentic tools have now passed an inflection point. People want to point out the shortcomings and fully ignore that you can now make a fully functioning iPhone app in a day without knowing Swift or front-end development? That at my company I can do two projects simultaneously, both done in about 1/4 the time, one of which would not even have been attempted before due to the SWE headcount you would have had to steal? There are countless examples in my own personal projects that are such obvious counterexamples to the moaning from the "I appreciate the craft" people, or "yeah, this will never work because people still have to read the code" (today, sure, and that is now made more manageable by good quality agents; tomorrow, no. You won't need to read the code.)
I've found that the effort required to get a good outcome is roughly equal to the effort of doing it myself.
If I do it myself, I get the added bonus of actually understanding what the code is doing, which makes debugging any issues down the line way easier. It's also generally better for teams b/c you can ask the 'owner' of a part of the codebase what their intuition is on an issue (trying to have AI fill in for this purpose has been underwhelming for me so far).
Trying to maintain a vibecoded codebase essentially involves spelunking through an unfamiliar codebase every time manual action is needed to fix an issue (including reviewing/verifying the output of an AI tool's fix for the issue).
(For small/pinpointed things, it has been very good. e.g.: write a python script to comb through this CSV and print x details about it/turn this into a dashboard)
There are other things that are very good "at some range of common tasks": for example, Stack Overflow snippets, libraries, bash spaghetti, and even some no-code/low-code tools.
In sonnet 4 and even 4.5 I would have said you are absolutely right, and in many cases it slows you down especially when you don’t know enough to sniff trouble.
Opus 4.5 and 4.6 are where those instances have gone down, waaay down (though they still happen). Two personal projects I had abandoned after Sonnet built a large pile of semi-working cruft it couldn't quite reason about, Opus 4.6 does in almost one shot.
You are right about learning, but consider: you can educate yourself along the way. In some cases there's no substitute for writing the code yourself, but in many cases you learn a ton more, because it's an excellent teacher and you can try out ideas to see which work best, or get feedback on them. I feel I have learned a TON about the space, though unlike when I code it myself I may not be extremely comfortable with the details. I would argue we are about 30% of the way to the point where writing things yourself is not just less relevant, it's a disservice to your company.
You’re talking about a 1970s satellite? I guess you win the argument?
Blog: I use AI to make mine, and blog developers are using agentic tools
X-ray machine: again, a little late here; plus, if you want to start dragging in places that likely have a huge amount of bureaucracy, I don't know that that's very fair
Firmware in your toaster: c'mon, these are old, basic things. If it's new firmware, maybe? But probably not. These are not strong examples
NYSE actioning stock trades: no, they don't use AI to action stock trades (that would be dumb, slow, horribly inefficient, and non-deterministic), but they may very well now be using AI to work on the codebase that does
Let's try to find more impactful examples than small embedded components in toasters and 1970s-era space probes that are already past our solar system.
I'm saying you're missing the point and the spirit of the argument. Yes, you are right, Voyager doesn't use agentic AI! I don't even think the other examples you used are as agent-free as you think. They may or may not be! What's the point you want to make?
Your point about the overwhelming proliferation of AI tools, and not knowing which are worth any attention and which are trash, is very true; I feel that a lot today (my solution is basically to just lean into one or two and ask for recommendations on other tools, with mixed success).
The "I'm so tired of being told we're in another paradigm shift" comments are widely heard and upvoted on HN, and are just so hard to comprehend today. They are not seeing the writing on the wall and following where the ball is going to be even in 6-12 months. We have scaling laws, multiple METR benchmarks, and internal and external evals of a variety of flavors.
"Tools like Codex can be useful in small doses": the best and most prestigious engineers I know, inside and outside my company, write virtually no code by hand. I'm not one of them, but I also do not hand-write code at all anymore. Agents are sufficiently powerful to justify and explain themselves and walk you through as much of the code as you want them to.
Yeah, I’m not disputing that AI-assisted engineering is a real shift. It obviously is.
My issue is that we’ve now got a million secondary “paradigm shifts” layered on top: agent frameworks, orchestration patterns, prompt DSLs, eval harnesses, routing, memory, tool calling, “autonomous” workflows… all presented like you’re behind if you’re not constantly replatforming your brain.
Even if the end-state is “engineers code less”, the near-term reality for most engineers is still: deliver software, support customers, handle incidents, and now also become competent evaluators of rapidly changing bot stacks. That cognitive tax is brutal.
So yes, follow where the ball is going. I am. I’m just not pretending the current proliferation is anything other than noisy and expensive to keep up with.