More

bgirard · 2026-02-10T20:10:55 1770754255

Your experience sounds exactly like mine. My son is very autistic as well. I've had to cut off friends with families because either their didn't understand meltdown and were incredibly judgy because they were blaming my parenting for his ASD meltdowns, or others because my autistic son was a "bad influence". God forbid their (later diagnosed) kid have some exposure to a child with different neurodiversities.

That's not even going into my traumatic health care experience to getting my son help when he needed it.

So now I have all the hardships of raising a family, and I'm restricted friendship within the small ND accepting community of my area. So my support network is incredibly small and I barely get any support. It sucks.

Reading the responses to your story that are nitpicking it over your daycare experience is a perfect representation of the problems that families face.

bgirard · 2026-02-10T15:51:29 1770738689

That's a good question. As someone bootstraping a few projects on Vercel this post has me looking over at the pricing sheet more closely.

bgirard · 2026-02-09T06:25:42 1770618342

This to me sounds a lot like the SpaceX conversation:

- Ohh look it can [write small function / do a small rocket hop] but it can't [ write a compiler / get to orbit]!

- Ohh look it can [write a toy compiler / get to orbit] but it can't [compile linux / be reusable]

- Ohh look it can [compile linux / get reusable orbital rocket] but it can't [build a compiler that rivals GCC / turn the rockets around fast enough]

- <Denial despite the insane rate of progress>

There's no reason to keep building this compiler just to prove this point. But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.

We're going to see a lot of disruption come from AI assisted development.

jeffreygoesto · 2026-02-09T07:23:03 1770621783

All these people that built GCC and evolved the language did not have the end result in their training set. They invented it. They extrapolated from earlier experiences and knowledge, LLMs only ever accidentally stumble into "between unknown manifolds" when the temperature is high enough, they interpolate with noise (in so many senses). The people building GCC together did not only solve a to technical problem. They solved a social one, agreeing on what they wanted to build, for what and why. LLMs are merely copying these decisions.

bgirard · 2026-02-09T08:09:36 1770624576

That's true and I fully agree. I don't think LLMs' progress in writing a toy C compiler diminishes the achievements that the GCC project did.

But also we've just witnessed LLMs go from being a glorified line auto-complete tool to it writing a C compiler in ~3 years. And I think that's something. And noting how we keep moving the goal post.

direwolf20 · 2026-02-09T11:34:42 1770636882

GP: "it didn't write a C compiler, it copied other compilers. Writing one from scratch is a lot harder."

You: "but look! It wrote a C compiler!"

bwfan123 · 2026-02-09T17:37:01 1770658621

The pattern matching rote-student is acing the class. No surprises here. There is no need to understand the subject from first principles to ace tests. Majority of high-school and college kids know this.

yourapostasy · 2026-02-09T14:26:15 1770647175

> LLMs are merely copying these decisions.

This I strongly suspect is the crux of the boundaries of their current usefulness. Without accompanying legibility/visibility into the lineage of those decisions, LLM's will be unable to copy the reasoning behind the "why", missing out on a pile of context that I'm guessing is necessary (just like with people) to come up to speed on the decision flow going forward as the mathematical space for the gradient descent to traverse gets both bigger and more complex.

We're already seeing glimmers of this as the frontier labs are reporting that explaining the "why" behind prompts is getting better results in a non-trivial number of cases.

I wonder whether we're barely scratching the surface of just how powerful natural language is.

itsyonas · 2026-02-09T07:48:16 1770623296

All right, but perhaps they should also list the grand promises they made and failed to deliver on. They said they would have fully self-driving cars by 2016. They said they would land on Mars in 2018, yet almost a decade has passed since then. They said they would have Tesla's fully self-driving robo-taxis by 2020 and human-to-human telepathy via Neuralink brain implants by 2025–2027.

> - <Denial despite the insane rate of progress>

Sure, but not by what was actually promised. There may also be fundamental limitations to what the current architecture of LLMs can achieve. The vast majority of LLMs are still based on Transformers, which were introduced almost a decade ago. If you look at the history of AI, it wouldn't be the first time that a roadblock stalled progress for decades.

> But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.

Okay, so at that point, we would have proved that AI can replicate an existing software project using hundreds of thousands of dollars of computing power and probably millions of dollars in human labour costs from highly skilled domain experts.

jopsen · 2026-02-09T21:22:20 1770672140

There's an argument to be made that replicating existing software is extremely useful.

Most of the time when you're writing a compiler for a new language, you'll be doing things that have been done before.

Because most of the concepts in your language are brought along from somewhere else.

That said: I'd always want a compiler and language designs to be well considered. Ideally, the authors have some proofs of soundness in their heads.

Perhaps LLM will make formal verification more feasible (from a cost perspective) and then our mind about what reliable software is might change.

raincole · 2026-02-09T06:43:32 1770619412

> the insane rate of progress

Yeah but the speed of progress can never catch the speed of a moving goalpost!

wrxd · 2026-02-09T07:02:12 1770620532

What about the hype? If you claim your LLM generated compiler is functionally on par with GCC I’d expect it to match your claim.

I still won’t use it while it also matches all the non-functional requirements but you’re free to go and recompile all the software you use with it.

friendzis · 2026-02-09T06:55:06 1770620106

> Yeah but the speed of progress can never catch the speed of a moving goalpost!

How do you like those coast-to-coast self drives since the end of 2017?

samultio · 2026-02-09T07:58:41 1770623921

Training data only teaches it how to reach the goalpost, not how to overtake it.

codethief · 2026-02-09T11:46:11 1770637571

Are we sure about that? I mean, we have seen that LLMs are able to generalize to some degree. So I don't see a reason why you couldn't put an agent in a loop with a profiler and have it try to optimize the code. Will it come up with entirely novel ideas? Unlikely. Could it potentially combine existing ideas in interesting, novel ways that would lead to CCC outperforming GCC? I think so. Will it get stuck along the way? Almost certainly.

andriamanitra · 2026-02-09T12:44:43 1770641083

Would you want it to? The further the goal posts are the more progress we are making, and that's good, no? Trying to make it into a religious debate between believers and non-believers is silly. Neither side can predict the future, and, even if they could, winning the debate is not worth anything!

What is interesting is what can do with LLMs today and what we would like them to be able to do tomorrow so we can keep developing them into a good direction. Whether or not you (or I) believe it can do that thing tomorrow is thoroughly uninteresting.

gjulianm · 2026-02-09T11:31:58 1770636718

The goalpost is not moving. The issue is that AI generates code that kinda looks ok but usually has deep issues, specially the more complex the code is. And that's not being really improved.

Ygg2 · 2026-02-09T06:44:26 1770619466

You can be wrong on every step of your approximation and still be right in the aggregate. E.g. order of magnitude estimate, where every step is wrong but mistakes cancel out.

Human crews on Mars is just as far fetched as it ever was. Maybe even farther due to Starlink trying to achieve Kessler syndrome by 2050.

forty · 2026-02-09T06:59:25 1770620365

There are two questions which can be asked for both. The first one is "can these tech can achieve their goals?" which is what you seem debating. The other question is "is a successful outcome of these tech desirable at all?". One is making us pollute space faster than ever, as if we did not fuck the rest enough. They other will make a few very rich people even richer and probably everyone else poorer.

Interesting that people call this "progress" :)

littlestymaar · 2026-02-09T07:03:00 1770620580

> This to me sounds a lot like the SpaceX conversation

The problem is that it is absolutely indiscernible from the Theranos conversation as well…

If Anthropic stopped making lies about the current capability of their models (like “it compiles the Linux kernel” here, but it's far from the first time they do that), maybe neutral people would give them the benefit of the doubt.

For one grifter that happen to succeed at delivering his grandiose promises (Elon), how many grifters will fail?

gordonhart · 2026-02-09T13:54:39 1770645279

The difference I see is that, after "get to orbit", the goalposts for SpaceX are things that have never been done before, whereas for LLMs the goalposts are all things that skilled humans have been able to do for decades.

benreesman · 2026-02-09T07:42:03 1770622923

AI assist in software engineering is unambiguously demonstrated to some done degree at this point: the "no LLM output in my project" stance is cope.

But "reliable, durable, scalable outcomes in adversarial real-world scenarios" is not convincingly demonstrated in public, the asterisks are load bearing as GPT 5.2 Pro would say.

That game is still on, and AI assist beyond FIM is still premature for safety critical or generally outcome critical applications: i.e. you can do it if it doesn't have to work.

I've got a horse in this race which is formal methods as the methodology and AI assist as the thing that makes it economically viable. My stuff is north of demonstrated in the small and south of proven in the large, it's still a bet.

But I like the stock. The no free lunch thing here is that AI can turn specifications into code if the specification is already so precise that it is code.

The irreducible heavy lift is that someone has to prompt it, and if the input is vibes the output will be vibes. If the input is zero sorry rigor... you've just moved the cost around.

The modern software industry is an expensive exercise in "how do we capture all the value and redirect it from expert computer scientists to some arbitrary financier".

You can't. Not at less than the cost of the experts if the outcomes are non-negotiable.

a1o · 2026-02-09T12:20:55 1770639655

What is FIM ?

delaminator · 2026-02-09T10:32:57 1770633177

In 1908 the Model T could do 45mph.

In 1935 the Auburn 851 S/C Speedster hit 100mph

In 1955 the Mercedes-Benz 300 SL Gullwing did 161mph

In 2025 the Yangwang U9 Xtreme hit 308mph

progress is a decaying exponential - Tsiolkovsky's tyranny

CleaveIt2Beaver · 2026-02-09T14:48:34 1770648514

And all these improvements past 1935 have been rendered irrelevant to the daily driver by safety regulations (I'll limit this claim to most of the continental US to avoid straying beyond my experience.)

a1o · 2026-02-09T12:23:20 1770639800

These specific points look like a line if you plot

bgirard · 2026-02-06T21:41:08 1770414068

I like how the author shared the prompt + conversation transcripts. I wish OAI / Anthropic would do that when they share content demos.

bgirard · 2026-02-05T19:58:31 1770321511

Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix, for a trivial CSS fix not so much. I've spent weeks+ of my career fix single bugs. Context is everything.

smith7018 · 2026-02-05T20:21:10 1770322870

Sure, but I've never experienced a 20 minute wait with CC before. It was an architectural question but it would have taken a couple minutes with a definitive answer on 4.5.

sejje · 2026-02-06T19:11:11 1770405071

> I've spent weeks+ of my career fix single bugs.

Same, same. It's not a useful data point at all.

bug: llm alignment

timeframe to fix : probably never

bgirard · 2026-02-05T19:49:40 1770320980

> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across version. Is this a 3-prompt 10m token game? a 30-prompt 100m token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

veb · 2026-02-05T19:54:07 1770321247

I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.

bgirard · 2026-02-05T20:03:00 1770321780

Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.

This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.

gspetr · 2026-02-06T03:46:54 1770349614

Any estiimates on how much it cost you? In terms of total real world time, money, and time spent by the agents.

bgirard · 2026-02-06T05:38:12 1770356292

About ~$300: $200 for Claude max subscription $20 for Vercel $20 for Codex $20 for Meshy

I think these days the $200 Max subscription wouldn't be needed. I bet with these latest models you can make due with mixing two $20/mo subscriptions.

Real time was 2 weeks of watching the agents while watching TV and playing games, waiting for limit resets, etc... Very little decided focused time.

bgirard · 2026-02-03T17:24:32 1770139472

The switching cost is so low that I find it's easier and better value to have two $20/mo subscription from different providers than a $200/mo subscription with the frontier model of the month. Reliability and model diversity are a bonus.

davedx · 2026-02-04T12:21:31 1770207691

Yes that's exactly what I have too.

bgirard · 2026-01-29T17:01:56 1769706116

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.

I care about -expected- performance when picking which model to use, not optimal benchmark performance.

Aurornis · 2026-01-29T17:38:58 1769708338

Non-determinism isn’t the same as degradation.

The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.

In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.

bonoboTP · 2026-01-29T20:05:22 1769717122

This has nothing to do with overloading. The suspicion is that when there is too much demand (or they just want to save costs), Anthropic sometimes uses a less capable (quantized, distilled, etc) version of the model. People want to measure this so there is concrete evidence instead of hunches and feelings.

To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.

F7F7F7 · 2026-01-29T22:54:37 1769727277

“Just drink the water, it’s all water.”

novaleaf · 2026-01-29T17:41:10 1769708470

this is about variance of daily statistics, so I think the suggestions are entirely appropriate in this context.

bgirard · 2026-01-28T20:59:05 1769633945

Revenue per corporate employee are both ~1.5m/corp employee. I think it really comes down to how the company is managed and culture.

socalgal2 · 2026-01-28T21:30:18 1769635818

Nintendo Japan pays its engineers ~$62k a year

https://www.portal.e2r.jp/fixurl/nintendo_career_job/id/4/2?...

Amazon starts at 3x that and goes up

https://www.levels.fyi/companies/amazon/salaries

virtue3 · 2026-01-28T22:09:36 1769638176

Some additional context to explain the numbers:

Game industry in general pays like shit - japanese software engineering pays even worse, so double negative modifiers on salaries in this comparison.

source: worked in games, worked at japanese tech company with us division.

no_wizard · 2026-01-29T01:12:21 1769649141

You need to compare the US division to get a more accurate comparison. Nintendo US pays quite well.

socalgal2 · 2026-01-29T02:28:52 1769653732

I don't not. The person making the claim they weren't going to fire anyone was the president of Nintendo Japan

bgirard · 2026-01-27T16:42:47 1769532167

When you have a personality disorder like NPD, you'll believe to your core that every criticism of you is a lie.

When you're in an abusive relationship they say intentions don't matter, only impact does. Because victims often focus on the intentions of their abuser and stay in the cycle of abuse.

Let me repeat it, intentions don't matter, only impact does.