dcre's comments | Hacker News

Just because there is someone who could understand a given system, that doesn’t mean there is anyone who actually does. I take the point to be that existing software systems are not understood by anyone most of the time.

“Shrinking since the election”, while technically true, is misleading because the election is when bsky experienced a massive spike in usage, well over double its pre-election average. Usage has been gradually decaying since then toward a steady level that is still much higher than before the election.

If you zoom out to a few years you can see the same pattern over and over at different scales — big exodus event from Twitter followed by flattening out at a level that is lower than the spike but higher than the steady state before it. At this point it would make sense to say this is just how Bluesky grows.

https://bsky.jazco.dev/stats

Besides that, the entire point of this project is to increase the barrier to entry for potential contributors (while ideally giving good new people a way in). So I really don’t think they’re worried about this problem.


>At this point it would make sense to say this is just how Bluesky grows.

>https://bsky.jazco.dev/stats

If you zoom out the graph all the way you'll see that it's a decline for the past year. The slight uptick in the past 1-2 months can probably be attributed to other factors (e.g. ICE protests riling up the left) rather than to "[filter bubble] is how bluesky grows".


That’s what I said: it’s technically true but misleading.

Yes, but GPT-5.2 and Codex were widely considered slower than Opus before that. They still feel very slow, at least on high. I should try medium more often.

I think this is underrating the role of intuition in working effectively with deterministic but very complex software systems like operating systems and compilers. Determinism is a red herring.

This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data.

I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there.


> it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there

The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in different areas where it was needed (rather than loading it once and passing a reference).

However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.


  > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

The author attributes the past year's degradation of LLM code generation to excessive use of a new source of training data, namely users' code-generation conversations.


Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.

And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.

My evidence that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.


  > Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.
I see "No True Scotsman" argument above.

  > My evidence that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said.
Reinforcement learning reinforces what is already in the LM: it narrows the search toward answers the model already favors, while the wider search of a non-RL-tuned base model turns up more correct answers [1]. (A toy sketch of the effect is below.)

[1] https://openreview.net/forum?id=4OsgYD7em5
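
A toy illustration of the effect (my own sketch with made-up numbers, not taken from [1]): an RL-tuned model that piles probability onto one favored answer wins at pass@1, but a base model that samples more diversely overtakes it at larger k.

  import random

  random.seed(0)
  TRIALS = 20000

  def pass_at_k_base(k):
      # Base model: diverse samples, each an independent 30%-likely-correct guess.
      return sum(any(random.random() < 0.30 for _ in range(k))
                 for _ in range(TRIALS)) / TRIALS

  def pass_at_k_rl(k):
      # RL-tuned model: it almost always emits one favored solution per problem.
      # That solution is correct on 55% of problems; extra samples add little,
      # because they are mostly repeats of the same answer.
      hits = 0
      for _ in range(TRIALS):
          favored_correct = random.random() < 0.55
          attempts = (favored_correct if random.random() < 0.95
                      else random.random() < 0.10  # rare off-distribution guess
                      for _ in range(k))
          hits += any(attempts)
      return hits / TRIALS

  for k in (1, 32):
      print(f"pass@{k}: base={pass_at_k_base(k):.2f} rl={pass_at_k_rl(k):.2f}")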

  > I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.
The sources of training data have already been the subject of allegations, even lawsuits. So I suspect that no engineer from any LLM company would disclose anything about their sources of training data beyond the innocent-sounding "synthetic data verified by ourselves."

From my days working on blockchains, I am very skeptical of any company riding a hype wave. They face enormous competition, and they will buy, borrow, or steal to avoid going down even a little. So, until Anthropic opens up how they train their model so that we can reproduce their results, I will suspect they leaked the test set into it and used users' code-generation conversations as a new source of training data.


That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.

  >>> It is a very contrived under-specified prompt.
No True Prompt can be so contrived and underspecified.

The article about degradation is a case study (a single prompt), the weakest kind of study in the hierarchy of evidence. Case studies are the basis for further, more rigorous studies. And the author took the time to test his assumptions and presented quite clear evidence that such degradation might be present and that we should investigate.


We have investigated. Millions of people are investigating all the time and finding that the coding capacity has improved dramatically over that time. A variety of very different benchmarks say the same. This one random guy’s stupid prompt says otherwise. Come on.

As far as I remember, the article stated that he found the same problematic behavior across many prompts, issued by him and his colleagues. The "stupid prompt" in the article is for demonstration purposes.

But that’s not an argument, that’s just an assertion, and it’s directly contradicted by all the more rigorous attempts to do the same thing through benchmarks (public and private).

Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple, boring, linear code. But they output complete nonsense when presented with some compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums (a sketch of the kind of thing I mean is below).
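
To be concrete, this is roughly the kind of compact-but-nested thing I mean (a made-up example for illustration, not from any benchmark): a small hierarchical NumPyro model with a plate, partial pooling, and an einsum.

  import jax.numpy as jnp
  import numpyro
  import numpyro.distributions as dist

  def model(X, group, n_groups, y=None):
      # X: (N, D) features; group: (N,) integer group ids in [0, n_groups)
      D = X.shape[1]
      mu = numpyro.sample("mu", dist.Normal(0.0, 1.0).expand([D]).to_event(1))
      sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
      with numpyro.plate("groups", n_groups):
          # per-group coefficient vectors, partially pooled toward mu
          beta = numpyro.sample("beta", dist.Normal(mu, 1.0).to_event(1))
      # per-row dot product with that row's group coefficients
      pred = jnp.einsum("nd,nd->n", X, beta[group])
      with numpyro.plate("data", X.shape[0]):
          numpyro.sample("obs", dist.Normal(pred, sigma), obs=y)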

For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny, F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have critical correctness bugs with far-reaching consequences if deployed in production.

Right now, I am not comfortable treating an LLM as anything other than a very useful information retrieval system with excellent semantic capabilities.

[1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop


Human-written compilers have bugs too! It takes decades of use to iron them out, and we’re introducing new ones all the time.

Fine article but a very important fact comes in at the end — the author has a human personal assistant. It doesn't fundamentally change anything they wrote, but it shows how far out of the ordinary this person is. They were a Thiel Fellow in 2020 and graduated from Phillips Exeter, roughly the most elite high school in the US.

The screenshots of price checks for a hotel charging $850 a night are what tipped me off. The reservations at expensive bay area restaurants, too.

I have a guess for why this guy is comfortable letting clawdbot go hog-wild on his bank account.


Yeah, I've found AI 'miracle' use-cases like these are most obvious for wealthy people who stopped doing things for themselves at some point.

Typing 'Find me reservations at X restaurant' and getting unformatted text back is way worse than just going to OpenTable and seeing a UI that has been honed for decades.

If your old process was texting a human to do the same thing, I can see how Clawdbot seems like a revolution though.

Same goes for executives who vibecode in-house CRM/ERP/etc. tools.

We all learned the lesson that mass-market IT tools almost always outperform in-house, even with strong in-house development teams, but now that the executive is 'the creator,' there's significantly less scrutiny on things like compatibility and security.

There's plenty real about AI, particularly as it relates to coding and information retrieval, but I have yet to see an agent actually do something that even remotely feels like the result of deep and savvy reasoning (the precursor to AGI), including all the examples in this post.


> Typing 'Find me reservations at X restaurant' and getting unformatted text back is way worse than just going to OpenTable and seeing a UI that has been honed for decades.

You're conflating the example with the opportunity:

"Cancel Service XXX" where the service is riddled with dark patterns. Giving every one an "assistant" that can do this is a game changer. This is why a lot of people who aren't that deep in tech think open claw is interesting.

> We all learned the lesson that mass-market IT tools almost always outperform in-house

Do they? Because I know a lot of people who have (as an example) terrible setups with Salesforce that they have to use.


I feel bad for whoever gets an oncall page that some executive's vibe coded app stopped working and needs to be fixed ASAP.

> We all learned the lesson that mass-market IT tools almost always outperform in-house,

Funny, I learned the exact opposite lesson. Almost all software sucks, and a good way for it not to suck is to know where the developer is and go tell them their shit is broken, in person.

If you want a large-scale example, one of the two main law enforcement agencies in France spun off LibreOffice into their own legal-writing software. Developed by LEOs who can take up to two weeks a year to work on it. Awesome software. Would cost literally millions if bought on the market.


Kind of funny to say you helped make the Harvard CS curriculum and then dropped out. Your own curriculum was not good enough for you? Probably extenuating circumstances, but still seems funny.

When I saw them buying $80 Arc'teryx gloves that was enough for me.

Exeter had a hella good policy debate team back in the day. Probably still do; I've been out of the loop for a while.

Sure, but that also means they’re well-positioned to do a comparison.

Elites live in a different world from you and me.

You do understand that is who you’re competing with now, right?

My daughter is an excellent student in high school

She and I spoke last night and she is increasingly pissed off that people who are in her classes, who don’t do the work, and don’t understand the material get all A’s because they’re using some form of GPT to do their assignments, and teachers cannot keep up

I do not see a future where you can “come from behind”, because the people with resources increasingly will not need to hire experts (people who need money to survive) to do whatever they want to do

While that was technically true for the last few hundred years, you at least had to deal with other humans and maintain at least a veneer of communal engagement to get anything done

That requirement is now gone, and within the next decade I anticipate a single person will be able to build an extremely profitable software company with only two or three human employees


Ironically I feel like this may force schools to get better at the core mission of teaching, vs. credentialing people for the next rung on the ladder. What replaces that second function remains to be seen.

I think it actually will just make school even less relevant

>She and I spoke last night and she is increasingly pissed off that people who are in her classes, who don’t do the work, and don’t understand the material get all A’s because they’re using some form of GPT to do their assignments, and teachers cannot keep up

How do they do well on tests, then?

Surely the most they could get away with is homework and take-home writing assignments. Those are only a fraction of your grade, especially at “excellent” high schools.


>You do understand that is who you’re competing with now right?

No. I'm competing with no one.


You may think you are not competing. The people whose money you may want (employers, investors, customers) definitely see you as one of many competitors for their funds.

Exactly

Wrong way to look at it.

Generally there are two types of human intelligence: simulation and pattern lookup (technically, simulation still relies on pattern lookup, but at a much lower level).

Pattern lookup is basically what LLMs do. Humans memorize maps of tasks->solutions and statistically interpolate that knowledge to do a particular task. This works well enough for the vast majority of people, and this is why LLMs are seen as a big help, since they effectively increase your pattern-lookup capacity.

Simulation type intelligence is able to break a task down into core components, understand how the components interact, and predict outcomes into the future, without prior knowledge.

For example, assume a task of cleaning the house:

Pattern lookup would rely on learned experience taught by parents, as well as experience cleaning the house, to perform an action. You would probably use a duster plus a generic cleaner to wipe surfaces, and vacuum the floors.

Simulation type intelligence would understand how much dirt and dust there is and how it behaves. For example, instead of a duster, one would realize that you can use a wet towel to gather dust, without ever having seen this done before.

Here is the kicker - pattern type intelligence is actually much harder to attain, because it requires really good memorization, which is pretty much genetic.

Simulation type intelligence is actually attainable by anyone: it requires a much smaller subset of patterns to memorize. The key factor is changing how you think about the world, which requires realigning your values. If you start to value low-level understanding, you naturally develop this intelligence.

For example, what would it take for you to completely take your car apart, figure out how every component works, and put it back together? A lot of you have garages, money to spend on a cheap car, and the tools, so doing this in your spare time is practical. It will give you the ability to buy an older used car, do all the maintenance and repairs on it yourself, and have something that works well for a lower price, while also giving you a monetizable skill.

Furthermore, LLMs can't reason by simulation. You can get close with agentic frameworks, but all of those are manually coded and have limits, and we aren't close to figuring out a generic framework for an agent that can do things like look up information, run internal models of how things would work, and so on.

So finally, when it comes to competing: if you choose to stick to pattern-based intelligence and you lose your job to someone who can use LLMs better, that's your fault.


At the longest timescale humans aren’t the best at either

I have yet to see a compelling argument demonstrating that humans have some special capabilities that could never be replaced


Sure. It's not a hardware problem, though, but an algorithm problem. And training something that behaves like a human can't be done with backpropagation the way it's currently implemented. You basically have to figure out how to train neural nets that not only operate in parallel with scheduling, but can also iterate on their own architecture.

The use of the name Codex and the focus on diffs and worktrees suggests this is still more dev-focused than Cowork.

It's a smart move – while Codex has the same aspirations, limiting it to savvy power users will likely lead to better feedback, and less catastrophic misuse.

Because they are too slow and not smart enough.

Hard to think of a worse name. Maybe Moistbot?

Sort of. It’s not necessarily a single call. In the general case it would be spinning up a long-running agent with various kinds of configuration — prompts, but also coding environment and which tools are available to it — like subagents in Claude Code.
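
Very roughly, something like this (hypothetical names for illustration, not any real API):

  from dataclasses import dataclass, field

  # Hypothetical sketch of the configuration surface for spinning up a
  # long-running subagent: a prompt, an execution environment, and an
  # explicit allowlist of tools. Illustrative names only, not a real API.
  @dataclass
  class SubagentConfig:
      name: str
      system_prompt: str
      workdir: str
      allowed_tools: list[str] = field(default_factory=list)
      max_turns: int = 50

  reviewer = SubagentConfig(
      name="code-reviewer",
      system_prompt="Review diffs for correctness and style; do not edit files.",
      workdir="/repo",
      allowed_tools=["read_file", "grep", "run_tests"],
  )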
