joshuahedlund · 2026-02-08T15:17:48 1770563868

It has, tho the rate of new record highs has been shrinking from peak to peak: 10x > 3x > 1.5x

joshuahedlund · 2026-02-05T18:30:36 1770316236

Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.

Snuggly73 · 2026-02-05T18:53:23 1770317603

it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but it's not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

joshuahedlund · 2026-01-30T21:32:00 1769808720

Scott Alexander blogged about it today: https://www.astralcodexten.com/p/best-of-moltbook

joshuahedlund · 2025-12-28T01:29:16 1766885356

> If your job is to translate requirements into code manually - and that's it - you're the generalist travel agent.

I’ve been a full-stack web programmer at five different companies over the last fifteen years, big and small, e-commerce and B2B, junior to senior to staff, and that has never fully described my responsibilities.

wahnfrieden · 2025-12-28T01:36:56 1766885816

Which responsibilities do you figure are a combination of highly valuable in your role, and resistant to automation?

esafak · 2025-12-28T05:17:00 1766899020

Knowing what to implement, and having the social skills to perform various tasks in a company?

joshuahedlund · 2025-12-11T20:33:55 1765485235

I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.

joshuahedlund · 2025-11-30T03:25:35 1764473135

> most normal people don't know what Claude or Gemini are

“Google Gemini” is the No. 2 ranked app in the Apple App Store (behind ChatGPT) and has been for some time.

joshuahedlund · 2025-10-15T20:58:32 1760561912

https://en.wikipedia.org/wiki/Goodhart%27s_law "When a measure becomes a target, it ceases to be a good measure"

I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.

joshuahedlund · 2025-10-12T21:20:46 1760304046

Won’t those models gradually become outdated (for anything related to events that happen after the model was trained, new code languages or framework versions, etc) if no one is around to continually re-train them?

jay_kyburz · 2025-10-12T23:36:41 1760312201

They should be fine for things that don't change. (which is a lot of stuff)

If you are feeding the LLM a report, and asking it for a summary, it doesn't need the latest updates from Wikipedia or Reddit.

joshuahedlund · 2025-10-07T21:31:08 1759872668

How about denying the Fourth Amendment rights of US citizens to be secure in their homes in the recent Chicago apartment raid? https://www.notesfromthecircus.com/p/the-sufferable-evil

How about detaining US citizens without warrants for days at a time and then releasing with no charges? https://www.theatlantic.com/politics/archive/2025/09/george-...

cosmicgadget · 2025-10-08T03:14:40 1759893280

Let's hope the people not subject to a warrant sue ICE's pants off. As far as I can tell, most of the dragnets are either in public places or with the permission of the property owner.

I'd love to be wrong because it means the judiciary has a chance to shut this down, but I fear that outside of a few civil rights suits this will have to be remedied at the ballot box.

habinero · 2025-10-08T13:12:21 1759929141

It's essentially impossible to sue ICE; they have qualified immunity. Police have been doing similar things for years.

cosmicgadget · 2025-10-08T14:16:50 1759933010

The officers do, not the agency or police department. People sue and win against police departments all the time.

joshuahedlund · 2025-10-04T12:37:44 1759581464

There is a hard limit on the number of atomic elements, and an even smaller limit on the number of soluble compounds that facilitate chemical reactions, and water is demonstrably both the best and the most common in the universe.

So while it may be possible for life to exist without water, any alternatives should reasonably be expected to be even rarer than water-based life.