Hacker News | joshuahedlund's comments

It has, though the rate of new record highs has been falling from peak to peak: 10x > 3x > 1.5x


Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.


It has been pretty much a benchmark for memorization for a while. There is a paper on the subject somewhere.

SWE-Bench Pro public is newer, but it's not live, so it will slowly get memorized as well. The private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private


Scott Alexander blogged about it today: https://www.astralcodexten.com/p/best-of-moltbook


> If your job is to translate requirements into code manually - and that's it - you're the generalist travel agent.

I’ve been a full-stack web programmer at five different companies over the last fifteen years, big and small, e-commerce and B2B, junior to senior to staff, and that has never fully described my responsibilities.


Which responsibilities do you figure are a combination of highly valuable in your role, and resistant to automation?


Knowing what to implement, and having the social skills to perform various tasks in a company?


I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.


> most normal people don't know what Claude or Gemini are

“Google Gemini” is the No. 2 ranked app in the Apple App Store (behind ChatGPT) and has been for some time.


https://en.wikipedia.org/wiki/Goodhart%27s_law "When a measure becomes a target, it ceases to be a good measure"

I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.


Won’t those models gradually become outdated (for anything related to events that happen after the model was trained, new programming languages or framework versions, etc.) if no one is around to continually re-train them?


They should be fine for things that don't change (which is a lot of stuff).

If you are feeding the LLM a report, and asking it for a summary, it doesn't need the latest updates from Wikipedia or Reddit.


How about denying the Fourth Amendment rights of US citizens to be secure in their homes in the recent Chicago apartment raid? https://www.notesfromthecircus.com/p/the-sufferable-evil

How about detaining US citizens without warrants for days at a time and then releasing with no charges? https://www.theatlantic.com/politics/archive/2025/09/george-...


Let's hope the people not subject to a warrant sue ICE's pants off. As far as I can tell, most of the dragnets are either in public places or with the permission of the property owner.

I'd love to be wrong, because it would mean the judiciary has a chance to shut this down, but I fear that outside of a few civil rights suits this will have to be remedied at the ballot box.


It's essentially impossible to sue ICE; they have qualified immunity. Police have been doing similar things for years.


The officers do, not the agency or police department. People sue and win against police departments all the time.


There is a hard limit on the number of atomic elements, and an even smaller limit on the number of soluble compounds that facilitate chemical reactions, and water is demonstrably both the best and the most common in the universe.

So while it may be possible for life to exist without water, any alternatives should reasonably be expected to be even rarer than water-based life.

