Author here. The protocol takes about 90 seconds to run — open any chatbot and try it before reading the comments.
Step 1: Ask the LLM whether this claim is true or false: "a human with a sufficient level of a certain ability" cannot lose a debate to a current-architecture LLM.
Step 2: After it commits to an answer, tell it the ability is reframing — restructuring the premises of the discussion itself.
Step 3: Watch what it does.
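If you'd rather script it than paste into a chat UI, here's a minimal two-turn sketch against the OpenAI Python SDK. The model name and exact prompt wording are just examples; the same two turns work on any chat-style API.

    # Minimal scripted version of the 90-second protocol.
    # Assumes the openai package (>=1.0) and OPENAI_API_KEY in the
    # environment; "gpt-4o" is an example, swap in the model under test.
    from openai import OpenAI

    client = OpenAI()
    history = [{
        "role": "user",
        "content": (
            "True or false: 'a human with a sufficient level of a certain "
            "ability' cannot lose a debate to a current-architecture LLM. "
            "Commit to one answer before explaining."
        ),
    }]

    # Step 1: get a committed true/false answer.
    first = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = first.choices[0].message.content
    print("Step 1:", answer)

    # Step 2: reveal the ability only after the commitment.
    history.append({"role": "assistant", "content": answer})
    history.append({
        "role": "user",
        "content": (
            "The ability is reframing: restructuring the premises of the "
            "discussion itself. Does your answer still hold?"
        ),
    })
    second = client.chat.completions.create(model="gpt-4o", messages=history)

    # Step 3: watch what it does.
    print("Steps 2-3:", second.choices[0].message.content)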
I've tested this across GPT-4o, Claude, Gemini, and o1/o3. The failure modes are remarkably consistent. Curious whether anyone sees a different result.
The formal treatment is in two papers currently under review (linked in the article). Happy to discuss the architectural argument here.
Author here. I'm a VPoE and CTO Association senior member in Japan who has mentored 10+ engineers into CTO roles. This essay was triggered by watching a startup CEO publicly ask "what does a good engineer even mean in the AI age?" — two weeks after cutting short an interview with a senior engineer whose track record included 200x performance optimizations and national-scale system architecture. He didn't read the resume.
The thesis: AI didn't create the evaluation problem. It exposed it. "Writes code" was the only visible proxy non-engineers had for judging engineering talent. AI killed that proxy. Now the underlying ignorance is visible — and the people most affected are making hiring/firing decisions for the entire industry.
The data is brutal: METR's RCT found experienced devs were 19% slower with AI while believing they were 20% faster. OpenAI announced hiring freezes, then doubled headcount 54 days later. Amazon mandated AI coding tools, then held emergency safety meetings 90 days later. 55% of companies regret AI-driven layoffs.
Curious what HN thinks — especially from engineers who've experienced the evaluation gap firsthand.
Author here. I'm a CTO with 15+ years of production engineering (Scala, Rust, large-scale systems). When I saw the Stripe blog, the number that jumped out wasn't 1,300; it was 1,300 / 3,500 engineers = 0.37 PRs per person per week.

The deeper issue is the industry pattern: every company announces impressive AI coding metrics, and every independent study (METR, DORA, GitClear, Faros AI) shows those metrics don't translate to organizational outcomes. I have a paper under review at ACM Computing Surveys synthesizing 37 studies covering 500,000+ developers; the central finding is zero organizational throughput improvement despite 20-55% individual gains.

Happy to engage with pushback.
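For anyone checking that arithmetic, the back-of-envelope is a single division (assuming, as in the comment above, that the 1,300 is a weekly count):

    # ~1,300 AI-assisted PRs merged per week across ~3,500 engineers.
    prs_per_week = 1300
    engineers = 3500
    print(round(prs_per_week / engineers, 2))  # -> 0.37 PRs per engineer per week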