At first I thought it was a brain slip in the HN title, then I saw TFA also said "...

HarHarVeryFunny · 2026-02-27T13:20:25

It would also be interesting to see how well the best open-weights models, such as Kimi K2.5, do on a task like this with the same prompting (first gathering specs, and so on).

In fact, this would make for an interesting benchmark: writing entire non-trivial apps from the same prompt. Each model might be expected to write and use its own test cases, but all of them could then be judged against a common set of test cases provided as part of the benchmark suite.
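To make the judging step concrete, here is a rough sketch of how the common test set might score each submission. Everything in it is an assumption for illustration: the "apps" are stand-in Python callables rather than real generated projects, and `score`, `rank`, `model_a` and `model_b` are hypothetical names, not part of any existing benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a real harness would build and run each generated
# app; here an "app" is just a function from input string to output string.
App = Callable[[str], str]
TestCase = Callable[[App], bool]


def score(app: App, tests: List[TestCase]) -> float:
    """Return the fraction of shared test cases the app passes."""
    passed = 0
    for test in tests:
        try:
            if test(app):
                passed += 1
        except Exception:
            # A crash inside the app counts as a failed test.
            pass
    return passed / len(tests) if tests else 0.0


def rank(apps: Dict[str, App], tests: List[TestCase]) -> List[Tuple[str, float]]:
    """Score every app on the same tests and sort best-first."""
    scores = {name: score(app, tests) for name, app in apps.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def app_a(text: str) -> str:
    # Toy submission that handles every input.
    return text.upper()


def app_b(text: str) -> str:
    # Toy submission with a bug: it rejects empty input.
    if not text:
        raise ValueError("empty input")
    return text.upper()


# The common test set, identical for every model.
shared_tests: List[TestCase] = [
    lambda app: app("abc") == "ABC",
    lambda app: app("") == "",
]

apps = {"model_a": app_a, "model_b": app_b}
ranking = rank(apps, shared_tests)
print(ranking)
```

The point of the design is that each model's own tests never enter the score; only the shared suite does, so the results stay comparable across models.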