> Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively.
Sure, "collaboratively." Why would I ever trust a vibe-coded analysis? How do I, a non-expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that fools even experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.
Actually… no. Now that you mention it, and thanks for the interesting thought, the failure modes seem pretty similar to me.
Shoddy research / hallucination, tendency to lose the thread, lack of historical / background context… the failure modes are at least qualitatively similar.
Show me an LLM failure and I'll show you a high-profile journalist busted for the same thing. And those are humans whose whole job is getting these things right!
Humans as a class are error-prone, but some humans in their respective fields are very, very good. It's often not terribly hard to figure out from resume and credentials who these folks are, and as a shortcut we can look for markers (terminology, specifics, confidence) when the stakes are low, like deciding what to read rather than cancer care for your mom.
AI can trip all the right signals to fool these shortcuts while sometimes being entirely full of shit, and it has no resume or credentials to verify should we want to check.
If you have such credentials and vouch for the work, I can consider your trustworthiness rather than the AI's. If you admit you yourself are reliant on it, then this no longer holds.
Humans also write endless amounts of convincing bullshit, and have since time immemorial. False papers and faked results were a growing scourge in academia before LLMs were a thing, and that's just the intentional fraud; the reproducibility crisis in science, especially medical and psychological science, affects even the best-designed and most well-intentioned studies.
Humans also make mistakes and assumptions while reverse engineering, so the results will always need more engineers to go through them and test things.
Benchmarks are all in part 2. Training progress is in part 3 (upcoming).
Also, I think AI-human collaboration is important for goal management.
Sure, LLMs bullshit all the time, but it's the role of the human to set good goals and the gating criteria for what counts as good.