
This model appears to be full of surprises.

The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.

It also appears to be topping various benchmarks: ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.

Shameless plug: I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above, we added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released.

[0] https://huggingface.co/spaces/allenai/ZeroEval

[1] https://crux-eval.github.io/

[2] https://double.bot/



> ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.

The previous version of 4o also beat 3.5 Sonnet on CRUX.


Which is a good hint that the benchmark sucks. There's no way 4o beats Sonnet 3.5.


Sonnet 3.5 has a lot of alignment issues. It often refused to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed. On math, though, Sonnet is a bit better than 4o.


I think they have secretly released something that is better than 4. In our internal benchmarks, 4o mini is also performing better than 4o.


The weirdest part is that, according to these benchmarks, the best model overall is supposed to be Gemini Pro.



