
This model appears to be full of surprises.

The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.

It also appears to be topping various benchmarks: ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.

Shameless plug: I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above, we added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released.

[0] https://huggingface.co/spaces/allenai/ZeroEval

[1] https://crux-eval.github.io/

[2] https://double.bot/



> ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.

The previous version of 4o also beat 3.5 Sonnet on CRUX.


Which is a good hint that the benchmark sucks. There's no way 4o beats Sonnet 3.5.


Sonnet 3.5 has a lot of alignment issues. It often refused to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed. On math, though, Sonnet is a bit better than 4o.


I think they have secretly released something that is better than 4. In our internal benchmarks, 4o mini is also performing better than 4o.


The weirdest part is that, according to these benchmarks, the best model overall is supposed to be Gemini Pro.



