swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and there's a slider that lets you see scores for only the issues filed in a given time window. So if you wait ~2 months, you can see how the models perform on real-world issues that are new to them.
It's still not as accurate as benchmarking on your own workflows, but it's better than the original benchmark, or any other public benchmark.