Possible underestimated pass@3 results #45
Comments
Thank you for this detailed report!
You're right. Good catch. That explains why it looked like some of the benchmark instances failed in my benchmark. Now it covers all the cases you resolved.
This looks great! The dockerized evaluation environment would be very useful for obtaining consistent evaluation results. We will also try out your docker-based evaluation on our machines :)
I have evaluated your predictions using my Docker-based swe-bench evaluator. I get 26% on pass@3, compared to the 22% you reported. It might be worthwhile to review the logs for the failed benchmarks to see if your agent can actually achieve even better results :D
You can find the logs and report here
And here's a sheet I use to compare the results.
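For anyone who wants to do the same comparison in code rather than in a sheet, here is a minimal sketch. It assumes pass@3 means "an instance counts as resolved if at least one of its three attempts resolved it", which may not match exactly how either evaluator computes it. The function names, the dict layout, and the toy instance IDs are all hypothetical and are not taken from the actual reports.

```python
from typing import Dict, List

# Hypothetical layout: instance ID -> one boolean per attempt (True = resolved).
Results = Dict[str, List[bool]]


def pass_at_k(results: Results) -> float:
    """Fraction of instances resolved by at least one attempt."""
    if not results:
        return 0.0
    solved = sum(1 for attempts in results.values() if any(attempts))
    return solved / len(results)


def newly_resolved(reported: Results, reevaluated: Results) -> List[str]:
    """Instances resolved in the re-evaluation but not in the reported run."""
    return sorted(
        instance
        for instance, attempts in reevaluated.items()
        if any(attempts) and not any(reported.get(instance, []))
    )


# Toy data only, made up for illustration (not real benchmark results).
reported = {
    "a": [False, False, False],
    "b": [True, False, False],
    "c": [False, False, False],
    "d": [False, False, False],
}
reevaluated = {
    "a": [False, True, False],
    "b": [True, False, False],
    "c": [False, False, False],
    "d": [False, False, False],
}

print(pass_at_k(reported))      # 0.25
print(pass_at_k(reevaluated))   # 0.5
print(newly_resolved(reported, reevaluated))  # ['a']
```

The instances returned by `newly_resolved` are the ones whose logs are worth checking first, since they are where the two evaluations disagree.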