Possible underestimated pass@3 results #45

Open
aorwall opened this issue May 12, 2024 · 3 comments

aorwall commented May 12, 2024

I evaluated your predictions with my Docker-based SWE-bench evaluator and got 26% on pass@3, compared to the 22% you reported. It might be worth reviewing the logs for the instances marked as failed to see whether your agent actually achieves even better results :D

You can find the logs and report here

And here's a sheet I use to compare the results.
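
For readers comparing the two figures, here is a minimal sketch of how a pass@3 score of this kind could be computed. It assumes pass@3 counts an instance as resolved if at least one of the three runs resolved it. The instance IDs, run contents, and the `pass_at_k` helper are hypothetical and are not taken from the logs or the evaluator.

```python
def pass_at_k(runs, total_instances):
    """Fraction of instances resolved by at least one of the given runs.

    Assumption: each run is a set of resolved instance IDs, and an instance
    counts toward pass@k if it appears in any of the k runs.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    resolved_any = set().union(*runs)
    return len(resolved_any) / total_instances


# Hypothetical resolved sets for three runs over a 10-instance benchmark.
run_1 = {"inst-a", "inst-b"}
run_2 = {"inst-b", "inst-c"}
run_3 = {"inst-d"}

score = pass_at_k([run_1, run_2, run_3], total_instances=10)
print(f"pass@3 = {score:.0%}")
```

Because the score is a union over runs, mislabeling which run is which does not change pass@3 itself, but it does change any per-run comparison.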

aorwall changed the title from "Possible Underestimated pass@3 Results" to "Possible underestimated pass@3 results" on May 12, 2024
@zhiyufan
Collaborator

Thank you for this detailed report!
I quickly went through the comparison sheet you shared and noticed that many acr-2 and acr-3 results are reversed between the two evaluation environments. Maybe the acr-2 and acr-3 results were mixed up by mistake?
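
A mix-up like this can be checked mechanically. The sketch below assumes each evaluation environment yields a mapping from run label to the set of resolved instance IDs, and finds the label pairing that maximizes overlap. The `acr-1` label, the instance IDs, and both helper functions are hypothetical and not part of either evaluator.

```python
from itertools import permutations


def jaccard(a, b):
    """Overlap between two sets of resolved instance IDs (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_label_matching(reported, reevaluated):
    """Map each reported run label to the re-evaluated label it overlaps most.

    Assumption: both dicts use the same set of run labels. All pairings are
    tried, which is fine for a handful of runs. Ties favor the identity map.
    """
    labels = sorted(reported)
    best = max(
        permutations(labels),
        key=lambda perm: sum(
            jaccard(reported[l], reevaluated[p]) for l, p in zip(labels, perm)
        ),
    )
    return dict(zip(labels, best))


# Hypothetical data in which the acr-2 and acr-3 results were swapped.
reported = {"acr-1": {"i1", "i2"}, "acr-2": {"i3", "i4"}, "acr-3": {"i5"}}
reevaluated = {"acr-1": {"i1", "i2"}, "acr-2": {"i5"}, "acr-3": {"i3", "i4", "i6"}}

matching = best_label_matching(reported, reevaluated)
swapped = {label: other for label, other in matching.items() if label != other}
print(swapped)
```

Any label that maps to a different label in the result is a candidate for having been swapped.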

@aorwall
Author

aorwall commented May 13, 2024

You're right. Good catch. That explains why it looked like some of the benchmark instances failed in my evaluation. Now it covers all the cases you resolved.

@yuntongzhang
Collaborator

This looks great! The Dockerized evaluation environment would be very useful for obtaining consistent evaluation results. We will also try out your Docker-based evaluation on our machines :)
