Possible underestimated pass@3 results #45

aorwall · 2024-05-12T18:38:12Z

I evaluated your predictions with my Docker-based SWE-bench evaluator and got 26% on pass@3, compared to the 22% you reported. It might be worth reviewing the logs for the failed benchmark instances to see if your agent actually achieves even better results :D

You can find the logs and report here.

And here's a sheet I use to compare the results.
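
For readers comparing the two numbers, here is a minimal sketch of how a pass@3 figure like the ones above could be computed. It assumes pass@3 means "an instance counts as resolved if at least one of the three runs resolves it", which the thread does not spell out. The run labels, instance IDs, and function names are hypothetical and are not taken from either evaluator.

```python
from typing import Dict, Set


def pass_at_k_union(resolved_per_run: Dict[str, Set[str]], total_instances: int) -> float:
    """Fraction of instances resolved by at least one run (assumed pass@k definition)."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    # Union of the resolved instance IDs over all runs.
    resolved_any = set().union(*resolved_per_run.values())
    return len(resolved_any) / total_instances


def only_in_first(first: Dict[str, Set[str]], second: Dict[str, Set[str]]) -> Set[str]:
    """Instances resolved (in any run) under the first evaluation but not the second."""
    return set().union(*first.values()) - set().union(*second.values())


# Hypothetical data: three runs over a 10-instance benchmark.
docker_eval = {"acr-1": {"i1", "i2"}, "acr-2": {"i2", "i3"}, "acr-3": {"i4"}}
reported_eval = {"acr-1": {"i1", "i2"}, "acr-2": {"i2"}, "acr-3": {"i4"}}

docker_score = pass_at_k_union(docker_eval, total_instances=10)
reported_score = pass_at_k_union(reported_eval, total_instances=10)
extra_resolved = only_in_first(docker_eval, reported_eval)
```

Under this definition, the instances in `extra_resolved` are the ones whose logs are worth checking first, since they account for the gap between the two scores.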

zhiyufan · 2024-05-13T04:51:28Z

Thank you for this detailed report!
I quickly went through the comparison sheet you shared and noticed that many of the acr-2/3 results are opposite between the two evaluation environments. Maybe the acr-2 and acr-3 results got mixed up by mistake?
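
A mix-up like this could also be checked mechanically. The sketch below is a hypothetical helper, not part of either evaluation setup. It assumes each evaluation gives a set of resolved instance IDs per run label and that both evaluations use the same number of labels. It tries every pairing of labels and keeps the one with the highest total Jaccard overlap. If the best pairing is not the identity, the labels were probably swapped.

```python
from itertools import permutations
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_label_mapping(eval_a: Dict[str, Set[str]], eval_b: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each run label in eval_a to the eval_b label whose results match best overall."""
    if len(eval_a) != len(eval_b):
        raise ValueError("both evaluations must have the same number of runs")
    labels_a = sorted(eval_a)
    best_mapping: Dict[str, str] = {}
    best_score = -1.0
    # Brute force over all pairings; fine for a handful of runs.
    for perm in permutations(sorted(eval_b)):
        score = sum(jaccard(eval_a[x], eval_b[y]) for x, y in zip(labels_a, perm))
        if score > best_score:
            best_score = score
            best_mapping = dict(zip(labels_a, perm))
    return best_mapping


# Hypothetical data in which acr-2 and acr-3 are swapped in the second evaluation.
reported = {"acr-1": {"i1", "i2"}, "acr-2": {"i3"}, "acr-3": {"i4", "i5"}}
reevaluated = {"acr-1": {"i1", "i2"}, "acr-2": {"i4", "i5"}, "acr-3": {"i3"}}

mapping = best_label_mapping(reported, reevaluated)
swapped = {a: b for a, b in mapping.items() if a != b}
```

The brute-force search is deliberate: with only three runs there are six pairings to try, so nothing more elaborate is needed.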

aorwall · 2024-05-13T05:08:04Z

You're right. Good catch. That explains why it looked like some of the benchmark instances failed in my evaluation. Now my results cover all the cases you resolved.

yuntongzhang · 2024-05-13T12:34:15Z

This looks great! The dockerized evaluation environment would be very useful for obtaining consistent evaluation results. We will also try out your Docker-based evaluation on our machines :)

aorwall changed the title ~~Possible Underestimated pass@3 Results~~ Possible underestimated pass@3 results May 12, 2024

yuntongzhang mentioned this issue May 30, 2024

Submission for AutoCodeRover-v20240408 swe-bench/experiments#11

Merged

This was referenced Jun 24, 2024

Run evaluation on full SWE-Bench All-Hands-AI/OpenHands#1693

Open

Implement new agent using AutoCodeRover's approach All-Hands-AI/OpenHands#942

Closed
