Run evaluation on full SWE-Bench #1693
Comments
Thanks for pointing that out! Not sure if it was a typo, or if we were using an old result of theirs. Let's remove the graph until we can generate a better one.
This number is from the most recent version of the AutoCodeRover paper! I think we should clarify this in the graph.
Note that ACR-avg is the comparable number here, as it's the average of 3 runs (i.e., the pass@1 rate). The 22.33% number in their repo is ACR-all, which is the union of the 3 runs (i.e., the pass@3 rate). So I think the comparison is still valid: for pass@1, ACR is at 16%.
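For anyone skimming the thread, here is a minimal sketch of the pass@1 vs. pass@3 distinction (the instance IDs and counts below are made up for illustration, not real SWE-bench results):

```python
# Illustrative only: sets of instance IDs resolved in each of three independent runs.
runs = [
    {"astropy-1", "django-3", "flask-7"},
    {"django-3", "flask-7"},
    {"astropy-1", "django-3", "sympy-9", "flask-7"},
]
total_instances = 20  # size of the hypothetical benchmark subset

# pass@1: average per-run resolve rate (what ACR-avg reports)
pass_at_1 = sum(len(r) for r in runs) / (len(runs) * total_instances)

# pass@3: fraction of instances resolved in at least one of the 3 runs (what ACR-all reports)
pass_at_3 = len(set.union(*runs)) / total_instances

print(f"pass@1 = {pass_at_1:.1%}, pass@3 = {pass_at_3:.1%}")  # pass@3 >= pass@1 by construction
```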
Ahh, thank you for the clarification! I will remove my PR.
So I think the AutoCodeRover numbers are fine as-is, but I agree we should still run on all of SWE-Bench. The main bottleneck for this is time and cost: running on all of SWE-Bench with GPT-4 costs about $6,000.
@rezzie-rich Thank you for the question! As @frankxu2004 clarified, we only report the pass@1 results in the graph. Our evaluation containerization only supports SWE-bench Lite for now, and we will extend it to support the full test set!
GPT-4 is expensive. I think it would be cool if you guys could run the full benchmark using Llama 3 70B and 8B, as it would give a unique and realistic expectation of what running with an open LLM looks like. It's hard to compare SWE-bench with humans, but as a rule of thumb, an average junior developer should be able to complete 10-25% while an average senior developer can complete 20-40%. If OpenDevin can complete 25%+ using an open LLM (preferably with fewer than 34B parameters), it's a game changer!
My thoughts: the leaderboard space for open code LLMs is kind of diluted, but I found this up-to-date leaderboard that seems pretty legit: https://evalplus.github.io/leaderboard.html. It has CodeQwen-1.5-7B-Chat listed above Claude-3-Opus, right next to GPT-4. A small LLM like this should be able to run the benchmark faster and a lot cheaper compared to GPT-4. If the leaderboard is accurate, it makes CodeQwen a valid replacement for GPT-4. If OpenDevin can complete 20-25% of the full SWE-bench using a 7B model, that would prove the practicality and real-world use case of AI agents in software development.
On https://chat.lmsys.org/?leaderboard, Llama-3-70B-Instruct is performing better than half of the GPT-4 versions. I think it would be great to have benchmarks done using Llama 3 in the spirit of the open-source community while keeping the usage practical. I know quantized models degrade in performance; however, Q8 models are almost indistinguishable from FP16, and a modern high-performance CPU with 128 GB of RAM can easily handle it while keeping costs relatively low.
@rezzie-rich Good point. However, Llama-3 only has an 8k context window, which means it is hardly useful in our agent use cases. I just tested the recent DeepSeek-V2 MoE; check the results here: https://huggingface.co/spaces/OpenDevin/evaluation. It got ~5% on SWE-bench Lite, and from what I can tell qualitatively, a lot of the error cases (~70%) are due to the limited context window (32k) of their API. I can only imagine this being way worse on Llama-3 due to its 8k window.
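As a back-of-the-envelope illustration of why a small window hurts agent runs: you have to check, before every call, whether the accumulated history still fits. This is a generic sketch, not how OpenDevin actually manages context; the model names and limits are assumptions taken from the numbers above, and cl100k_base is a GPT-4 tokenizer, so counts are only approximate for Llama or DeepSeek.

```python
import tiktoken

# Assumed context limits (tokens); the 8k and 32k figures come from the discussion above.
CONTEXT_WINDOWS = {"llama-3-70b": 8_192, "deepseek-v2-api": 32_768, "gpt-4-turbo": 128_000}
enc = tiktoken.get_encoding("cl100k_base")

def fits(history: list[str], model: str, reserve_for_output: int = 1_024) -> bool:
    """Return True if the concatenated agent history still leaves room for a response."""
    used = sum(len(enc.encode(msg)) for msg in history)
    return used + reserve_for_output <= CONTEXT_WINDOWS[model]
```

An agent trajectory on SWE-bench easily accumulates tens of thousands of tokens of file contents and tool output, so an 8k budget is exhausted after a handful of steps.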
This version has a million-token context window: https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k. Btw, LOVE the new Hugging Face space!
@rezzie-rich Thanks a ton for sharing!!! Will try to get some GPUs and test it right away!!!
Seconding this, though not just the model switching that @rezzie-rich suggests (great idea BTW if it can be included in Ollama or some other tool): are there any alternative benchmarks for seeing how well these agents can solve competitive coding problems (or data science problems), to confirm quality beyond the bare LLM? Maybe mixing big and small LLMs (e.g. a Qwen + LLaMA combo) for added acceleration?
Hi, I'm late to the discussion, but I would like to give an update on the pass@1 score in the original AutoCodeRover paper. It turns out that the SWE-bench evaluation environment used in our original experiments gave underestimated scores due to missing system-level dependencies: some correct patches were deemed wrong after running the SWE-bench acceptance tests in that environment. Thanks to the SWE-bench-docker project, our original patches were re-evaluated, and the actual pass@1 score is 19% instead of 16%. More details can be found here. The 19% pass@1 score is also reflected on the SWE-bench leaderboard.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Shouldn't be stale IMO
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Some updates: we are making progress on the infrastructure side; hopefully, we can resolve this in ~2 weeks! Running OpenHands with 2,000 Docker containers efficiently is not an easy task 😓
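For anyone wondering where the difficulty is: a big part of it is bounding concurrency so thousands of per-instance containers don't overwhelm a single host. A minimal sketch of that idea (purely illustrative, not the actual OpenHands eval harness; the image tag, command, and concurrency limit are made-up assumptions):

```python
import asyncio

MAX_CONCURRENT = 16  # tune to available CPU/RAM on the eval machine

async def run_instance(instance_id: str, sem: asyncio.Semaphore) -> int:
    # The semaphore caps how many containers run at once.
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            "docker", "run", "--rm", f"swe-bench-eval:{instance_id}",  # hypothetical image tag
        )
        return await proc.wait()  # container exit code: 0 = tests passed

async def main(instance_ids: list[str]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = await asyncio.gather(*(run_instance(i, sem) for i in instance_ids))
    print(f"{sum(code == 0 for code in results)}/{len(instance_ids)} instances passed")

# asyncio.run(main([f"instance-{n}" for n in range(2000)]))
```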
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Looking forward to this TBH. If only there were a way to decentralize the evaluation across multiple volunteers... but replicability is also an issue.
Hopefully, we will be able to do this soon for #4537 🙏 (with some budget constraints).
Love the progress so far!
Will you guys test and publish results on the full SWE-bench and the 25% subset, in addition to SWE-bench Lite?
The AutoCodeRover repo says 22% on SWE-bench Lite and 16% on the full SWE-bench. However, you have ACR at 16% on SWE-bench Lite. Is that the result you got, or a typo?