
Run evaluation on full SWE-Bench #1693

Open
rezzie-rich opened this issue May 10, 2024 · 22 comments
Labels: enhancement (New feature or request), severity:medium (Affecting multiple users)

Comments

@rezzie-rich

Love the progress so far!

Will you test and publish results on the full SWE-bench and the 25% subset, besides just SWE-bench Lite?

The auto-code-rover repo says 22% on SWE-bench Lite and 16% on the full SWE-bench, but you have ACR at 16% on SWE-bench Lite. Is that the result you got, or a typo?

@rbren (Collaborator) commented May 10, 2024

Thanks for pointing that out! Not sure if it was a typo, or if we were using an old result of theirs.

Let's remove the graph until we can generate a better one.

@neubig (Contributor) commented May 10, 2024

This number is from the most recent version of the AutoCodeRover paper! I think we should clarify this in the graph.

@frankxu2004 (Collaborator) commented May 10, 2024

From the ACR paper:

[image: results table from the AutoCodeRover paper]

Note that ACR-avg is the comparable number here, as it's the average of 3 runs (i.e., a pass@1 rate). The 22.33% number in their repo is ACR-all, the union of the 3 runs (i.e., a pass@3 rate). So it's still a valid comparison: for pass@1, ACR is 16%.
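To make the distinction concrete, here is a minimal sketch (with made-up instance IDs and a pretend benchmark size) of how an averaged pass@1 and a unioned pass@3 diverge for the same three runs:

```python
# Hypothetical per-run results: sets of SWE-bench instance IDs resolved in each run.
runs = [
    {"django-001", "sympy-004"},
    {"django-001", "flask-007"},
    {"django-001", "sympy-004", "requests-002"},
]
total_instances = 10  # pretend benchmark size for the example

# pass@1 (ACR-avg style): average resolve rate across independent runs.
pass_at_1 = sum(len(r) for r in runs) / len(runs) / total_instances

# pass@3 (ACR-all style): an instance counts if ANY of the 3 runs resolved it.
pass_at_3 = len(set().union(*runs)) / total_instances

print(f"pass@1 ~ {pass_at_1:.1%}, pass@3 ~ {pass_at_3:.1%}")  # 23.3% vs 40.0%
```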

@rbren (Collaborator) commented May 10, 2024

Ahh, thank you for the clarification! I'll remove my PR.

@neubig neubig closed this as completed May 10, 2024
@neubig neubig reopened this May 10, 2024
@neubig neubig changed the title SWE-bench result Re-run on full SWE-Bench May 10, 2024
@neubig (Contributor) commented May 10, 2024

So I think the AutoCodeRover number is fine as-is, but I agree we should still run on all of SWE-bench. The main bottleneck is time and cost: it costs about $6,000 to run on all of SWE-bench with GPT-4.
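For context, a rough back-of-envelope on that estimate (assuming the standard 2,294-instance full test split and the 300-instance Lite split; the per-instance cost is just the quoted total divided out, not a measured figure):

```python
full_instances = 2294   # full SWE-bench test split
lite_instances = 300    # SWE-bench Lite
full_cost_usd = 6000    # quoted estimate for a GPT-4 run

cost_per_instance = full_cost_usd / full_instances        # ~$2.62 per instance
lite_cost_estimate = cost_per_instance * lite_instances   # ~$785 for Lite at the same rate

print(f"~${cost_per_instance:.2f}/instance, ~${lite_cost_estimate:.0f} for Lite")
```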

@neubig neubig changed the title Re-run on full SWE-Bench Run evaluation on full SWE-Bench May 10, 2024
@neubig neubig added enhancement New feature or request severity:medium Affecting multiple users and removed question labels May 10, 2024
@libowen2121 (Contributor)

@rezzie-rich Thank you for the question! As @frankxu2004 clarified, we only report the pass@1 results in the graph. Our evaluation containerization only supports SWE-bench Lite for now, and we will extend it to support the full test set!

@rezzie-rich (Author)

GPT-4 is expensive. I think it would be cool if you could run the full benchmark using Llama 3 70B and 8B, as that would give a unique and realistic picture of what to expect when running with an open LLM.

It's hard to compare SWE-bench with humans, but as a rule of thumb, an average junior developer should be able to complete 10-25% while an average senior developer can complete 20-40%.

If we can have OpenDevin complete 25%+ using an open LLM (preferably with fewer than 34B parameters), it's a game changer!

@rezzie-rich (Author) commented May 11, 2024

https://evalplus.github.io/leaderboard.html

The leaderboard scene for open code LLMs is kind of diluted, but I found this up-to-date leaderboard that seems pretty legit.

It lists CodeQwen-1.5-7B-Chat above Claude-3-Opus, right next to GPT-4. A small LLM like this should be able to run the benchmark faster and a lot cheaper than GPT-4.

If the leaderboard is accurate, it makes CodeQwen a valid replacement for GPT-4.

If OpenDevin can complete 20-25% of the full SWE-bench using a 7B model, that would prove the practicality and real-world usefulness of AI agents in software development.

My thoughts:
Testing the agents on smaller models would also be good for marketing and user satisfaction, and would improve the agents' quality. Most people will try OpenDevin after seeing the GPT-4 results but run it with a local model for budget reasons, which creates an unsatisfying experience. Instead, they could see the local model's score and replicate those results, which is more satisfying, and it leaves room for additional performance gains once they switch to a closed LLM. It's better to promise less than you deliver.

@rezzie-rich (Author) commented May 18, 2024

https://chat.lmsys.org/?leaderboard

Llama-3-70B-Instruct is performing better than half of the GPT-4 versions. I think it would be great to have benchmarks done using Llama 3, in the spirit of the open-source community, while keeping the usage practical.

I know quantized models degrade in performance; however, Q8 models are almost indistinguishable from FP16, and a modern high-performance CPU with 128 GB of RAM can handle one while keeping costs relatively low.
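Rough weights-only arithmetic behind that claim (this ignores KV cache and runtime overhead, so treat it as a lower bound):

```python
params_b = 70                      # Llama 3 70B
bytes_per_param = {"fp16": 2, "q8": 1, "q4": 0.5}

for fmt, bpp in bytes_per_param.items():
    weights_gb = params_b * bpp    # 1B params at 1 byte each ~= 1 GB
    fits = "fits" if weights_gb < 128 else "does not fit"
    print(f"{fmt}: ~{weights_gb:.0f} GB of weights -> {fits} in 128 GB RAM")
# fp16: ~140 GB (does not fit), q8: ~70 GB (fits), q4: ~35 GB (fits)
```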

@xingyaoww (Contributor) commented May 19, 2024

@rezzie-rich Good point -- however, Llama 3 only has an 8k context window, which means it is hardly useful in our agent use cases. I just tested the recent DeepSeek-V2 MoE - check the results here: https://huggingface.co/spaces/OpenDevin/evaluation

It got ~5% on SWE-bench Lite, and from what I can tell qualitatively, a lot of the error cases (~70%) are due to the limited context window (32k) of their API. I can only imagine this being far worse on Llama 3 with its 8k window.
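To illustrate why the window size dominates here: a minimal sketch of trimming the oldest agent turns so a request fits a token budget, using tiktoken's cl100k_base encoding as a stand-in tokenizer (the message contents are made up):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_window(messages, budget_tokens=8_000):
    """Drop the oldest non-system messages until the history fits the budget."""
    def total_tokens(msgs):
        return sum(len(enc.encode(m["content"])) for m in msgs)

    kept = list(messages)
    while len(kept) > 1 and total_tokens(kept) > budget_tokens:
        kept.pop(1)  # keep the system prompt (index 0); drop the oldest turn
    return kept

# A long agent trajectory (repo exploration, file reads, test output) blows past
# 8k tokens quickly; a 32k window only delays the same truncation problem.
history = [{"role": "system", "content": "You are a software engineering agent."}]
history += [{"role": "user", "content": "file contents ... " * 500} for _ in range(10)]
print(len(fit_to_window(history, budget_tokens=8_000)))
```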

@rezzie-rich (Author) commented May 19, 2024

https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k

This version has a one-million-token context window.

Btw, LOVE the new huggingface space!

@xingyaoww (Contributor)

@rezzie-rich Thanks a ton for sharing!!! Will try to get some GPUs and test it right away!!!

@BradKML commented Jun 3, 2024

Seconding this, and not just the model switching @rezzie-rich suggests (great idea BTW, if it can be wired into Ollama or some other tool). Are there also alternative benchmarks for checking how well the agents solve competitive coding problems (or data science problems), to confirm quality beyond the underlying LLM? Maybe mixing big and small LLMs (e.g. a Qwen + Llama combo) for added acceleration?

@yuntongzhang

Hi, I'm late to the discussion, but I'd like to give an update on the pass@1 score in the original AutoCodeRover paper.

It turns out that the SWE-bench evaluation environment used in our original experiments gives underestimated scores due to missing system-level dependencies. Some correct patches were deemed wrong after running the SWE-bench acceptance tests in that environment.

Thanks to the SWE-bench-docker project, our original patches were re-evaluated, and the actual pass@1 score is 19% instead of 16%. More details can be found here. The 19% pass@1 score is also reflected on the SWE-bench leaderboard.

@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Jul 25, 2024
@0xdevalias

Shouldn't be stale IMO

@xingyaoww xingyaoww removed the Stale Inactive for 30 days label Jul 25, 2024
@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Aug 25, 2024
@xingyaoww xingyaoww removed the Stale Inactive for 30 days label Aug 25, 2024
@xingyaoww (Contributor) commented Aug 25, 2024

Some updates: we are making progress on the infrastructure side - hopefully we can resolve this in ~2 weeks!

Running OpenHands with 2,000 Docker containers efficiently is not an easy task 😓
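For anyone curious what "efficiently" involves, here is a minimal sketch of the bounded-parallelism idea; the image name, command, and instance IDs are placeholders, not the actual OpenHands evaluation harness:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONTAINERS = 16  # cap concurrent containers to what the host can handle

def evaluate_instance(instance_id: str) -> int:
    """Run one instance in its own throwaway container (placeholder command)."""
    cmd = [
        "docker", "run", "--rm",
        "--memory", "4g", "--cpus", "2",   # keep one runaway instance from starving the rest
        "swebench-eval:latest",            # hypothetical per-benchmark image
        "evaluate", "--instance-id", instance_id,
    ]
    return subprocess.run(cmd, capture_output=True).returncode

instance_ids = [f"instance-{i:04d}" for i in range(2000)]  # stand-in for the real ID list
with ThreadPoolExecutor(max_workers=MAX_CONTAINERS) as pool:
    results = list(pool.map(evaluate_instance, instance_ids))

print(f"{sum(rc == 0 for rc in results)}/{len(results)} containers exited cleanly")
```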

@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 25, 2024
@enyst enyst removed the Stale Inactive for 30 days label Sep 25, 2024
@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Oct 28, 2024
@enyst enyst removed the Stale Inactive for 30 days label Oct 28, 2024
@BradKML commented Oct 28, 2024

Looking forward to this, TBH. If only there were a way to decentralize evaluation across multiple volunteers... but reproducibility is also an issue.

@xingyaoww (Contributor)

Hopefully we may be able to do this soon for #4537 🙏 (with some budget constraints)
