Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edit tool prompt tweaking: only plain-text format is supported #6067

Merged
merged 4 commits into from
Jan 22, 2025

Conversation

li-boxuan
Copy link
Collaborator

@li-boxuan li-boxuan commented Jan 6, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

Tweak edit-related prompts to clarify that edit tools are for plain-text format only.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

Without this PR:

Screenshot 2025-01-05 at 8 45 47 PM Screenshot 2025-01-05 at 8 44 41 PM

With this PR:

Screenshot 2025-01-05 at 9 02 50 PM

and I verified the result is indeed of MS-doc format.

As a follow-up, we could create some micro-agents to remind LLM of python libraries related to office, such as python-docx.


Link of any specific issues this addresses

This is motivated by a failure I observed when running TheAgentCompany benchmark. Specifically, this one: https://github.com/TheAgentCompany/TheAgentCompany/blob/d456bb0d0c528e3495b6400f156be02695d6731f/workspaces/tasks/ds-answer-numerical-data-question/task.md?plain=1#L9


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:a9297a0-nikolaik   --name openhands-app-a9297a0   docker.all-hands.dev/all-hands-ai/openhands:a9297a0

@li-boxuan li-boxuan added the run-eval-m Runs evaluation with 30 instances label Jan 6, 2025
Copy link
Contributor

github-actions bot commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@li-boxuan li-boxuan requested review from xingyaoww and neubig and removed request for xingyaoww January 6, 2025 06:45
@mamoodi mamoodi added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Jan 6, 2025
Copy link
Contributor

github-actions bot commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 6, 2025

The eval pipeline seems broken right now :(

@mamoodi mamoodi added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Jan 6, 2025
Copy link
Contributor

github-actions bot commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 6, 2025

@li-boxuan got it working to a degree. Some issue with posting final comment.. Here it is:
evaluation.zip

@li-boxuan
Copy link
Collaborator Author

Here's the report - seems lower than expected? I am not sure what the baseline shall be but I thought it should be near 50%

SWE-bench Report

This folder contains the evaluation results of the SWE-bench using the official evaluation docker containerization.

Summary

  • submitted instances: 30
  • empty patch instances: 2
  • resolved instances: 10
  • unresolved instances: 20
  • error instances: 0

Resolved Instances

Unresolved Instances

Error Instances

Empty Patch Instances

Incomplete Instances

@li-boxuan
Copy link
Collaborator Author

I noticed that astropy__astropy-12907 was passing at some point: https://github.com/swe-bench/experiments/blob/203cc8fa5b8322983445b314da7038b8b6653d5f/evaluation/verified/20241029_OpenHands-CodeAct-2.1-sonnet-20241022/results/results.json#L14C6-L14C28

I inspected the evaluation log and I don't think the failure was related to this PR. It's either randomness, or a regression somewhere else.

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 7, 2025

Yeah a bit lower. I think 14-15 resolved was the baseline but as you mentioned it might be randomness. Xingyao would know better for sure.

@enyst
Copy link
Collaborator

enyst commented Jan 7, 2025

You may want to run compare_outputs.py:

# diff=6
----------------------------------------------------------------------------------------------------
# x resolved but y not=4
               instance_id                                           report_x                                           report_y
0   astropy__astropy-12907  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
14    django__django-11001  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
2     django__django-11039  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
4     django__django-11179  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
----------------------------------------------------------------------------------------------------
# y resolved but x not=2
             instance_id                                           report_x                                           report_y
18  django__django-10924  {'empty_generation': False, 'resolved': False,...  {'empty_generation': False, 'resolved': True, ...
19  django__django-11049  {'empty_generation': False, 'resolved': False,...  {'empty_generation': False, 'resolved': True, ...
----------------------------------------------------------------------------------------------------
Repository comparison (x resolved vs y resolved):

astropy:
Difference: 1 instances!
X resolved but Y failed: (1 instances)
  ['astropy__astropy-12907']
Y resolved but X failed: (0 instances)

django:
Difference: 1 instances!
X resolved but Y failed: (3 instances)
  ['django__django-11001', 'django__django-11039', 'django__django-11179']
Y resolved but X failed: (2 instances)
  ['django__django-10924', 'django__django-11049']

@xingyaoww
Copy link
Collaborator

xingyaoww commented Jan 7, 2025

@mamoodi
Can we also ran an eval on main to compare with this result? 🤔

If they are comparable, this is probably ok

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 7, 2025

@li-boxuan and @xingyaoww for main:
evaluation-main.zip

@xingyaoww
Copy link
Collaborator

Evaluation results ready for t-main: t-main_25-01-07-17-31.tar.gz

Summary

  • submitted instances: 29
  • empty patch instances: 1
  • resolved instances: 13
  • unresolved instances: 16
  • error instances: 0

Looks like we do have some degradation? 13 -> 10?

@neubig
Copy link
Contributor

neubig commented Jan 8, 2025

Hard to say, maybe we can run a bigger eval?

@li-boxuan
Copy link
Collaborator Author

Seems the noise might have come from All-Hands-AI/openhands-aci#48

discussion on slack: https://openhands-ai.slack.com/archives/C080M7BBSSG/p1736346851359599

@xingyaoww
Copy link
Collaborator

@mamoodi do we have larger eval? e.g., 100?

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 9, 2025

@xingyaoww I can run one with the pipeline but it will have to be manually. Did you want me to go ahead with that? With Sonnet 3.5?

@neubig
Copy link
Contributor

neubig commented Jan 10, 2025

@mamoodi , yeah I think it'd be good to run that! We could click "update branch" and then run it on main and this branch.

@mamoodi
Copy link
Collaborator

mamoodi commented Jan 14, 2025

Sorry folks it's been a busy few days. Is this still pending a 100 instance eval?
I still need to run one on main to figure out what the baseline is. I'm planning to run one on main tonight.
Did you all want to run 100 instance on this PR afterwards? Perhaps tomorrow?

@neubig
Copy link
Contributor

neubig commented Jan 14, 2025

Yeah, sounds good!

@neubig neubig assigned mamoodi and unassigned neubig Jan 18, 2025
@mamoodi
Copy link
Collaborator

mamoodi commented Jan 19, 2025

Alright. Sorry about the wait folks. Here are the results.
I will mention that the main result was from Jan 14 and some changes have gone in which may have affected this.
I believe @xingyaoww will know the real values of the current main.

Eval 100 instances for main:

Summary

  • submitted instances: 100
  • empty patch instances: 12
  • resolved instances: 37
  • unresolved instances: 63
  • error instances: 0

Eval 100 instances for this PR:

Summary

  • submitted instances: 100
  • empty patch instances: 14
  • resolved instances: 43
  • unresolved instances: 57
  • error instances: 0

The sizes of the evals are too large to upload here so I will post it on Slack.

Copy link
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the positive result (43/100) shared by @mamoodi - I believe this PR should be ready to merged. It at least got better than my last one (41/100): #6318 (comment)

@li-boxuan li-boxuan merged commit f9ba16b into main Jan 22, 2025
15 checks passed
@li-boxuan li-boxuan deleted the boxuan/tweak-edit-prompt branch January 22, 2025 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-eval-m Runs evaluation with 30 instances
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants