Documentation, WebArena 2.0, Evaluation Cache #62

Open
wants to merge 8 commits into
base: main

shuyanzhou
Contributor

@shuyanzhou shuyanzhou commented Sep 8, 2024

Documentation

Update README files for compatibility with both WebArena (WA) and VisualWebArena (VWA).

WebArena 2.0

WebArena 2.0 addresses annotation issues reported by various users. Specifically:

  • WebArena 2.0 minimizes the use of exact_match and must_include for information-seeking tasks with StringEvaluator. The migration from old evaluators to new ones generally follows these rules:
- exact_match -> fuzzy_exact_match
- must_include, fuzzy_match -> depends on the number of items in the list:
    - If the list contains 1 item -> fuzzy_exact_match
    - If the list contains > 1 item
       - For elements on the same level, same topic -> fuzzy_must_include
       - For elements on different aspects -> context_qa
- na -> fuzzy_na_match, which explicitly evaluates the reasoning behind unachievable outcomes.
- Reddit post-related -> qa. 
    - `context_qa` evaluates content based on both intent and answer.
    - `qa` evaluates based only on the answer, as the intent is not relevant.
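The migration rules above can be sketched as a small lookup function. This is only an illustration, not the script used in this PR: the function name `migrate_evaluator` and the flags `same_topic` and `reddit_post` are hypothetical, and giving the Reddit rule priority over the others is an assumption, since the rules are described as applying "generally".

```python
from typing import List


def migrate_evaluator(
    old_key: str,
    answers: List[str],
    same_topic: bool = True,
    reddit_post: bool = False,
) -> str:
    """Map an old StringEvaluator key to a WebArena 2.0 evaluator name.

    Hypothetical helper: `same_topic` and `reddit_post` stand in for
    judgments that an annotator would make by hand.
    """
    # Assumption: Reddit post-related tasks always move to `qa`.
    if reddit_post:
        return "qa"
    if old_key == "exact_match":
        return "fuzzy_exact_match"
    if old_key == "na":
        return "fuzzy_na_match"
    if old_key in ("must_include", "fuzzy_match"):
        if not answers:
            raise ValueError("expected at least one reference answer")
        if len(answers) == 1:
            return "fuzzy_exact_match"
        # Several items: same level and topic vs. different aspects.
        return "fuzzy_must_include" if same_topic else "context_qa"
    raise ValueError(f"unknown evaluator key: {old_key!r}")
```

For example, `migrate_evaluator("must_include", ["a", "b"], same_topic=False)` returns `"context_qa"`.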

The prompts are tested in evaluation_harness/eval_evaluators.

  • Other fixes
**Fixes from GitHub issues**
https://github.com/web-arena-x/webarena/issues/100
2: product type is very vague; removed
3: update the intent to indicate a tied rank
4: update the intent to indicate a tied rank
5: type is too vague; add the scope

https://github.com/web-arena-x/webarena/issues/135
45: update the intent to be more accurate

https://github.com/web-arena-x/webarena/issues/137
425: update the intent to be more accurate

**Individual fixes**
Template 324: remove the ranking requirement.
Template 204: use a combination of context_qa and must_include.
792 and 793 were deleted because the reason is not very sound.
Fix errors found by the THU group: [THU-Webarena-lite Bug Fixing](https://docs.google.com/spreadsheets/d/13BRuRlU_Z_UBcucjQ5myvrRdB0P0ID3Nj-dWlzawuYo/edit#gid=1021875443)

**Typo, grammar**
by far -> so far
https://github.com/web-arena-x/webarena/issues/133
correpong -> corresponding 
telll -> tell
canlled -> cancelled
what could -> how could
competative -> competitive

Evaluator

Support a result cache so that evaluation can be run offline. This is helpful if we accept submissions in the future: participants only need to upload their cached files, and we can run the evaluation quickly without rerunning their models.
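A minimal sketch of the idea, assuming one JSON file per task that stores the agent's final answer. The cache layout, the function names `save_result` and `evaluate_from_cache`, and the `score_fn` callback are all hypothetical and are not the format implemented in this PR.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def save_result(cache_dir: Path, task_id: int, answer: str) -> Path:
    """Write one task's final answer to the cache (hypothetical layout)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_id}.json"
    record = {"task_id": task_id, "answer": answer}
    path.write_text(json.dumps(record), encoding="utf-8")
    return path


def evaluate_from_cache(
    cache_dir: Path, score_fn: Callable[[int, str], float]
) -> Dict[int, float]:
    """Score every cached answer offline, without rerunning any model."""
    scores: Dict[int, float] = {}
    for path in sorted(cache_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        scores[record["task_id"]] = score_fn(record["task_id"], record["answer"])
    return scores
```

In this sketch a participant would call `save_result` during their own runs and upload the resulting directory; the organizers would then call `evaluate_from_cache` with whatever scoring function applies to each task.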

@shuyanzhou changed the title from "Document update and evaluation update" to "[WIP] Document update and evaluation update" on Sep 23, 2024
@shuyanzhou
Contributor Author

@kohjingyu @ljang0 can you check a few examples on VWA to make sure nothing is broken?

@shuyanzhou changed the title from "Document update and evaluation update" to "Documentation, WebArena 2.0, Evaluation Cache" on Sep 23, 2024
@kohjingyu kohjingyu mentioned this pull request Nov 3, 2024
1 participant