Documentation, WebArena 2.0, Evaluation Cache #62

shuyanzhou · 2024-09-08T06:12:00Z

Documentation

Update README files for compatibility with both WebArena (WA) and VisualWebArena (VWA).

WebArena 2.0

WebArena 2.0 addresses annotation issues reported by various users. Specifically:

WebArena 2.0 minimizes the use of exact_match and must_include for information-seeking tasks with StringEvaluator. The migration from old evaluators to new ones generally follows these rules:

- exact_match -> fuzzy_exact_match
- must_include, fuzzy_match:
    - If the list contains 1 item -> fuzzy_exact_match
    - If the list contains > 1 item:
        - For elements on the same level and same topic -> fuzzy_must_include
        - For elements covering different aspects -> context_qa
- na -> fuzzy_na_match, which explicitly evaluates the reasoning behind unachievable outcomes.
- Reddit post-related -> qa. 
    - `context_qa` evaluates content based on both intent and answer.
    - `qa` evaluates based only on the answer, as the intent is not relevant.
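
As a rough illustration of the rules above, the mapping could be sketched as follows. This is not the migration script used in this PR: `migrate_rule`, the `same_topic` flag, and the use of `na` as a key are hypothetical and only mirror the bullet list. The Reddit post rule is left out because it depends on the task rather than on the evaluator key.

```python
from typing import Any, Tuple


def migrate_rule(key: str, value: Any, same_topic: bool = True) -> Tuple[str, Any]:
    """Hypothetical helper: map one old evaluator key to its 2.0 counterpart.

    `same_topic` stands in for the annotator's judgment on whether the
    reference items are on the same level and topic.
    """
    if key == "exact_match":
        return "fuzzy_exact_match", value
    if key == "na":
        return "fuzzy_na_match", value
    if key in ("must_include", "fuzzy_match"):
        # Treat a bare string as a one-item list.
        items = value if isinstance(value, list) else [value]
        if len(items) == 1:
            return "fuzzy_exact_match", items[0]
        if same_topic:
            return "fuzzy_must_include", items
        return "context_qa", items
    raise ValueError(f"unrecognized evaluator key: {key}")
```

For example, `migrate_rule("must_include", ["a", "b"], same_topic=False)` selects `context_qa`, while a single-item list collapses to `fuzzy_exact_match`.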

The prompts are tested in evaluation_harness/eval_evaluators.

Other fixes

**Fixes from GitHub issues**
https://github.com/web-arena-x/webarena/issues/100
2: product type is very vague. Removed
3: update the intent to indicate tied rank
4: update the intent to indicate tied rank
5: type is too vague, add the scope

https://github.com/web-arena-x/webarena/issues/135
45: update the intent to be more accurate

https://github.com/web-arena-x/webarena/issues/137
425: update the intent to be more accurate

**Individual fix**
Template 324: remove the ranking requirement.
Template 204: use a combination of context_qa and must_include.
792, 793: deleted because the reasoning is not very sound.
Fix errors found by THU group [THU-Webarena-lite Bug Fixing](https://docs.google.com/spreadsheets/d/13BRuRlU_Z_UBcucjQ5myvrRdB0P0ID3Nj-dWlzawuYo/edit#gid=1021875443) 

**Typo, grammar**
by far -> so far
https://github.com/web-arena-x/webarena/issues/133
correpong -> corresponding 
telll -> tell
canlled -> cancelled
what could -> how could
competative -> competitive

Evaluator

Support a result cache so that evaluation can be run offline. This will be helpful if we accept submissions in the future: participants only need to upload their cached files, and we can perform evaluation quickly without rerunning their models.
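
A minimal sketch of the idea, assuming each task's result is stored as one JSON file named by task id. The file layout and the `save_result` / `evaluate_from_cache` helpers are hypothetical and are not the cache format implemented in this PR.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def save_result(cache_dir: Path, task_id: int, result: dict) -> Path:
    """Write one task's result to <cache_dir>/<task_id>.json (assumed layout)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_id}.json"
    path.write_text(json.dumps(result))
    return path


def evaluate_from_cache(
    cache_dir: Path, score_fn: Callable[[dict], float]
) -> Dict[int, float]:
    """Score every cached result without rerunning the model."""
    scores: Dict[int, float] = {}
    for path in sorted(cache_dir.glob("*.json")):
        result = json.loads(path.read_text())
        scores[int(path.stem)] = score_fn(result)
    return scores
```

Here `score_fn` is a placeholder for whichever evaluator applies to the task; the point is only that scoring reads from disk and never touches the model.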

shuyanzhou · 2024-09-23T05:45:26Z

@kohjingyu @ljang0 can you check a few examples on vwa to make sure it is not broken?

shuyanzhou added 7 commits June 3, 2024 20:52

support caching eval results

8c46848

Merge remote-tracking branch 'upstream/main'

329f615

add initial draft of webarena 2.0

b964245

minor

5aa35ce

merge readme files for vwa and wa

66c43af

merge docker setup readme

cc7dbd1

Merge remote-tracking branch 'upstream/main'

6fc8952

shuyanzhou changed the title ~~Document update and evaluation update [WIP]~~ Document update and evaluation update Sep 23, 2024

shuyanzhou changed the title ~~Document update and evaluation update~~ Documentation, WebArena 2.0, Evaluation Cache Sep 23, 2024

minor

9790d0c

kohjingyu mentioned this pull request Nov 3, 2024

Wrong evaluation #71

Open
