Jean/non hf db models #48
Closed
Conversation
git-subtree-dir: eval/chat_benchmarks/LiveBench
git-subtree-split: 48b283e

…_benchmarks/LiveBench'

ab834ea update livebench releases in download_questions
3055d6a Merge pull request #68 from dylanjcastillo/main
8edea2b update readme
a58e552 Update README.md
a0fc4fe Update changelog.md
1f37ca3 update changelog
fb1380a add changelog
1c779c2 Merge remote-tracking branch 'upstream/main'
85cae69 Update README.md
f050f8c Merge pull request #48 from codepharmer/patch-1
3953c6c fix bug in download_leaderboard
0678e4e bump version
e84e548 add debugging option for AMPS_Hard, math_comp, and web_of_lies_v2
f774d19 handle edge case in tablejoin judging
ee0b887 handle edge case in math olympiad judging
7104134 update api model lists
4de3327 add --livebench-release-option to all runner files
b87e779 Add line separator
174d07c update figure
37b5027 update model_adapter.py
0870422 Update README.md
850d09b update leaderboard
5702272 Merge pull request #30 from johnfelipe/main
ed36110 allow boxed output format for spatial reasoning task
8f7aade add support for nvidia models and remove hard-coded model in vertexai
ffc2364 Create README-es_CO.md
6f3f1e3 add livebench_releases param in download_questions
1fd5165 Remove superfluous arg.
ecc86ce add gpt-4o-2024-08-06
3e8328c Update README.md
4f2dfff add gemini-1.5-pro-exp-0801
2fc42c5 update leaderboard
0ca7c40 Merge pull request #20 from LiveBench/july_release
94fa419 README updates for july release
135ef7c File changes for release 2
e424f7b add scoring for new spatial reasoning task
8df6cd5 update tablereformat scoring to allow for leading phrase
989b5b9 add other Llama-3.1 models
9226afe fixed mistral-large-2407 categorization from api to open
5963499 add mistral
4a0719c add Llama-3.1 70B and 8B
e921527 add llama-3.1-405b to leaderboard
40c0805 update math_comp scoring
4dc9304 fix bug in gen_model_answer.py
8316b5a update leaderboard and readme
93e3a7d add gpt-4o-mini
1264bb0 Merge pull request #16 from LiveBench/usability_updates_july
95dbca5 fix bug in gen_ground_truth_judgment
9640696 July usability updates
428e1e6 Update README.md
b98e0b6 make web_of_lies_v2 parsing less strict
f6ddc97 remove house_traversal because of ambiguous parsing
ca35a04 allow boxed output format for math_comp
028266e Merge pull request #10 from eltociear/patch-1
7503f15 update leaderboard
5556d06 docs: update DATASHEET.md
fb10035 update leaderboard
24e6c89 add claude-3-5 to anthropic model list
d008154 Merge pull request #8 from LiveBench/dev/colin
ecd8530 update leaderboard
977a06c make house_traversal judging stricter
991124c Update pyproject.toml
fa94aaf Update CONTRIBUTING.md
192baf2 Update README.md
ca372cf Merge pull request #4 from sanjay920/fix_typo
905b71f in the `tablejoin` task processing utility, handle the case where the `ground_truth` variable is not a dict (instead a string), using `ast.literal_eval` to convert it to a dict, which is necessary for processing and comparing the ground truth to the LLM output
19b94ab Merge pull request #1 from superoo7/patch-1
e896ff2 renamed livebench/show_livebench_result.py -> livebench/show_livebench_results.py
ab063ff feat: add wheel and langdetect to dependencies
03156c0 add documentation
5dd3657 Merge pull request #3 from LiveBench/dev/task_cat_update
2347dec update LiveBench to: unify the nomenclature of "category" (a meta-group, of which there are currently 6) and "task" (a specific type of question, of which there are currently 18), including removing the `grouping` key from various files (which was done to hf datasets); improve the `download_leaderboard.py` script to separate each model's answers into its own jsonl file per task; update the processing of `instruction_following` tasks to manage this change (updates to `livebench/if_runner/instruction_following_eval/evaluation_main.py` and `livebench/process_results/instruction_following/utils.py`); permit a trailing `/` when running a command like `python gen_ground_truth_judgment.py --bench-name live_bench/`
14872ce README asset fix
d4ecb47 Update README.md
9ac5f85 Update README.md
0fb5e4a Update README.md
c01596b Delete livebench/if_runner/live_data.py
262cfc1 Delete livebench/if_runner/README.md
05da363 Delete livebench/if_runner/Number_of_Instructions_(IFEval_Live_Bench).pdf
9d45efd Delete livebench/if_runner/utils.py
6b7e52c Update README.md
cc71a6c Update pyproject.toml
113b307 Update README.md
686be1e LiveBench file drop
5edcd77 Create README.md
REVERT: 48b283e [wip] docker build image
REVERT: 326788c Merge pull request #35 from mlfoundations/etashg/check_db_results
REVERT: 6af9066 linted
REVERT: 1e1ba0d edited readme
REVERT: 2603d19 checks if task is already done
REVERT: 4878904 checks if task is already done
REVERT: f8481f5 Merge pull request #34 from sedrick-keh-tri/eval-config-yaml
REVERT: ff7bc85 support reading from config yamls
REVERT: fae070e remove unused arguments
REVERT: 563a312 Merge pull request #33 from mlfoundations/etashg/bibtex
REVERT: 922170d Update README.md
REVERT: e8531fe Merge pull request #32 from mlfoundations/etashg/bs_debug
REVERT: a9d57a7 added bibtex
REVERT: 0bd3a24 added citation
REVERT: 42206b1 reverted some files to main
REVERT: b332fcf added pyproject toml fix
REVERT: f6ae3d9 Merge pull request #30 from SGEthan/main
REVERT: cb19994 fix results missed when uploading to db
REVERT: 5d6ba5a debug successful
REVERT: 639af29 Merge pull request #26 from mlfoundations/jean/parallel_wildbench
REVERT: d0a6946 Update pyproject.toml
REVERT: 591c023 Update pyproject.toml
REVERT: 4b1ddb0 Merge pull request #27 from mlfoundations/negin/clean_mtbench
REVERT: 8d1be1a removed extra files
REVERT: b7b27bb Multithread api call in Wildbench
REVERT: 757bf3b Merge pull request #24 from mlfoundations/etashg/eval_database_fix
REVERT: 9673ee2 added database fix
REVERT: dcfc11d Initial commit

git-subtree-dir: eval/chat_benchmarks/LiveBench
git-subtree-split: ab834ea6532aa042943afc3b228372f88415473d
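Commit 905b71f in the log above describes the one non-trivial edge case: a `tablejoin` ground truth that arrives as a string rather than a dict, parsed with `ast.literal_eval`. The sketch below is a minimal illustration of that idea; the helper name `normalize_ground_truth` is hypothetical, not LiveBench's actual function.

```python
import ast

def normalize_ground_truth(ground_truth):
    """Return the tablejoin ground truth as a dict.

    Hypothetical helper illustrating commit 905b71f: a stringified ground
    truth is parsed so it can be compared against the LLM output.
    """
    if isinstance(ground_truth, dict):
        return ground_truth
    if isinstance(ground_truth, str):
        # ast.literal_eval safely parses Python literals such as
        # "{'key': [1, 2]}" without executing arbitrary code.
        parsed = ast.literal_eval(ground_truth)
        if isinstance(parsed, dict):
            return parsed
        raise ValueError(f"expected a dict literal, got {type(parsed).__name__}")
    raise TypeError(f"unsupported ground_truth type: {type(ground_truth).__name__}")

# Both forms normalize to the same dict.
assert normalize_ground_truth({"a": 1}) == normalize_ground_truth("{'a': 1}")
```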
Some code should not have been included.
Change some DB functions to handle non-Hugging Face models.
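The PR gives only this one-line description. As a hedged sketch of what handling non-Hugging Face models in a results database could look like, the example below tags each model row with a source instead of assuming every name is an `org/repo` Hugging Face ID. The `register_model` helper and `models` table are hypothetical, not the repository's actual schema or API.

```python
import sqlite3

def register_model(conn: sqlite3.Connection, model_name: str, source: str = "hf") -> None:
    """Insert a model row without assuming the name is a Hugging Face repo ID.

    Hypothetical sketch: tag each model with a source (hf, api, local)
    rather than treating every name as an 'org/repo' HF identifier.
    """
    if source == "hf" and "/" not in model_name:
        # A bare name like "gpt-4o-mini" cannot be an HF repo ID, so fall
        # back to treating it as an API-served model instead of failing a
        # Hugging Face metadata lookup.
        source = "api"
    conn.execute(
        "INSERT OR IGNORE INTO models (name, source) VALUES (?, ?)",
        (model_name, source),
    )
    conn.commit()

# Usage with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, source TEXT)")
register_model(conn, "meta-llama/Llama-3.1-8B")  # stays source="hf"
register_model(conn, "gpt-4o-mini")              # reclassified as "api"
```

The design choice mirrored here is the defensive fallback: rather than raising when a model name has no Hugging Face counterpart, the DB layer records it under a different source so results from API-only or local models can still be stored and queried.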