Jean/non hf db models #48
Closed
Conversation
git-subtree-dir: eval/chat_benchmarks/LiveBench
git-subtree-split: 48b283e

…_benchmarks/LiveBench'

ab834ea update livebench releases in download_questions
3055d6a Merge pull request #68 from dylanjcastillo/main
8edea2b update readme
a58e552 Update README.md
a0fc4fe Update changelog.md
1f37ca3 update changelog
fb1380a add changelog
1c779c2 Merge remote-tracking branch 'upstream/main'
85cae69 Update README.md
f050f8c Merge pull request #48 from codepharmer/patch-1
3953c6c fix bug in download_leaderboard
0678e4e bump version
e84e548 add debugging option for AMPS_Hard, math_comp, and web_of_lies_v2
f774d19 handle edge case in tablejoin judging
ee0b887 handle edge case in math olympiad judging
7104134 update api model lists
4de3327 add --livebench-release-option to all runner files
b87e779 Add line separator
174d07c update figure
37b5027 update model_adapter.py
0870422 Update README.md
850d09b update leaderboard
5702272 Merge pull request #30 from johnfelipe/main
ed36110 allow boxed output format for spatial reasoning task
8f7aade add support for nvidia models and remove hard-coded model in vertexai
ffc2364 Create README-es_CO.md
6f3f1e3 add livebench_releases param in download_questions
1fd5165 Remove superfluous arg.
ecc86ce add gpt-4o-2024-08-06
3e8328c Update README.md
4f2dfff add gemini-1.5-pro-exp-0801
2fc42c5 update leaderboard
0ca7c40 Merge pull request #20 from LiveBench/july_release
94fa419 README updates for july release
135ef7c File changes for release 2
e424f7b add scoring for new spatial reasoning task
8df6cd5 update tablereformat scoring to allow for leading phrase
989b5b9 add other Llama-3.1 models
9226afe fixed mistral-large-2407 categorization from api to open
5963499 add mistral
4a0719c add Llama-3.1 70B and 8B
e921527 add llama-3.1-405b to leaderboard
40c0805 update math_comp scoring
4dc9304 fix bug in gen_model_answer.py
8316b5a update leaderboard and readme
93e3a7d add gpt-4o-mini
1264bb0 Merge pull request #16 from LiveBench/usability_updates_july
95dbca5 fix bug in gen_ground_truth_judgment
9640696 July usability updates
428e1e6 Update README.md
b98e0b6 make web_of_lies_v2 parsing less strict
f6ddc97 remove house_traversal because of ambiguous parsing
ca35a04 allow boxed output format for math_comp
028266e Merge pull request #10 from eltociear/patch-1
7503f15 update leaderboard
5556d06 docs: update DATASHEET.md
fb10035 update leaderboard
24e6c89 add claude-3-5 to anthropic model list
d008154 Merge pull request #8 from LiveBench/dev/colin
ecd8530 update leaderboard
977a06c make house_traversal judging stricter
991124c Update pyproject.toml
fa94aaf Update CONTRIBUTING.md
192baf2 Update README.md
ca372cf Merge pull request #4 from sanjay920/fix_typo
905b71f in the `tablejoin` task processing utility, handle the case where the `ground_truth` variable is not a dict (instead a string), using `ast.literal_eval` to convert it to a dict, which is necessary for processing and comparing the ground truth to the LLM output
19b94ab Merge pull request #1 from superoo7/patch-1
e896ff2 renamed livebench/show_livebench_result.py -> livebench/show_livebench_results.py
ab063ff feat: add wheel and langdetect to dependencies
03156c0 add documentation
5dd3657 Merge pull request #3 from LiveBench/dev/task_cat_update
2347dec update LiveBench to: unify the nomenclature of "category" (a meta-group, of which there are currently 6) and "task" (a specific type of question, of which there are currently 18), including removing the `grouping` key from various files (which was done to hf datasets); improve the `download_leaderboard.py` script to separate each model's answers into its own jsonl file per task; update the processing of `instruction_following` tasks to manage this change (updates to `livebench/if_runner/instruction_following_eval/evaluation_main.py` and `livebench/process_results/instruction_following/utils.py`); permit a trailing `/` when running a command like `python gen_ground_truth_judgment.py --bench-name live_bench/`
14872ce README asset fix
d4ecb47 Update README.md
9ac5f85 Update README.md
0fb5e4a Update README.md
c01596b Delete livebench/if_runner/live_data.py
262cfc1 Delete livebench/if_runner/README.md
05da363 Delete livebench/if_runner/Number_of_Instructions_(IFEval_Live_Bench).pdf
9d45efd Delete livebench/if_runner/utils.py
6b7e52c Update README.md
cc71a6c Update pyproject.toml
113b307 Update README.md
686be1e LiveBench file drop
5edcd77 Create README.md
REVERT: 48b283e [wip] docker build image
REVERT: 326788c Merge pull request #35 from mlfoundations/etashg/check_db_results
REVERT: 6af9066 linted
REVERT: 1e1ba0d edited readme
REVERT: 2603d19 checks if task is already done
REVERT: 4878904 checks if task is already done
REVERT: f8481f5 Merge pull request #34 from sedrick-keh-tri/eval-config-yaml
REVERT: ff7bc85 support reading from config yamls
REVERT: fae070e remove unused arguments
REVERT: 563a312 Merge pull request #33 from mlfoundations/etashg/bibtex
REVERT: 922170d Update README.md
REVERT: e8531fe Merge pull request #32 from mlfoundations/etashg/bs_debug
REVERT: a9d57a7 added bibtex
REVERT: 0bd3a24 added citation
REVERT: 42206b1 reverted some files to main
REVERT: b332fcf added pyproject toml fix
REVERT: f6ae3d9 Merge pull request #30 from SGEthan/main
REVERT: cb19994 fix results missed when uploading to db
REVERT: 5d6ba5a debug successful
REVERT: 639af29 Merge pull request #26 from mlfoundations/jean/parallel_wildbench
REVERT: d0a6946 Update pyproject.toml
REVERT: 591c023 Update pyproject.toml
REVERT: 4b1ddb0 Merge pull request #27 from mlfoundations/negin/clean_mtbench
REVERT: 8d1be1a removed extra files
REVERT: b7b27bb Multithread api call in Wildbench
REVERT: 757bf3b Merge pull request #24 from mlfoundations/etashg/eval_database_fix
REVERT: 9673ee2 added database fix
REVERT: dcfc11d Initial commit

git-subtree-dir: eval/chat_benchmarks/LiveBench
git-subtree-split: ab834ea6532aa042943afc3b228372f88415473d
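Commit 905b71f in the log above describes the one non-trivial edge case: a `tablejoin` ground truth that arrives as a string rather than a dict, parsed with `ast.literal_eval`. The sketch below is a minimal illustration of that idea; the helper name `normalize_ground_truth` is hypothetical, not LiveBench's actual function.

```python
import ast

def normalize_ground_truth(ground_truth):
    """Return the tablejoin ground truth as a dict.

    Hypothetical helper illustrating commit 905b71f: a stringified ground
    truth is parsed so it can be compared against the LLM output.
    """
    if isinstance(ground_truth, dict):
        return ground_truth
    if isinstance(ground_truth, str):
        # ast.literal_eval safely parses Python literals such as
        # "{'key': [1, 2]}" without executing arbitrary code.
        parsed = ast.literal_eval(ground_truth)
        if isinstance(parsed, dict):
            return parsed
        raise ValueError(f"expected a dict literal, got {type(parsed).__name__}")
    raise TypeError(f"unsupported ground_truth type: {type(ground_truth).__name__}")

# Both forms normalize to the same dict.
assert normalize_ground_truth({"a": 1}) == normalize_ground_truth("{'a': 1}")
```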
Some code should not have been included.
Change some DB functions to handle non-Hugging Face models.
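The PR gives only this one-line description. As a hedged sketch of what handling non-Hugging Face models in a results database could look like, the example below tags each model row with a source instead of assuming every name is an `org/repo` Hugging Face ID. The `register_model` helper and `models` table are hypothetical, not the repository's actual schema or API.

```python
import sqlite3

def register_model(conn: sqlite3.Connection, model_name: str, source: str = "hf") -> None:
    """Insert a model row without assuming the name is a Hugging Face repo ID.

    Hypothetical sketch: tag each model with a source (hf, api, local)
    rather than treating every name as an 'org/repo' HF identifier.
    """
    if source == "hf" and "/" not in model_name:
        # A bare name like "gpt-4o-mini" cannot be an HF repo ID, so fall
        # back to treating it as an API-served model instead of failing a
        # Hugging Face metadata lookup.
        source = "api"
    conn.execute(
        "INSERT OR IGNORE INTO models (name, source) VALUES (?, ?)",
        (model_name, source),
    )
    conn.commit()

# Usage with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, source TEXT)")
register_model(conn, "meta-llama/Llama-3.1-8B")  # stays source="hf"
register_model(conn, "gpt-4o-mini")              # reclassified as "api"
```

The design choice mirrored here is the defensive fallback: rather than raising when a model name has no Hugging Face counterpart, the DB layer records it under a different source so results from API-only or local models can still be stored and queried.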