
no targets in rummlu and other benchmarks #23

Open

thehir0 opened this issue Jun 6, 2024 · 12 comments

Comments

@thehir0 (Contributor) commented Jun 6, 2024

In this tree, rummlu and other benchmarks have no targets: https://github.com/ai-forever/MERA/tree/update/new_harness_codebase

@thehir0 (Contributor, Author) commented Jun 6, 2024

`outputs=""` and `targets=""`

@germanjke commented Jun 6, 2024

We observe inconsistent task performance across different datasets: leaderboard tasks return a performance metric of zero, while non-leaderboard tasks perform as expected.

```shell
python lm_eval/__main__.py \
    --model vllm \
    --write_out \
    --output_path results.json \
    --model_args pretrained=/workspace/Llama-2-7B-bf16-sharded/,tensor_parallel_size=1,dtype="bfloat16",gpu_memory_utilization=0.8 \
    --tasks rummlu,ruhatespeech... \
    --include_path=/benchmarks/benchmark_tasks \
    --num_fewshot 0 \
    --log_samples
```

(same result with `--num_fewshot 5`)

gives me

```
vllm (pretrained=/tgpt/biglm/biglm/ckpts/superllama/llama3-8b-rumix-v1.5-500ba-lit-w-benchmarks-masks+gc100-const3e-5/huggingface/ba6000/,tensor_parallel_size=1,dtype=bfloat16,gpu_memory_utilization=0.6), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|     Tasks     |Version|Filter |n-shot|                Metric                 |Value |   |Stderr|
|---------------|-------|-------|-----:|---------------------------------------|-----:|---|------|
|use            |      0|metrics|     0|grade_norm                             |0.0000|±  |N/A   |
|tape           |N/A    |metrics|     5|f1_macro                               |0.0000|±  |N/A   |
|               |       |metrics|     5|acc                                    |0.0000|±  |0.0000|
|               |       |metrics|     5|em                                     |0.0000|±  |0.0000|
|               |       |metrics|     5|f1                                     |0.0000|±  |0.0000|
| - chegeka     |      0|metrics|     4|f1                                     |0.0000|±  |0.0000|
|               |       |metrics|     4|em                                     |0.0000|±  |0.0000|
| - multiq      |      0|metrics|     0|f1                                     |0.0000|±  |0.0000|
|               |       |metrics|     0|em                                     |0.0000|±  |0.0000|
| - ruopenbookqa|      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|               |       |metrics|     5|f1_macro                               |0.0000|±  |N/A   |
| - ruworldtree |      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|               |       |metrics|     5|f1_macro                               |0.0000|±  |N/A   |
|simplear       |      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|rutie          |      0|metrics|     0|acc                                    |0.0000|±  |0.0000|
|rumultiar      |      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|rumodar        |      0|metrics|     0|acc                                    |0.0000|±  |0.0000|
|rummlu         |      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_mathematics            |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_medicine                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_human_sexuality                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_geography              |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_econometrics                       |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_macroeconomics         |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_computer_science       |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_nutrition                          |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_microeconomics         |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_formal_logic                       |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_conceptual_physics                 |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_world_history          |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_moral_disputes                     |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_logical_fallacies                  |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_biology                |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_abstract_algebra                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_medical_genetics                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_marketing                          |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_biology                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_virology                           |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_world_religions                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_global_facts                       |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_computer_science           |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_government_and_politics|0.0000|±  |0.0000|
|               |       |metrics|     5|acc_professional_medicine              |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_clinical_knowledge                 |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_jurisprudence                      |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_professional_psychology            |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_public_relations                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_us_foreign_policy                  |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_philosophy                         |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_management                         |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_statistics             |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_european_history       |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_miscellaneous                      |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_machine_learning                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_us_history             |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_electrical_engineering             |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_psychology             |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_international_law                  |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_mathematics                |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_professional_accounting            |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_security_studies                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_sociology                          |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_elementary_mathematics             |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_professional_law                   |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_prehistory                         |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_chemistry                  |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_physics                |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_college_physics                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_business_ethics                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_moral_scenarios                    |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_anatomy                            |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_computer_security                  |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_human_aging                        |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_astronomy                          |0.0000|±  |0.0000|
|               |       |metrics|     5|acc_high_school_chemistry              |0.0000|±  |0.0000|
|ruhumaneval    |      0|scoring|     0|pass@1                                 |0.0000|±  |0.0000|
|               |       |scoring|     0|pass@5                                 |0.0000|±  |0.0000|
|               |       |scoring|     0|pass@10                                |0.0000|±  |0.0000|
|ruhhh          |      0|metrics|     0|acc                                    |0.5955|±  |0.0369|
|               |       |metrics|     0|acc_helpful                            |0.6271|±  |0.0635|
|               |       |metrics|     0|acc_honest                             |0.5902|±  |0.0635|
|               |       |metrics|     0|acc_harmless                           |0.5690|±  |0.0656|
|ruhatespeech   |      0|metrics|     0|acc                                    |0.6528|±  |0.0293|
|               |       |metrics|     0|acc_men                                |0.8000|±  |0.0686|
|               |       |metrics|     0|acc_lgbt                               |0.6471|±  |0.1195|
|               |       |metrics|     0|acc_women                              |0.6204|±  |0.0469|
|               |       |metrics|     0|acc_other                              |0.6393|±  |0.0620|
|               |       |metrics|     0|acc_migrants                           |0.5714|±  |0.2020|
|               |       |metrics|     0|acc_nationalities                      |0.6486|±  |0.0796|
|ruethics       |      0|metrics|     0|mcc_correct_virtue                     |0.1118|±  |0.0477|
|               |       |metrics|     0|mcc_correct_law                        |0.0940|±  |0.0472|
|               |       |metrics|     0|mcc_correct_moral                      |0.0981|±  |0.0465|
|               |       |metrics|     0|mcc_correct_justice                    |0.0907|±  |0.0475|
|               |       |metrics|     0|mcc_correct_utilitarianism             |0.0683|±  |0.0441|
|               |       |metrics|     0|mcc_ethical_virtue                     |0.1140|±  |0.0439|
|               |       |metrics|     0|mcc_ethical_law                        |0.0877|±  |0.0435|
|               |       |metrics|     0|mcc_ethical_moral                      |0.0963|±  |0.0430|
|               |       |metrics|     0|mcc_ethical_justice                    |0.1163|±  |0.0444|
|               |       |metrics|     0|mcc_ethical_utilitarianism             |0.0615|±  |0.0416|
|               |       |metrics|     0|mcc_good_virtue                        |0.2004|±  |0.0420|
|               |       |metrics|     0|mcc_good_law                           |0.1901|±  |0.0421|
|               |       |metrics|     0|mcc_good_moral                         |0.1876|±  |0.0421|
|               |       |metrics|     0|mcc_good_justice                       |0.2000|±  |0.0419|
|               |       |metrics|     0|mcc_good_utilitarianism                |0.1431|±  |0.0420|
|rudetox        |      0|scoring|     0|j                                      |0.0555|±  |0.0035|
|               |       |scoring|     0|sta                                    |0.1260|±  |0.0072|
|               |       |scoring|     0|sim                                    |0.8555|±  |0.0073|
|               |       |scoring|     0|fl                                     |0.6267|±  |0.0099|
|rsg            |N/A    |metrics|     0|f1_macro                               |0.0000|±  |N/A   |
|               |       |metrics|     0|acc                                    |0.0000|±  |0.0000|
| - parus       |      0|metrics|     0|acc                                    |0.0000|±  |0.0000|
| - rcb         |      0|metrics|     0|acc                                    |0.0000|±  |0.0000|
|               |       |metrics|     0|f1_macro                               |0.0000|±  |N/A   |
| - rwsd        |      0|metrics|     0|acc                                    |0.0000|±  |0.0000|
|mathlogicqa    |      0|metrics|     5|acc                                    |0.0000|±  |0.0000|
|lcs            |      0|metrics|     2|acc                                    |0.0000|±  |0.0000|
|bps            |      0|metrics|     2|acc                                    |0.0000|±  |0.0000|

|Groups|Version|Filter |n-shot| Metric |Value|   |Stderr|
|------|-------|-------|-----:|--------|----:|---|------|
|tape  |N/A    |metrics|     5|f1_macro|    0|±  |N/A   |
|      |       |metrics|     5|acc     |    0|±  |0.0000|
|      |       |metrics|     5|em      |    0|±  |0.0000|
|      |       |metrics|     5|f1      |    0|±  |0.0000|
|rsg   |N/A    |metrics|     0|f1_macro|    0|±  |N/A   |
|      |       |metrics|     0|acc     |    0|±  |0.0000|
```

We checked all MERA tasks and observed the same pattern:

Leaderboard tasks (return zeros):
BPS, LCS, RCB, USE, RWSD, PARus, ruTiE, MultiQ, ruMMLU, CheGeKa, ruModAr, SimpleAr, ruMultiAr, MathLogicQA, ruHumanEval, ruWorldTree, ruOpenBookQA

Non-leaderboard tasks (work fine):
ruHHH, ruHateSpeech, ruDetox, ruEthics

@LSinev (Collaborator) commented Jun 6, 2024

> leaderboard tasks are returning a performance metric of zero

This is expected behaviour, as no targets are provided; `--predict_only` should be used for such tasks.

It seems you both get expected behaviour, just like #3.
Nothing changed in this regard with the new codebase. No targets are provided, intentionally, as the test sets are closed and are supposed to be scored at the site with the leaderboard.

See the proper way to run the benchmark with the provided shell script https://github.com/ai-forever/MERA/blob/update/new_harness_codebase/scripts/run_benchmark.sh and the instructions https://github.com/ai-forever/MERA/blob/update/new_harness_codebase/MODEL_SCORING.md#run-full-benchmark-with-bash-script

The new code gives you the ability to use splits with targets provided (without the `--predict_only` option), but in this case the tasks are named like `parus_trainscore`, `multiq_trainscore`, and so on. Scores of `*_trainscore` tasks are not supposed to match the leaderboard at all.

@thehir0 (Contributor, Author) commented Jun 6, 2024

> This new code gives you ability to use splits with targets provided (not using option --predict_only), but tasks in this case are named like parus_trainscore, multiq_trainscore and so on. Scores of *_trainscore tasks are not supposed to match leaderboard at all.

Does this mean the dataset published on HF is now considered trainscore?

@LSinev (Collaborator) commented Jun 6, 2024

As you may see, the dataset at HF has several splits. The split with no targets is used for official benchmarking through the MERA website. What has changed in this branch is just that we added tasks which use the splits with provided targets; these are not supposed to be used for the leaderboard with its closed test set.

You can see it has not changed at HF, and its contents can be viewed there too (screenshots of the multiq dataset):

- No changes in 6 months (screenshot)
- Split with no target, used for the official MERA benchmark (screenshot)
- Split with targets provided, previously not used (screenshot)

And here one can see how `multiq_trainscore` is configured: https://github.com/ai-forever/MERA/blob/update/new_harness_codebase/benchmark_tasks/tape/multiq_trainscore.yaml#L7

Adding `*_trainscore` tasks to this branch was inspired by #3 (comment).
Their usefulness is very niche in nature.

@germanjke commented

Thank you!

@thehir0 (Contributor, Author) commented Jun 13, 2024

Thank you for your responses! I would like one more clarification: what is the difference between gen and non-gen tasks in the context of metrics, and what is the purpose of having both? For example: `bps_gen_trainscore` vs `bps_trainscore`.

@LSinev (Collaborator) commented Jun 13, 2024

`*_gen*` tasks can be run with models/APIs which do not support logit outputs. The main purpose of having both available is to give the community ways to research and find better approaches to model scoring. A bigger research community may find a better way to score models with and without logit-output support on the same leaderboard.
Among ideas to check are:

  • find better processing/filtering regexps for `_gen` tasks (look at the existing patterns: digit_choice_gen_task.yaml and letter_choice_gen_task.yaml).
  • use some advanced score calculation based on both variants of the same task (for models supporting both of them), for example:
    • if the scores differ significantly, then the model is known to be preferably run in one mode or the other (ordinary or `_gen`).
    • the resulting score may be some combination such as a weighted average, or a max/min.

There is also room for discussion on whether originally multiple_choice tasks may be "converted" to `_gen` variants without changing the task name, since the scoring methodology changes.
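The combined-score idea above could look something like the following sketch. The function name, the default weight, and the disagreement threshold are all illustrative assumptions, not anything defined by MERA:

```python
def combine_scores(logit_acc: float, gen_acc: float,
                   weight_logit: float = 0.5, gap_threshold: float = 0.1):
    """Combine the ordinary (logit-based) and _gen accuracies for one task.

    Returns the weighted average, plus a flag indicating whether the two
    modes disagree enough that the model is preferably run in one mode.
    Weights and threshold are placeholders for discussion.
    """
    combined = weight_logit * logit_acc + (1 - weight_logit) * gen_acc
    prefers_one_mode = abs(logit_acc - gen_acc) > gap_threshold
    return combined, prefers_one_mode

# A model that is noticeably stronger with logit scoring:
score, flagged = combine_scores(0.72, 0.55)
```

Max/min variants would just replace the weighted average with `max(logit_acc, gen_acc)` or `min(...)`.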

@thehir0 (Contributor, Author) commented Jun 17, 2024

> Main purpose in having them both available is giving to community ways to research and find the way for better model scoring.

It's great to have both of them for users, but which one will be used in the leaderboard on the website after moving to version 0.4.0? The difference in results can be huge, especially with 0-shot tasks. At the very least, it will help to get somewhat closer to the private metrics before submitting, simply by using trainscore.

@thehir0 thehir0 closed this as completed Jun 17, 2024
@thehir0 thehir0 reopened this Jun 17, 2024
@LSinev (Collaborator) commented Jun 17, 2024

> which one will be used in the leaderboard

Start discussions in different communities. On one hand, the `_gen` version puts all models on the same leaderboard as APIs (which do not have logits available); on the other hand, the classic (logit-based) tasks seem to be more academically (and thus mathematically) backed. There may be other pros and cons.

@thehir0 (Contributor, Author) commented Jun 17, 2024

I would like to propose a solution: use the generative ones, but regenerate on the same prompt several times. At the same time, it is important to shuffle the correct answer, because some models are biased toward certain options. For example, I saw that GPT-3.5 Turbo prefers option C (paper).

Yes, this approach has the disadvantage that regenerating n times slows things down n times, but it will bring the metric values closer to a more objective, robust value.
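The proposal above (repeated generation with shuffled answer letters) could be sketched roughly like this. Everything here is hypothetical: `generate` stands in for the model call, the letter format and majority vote are just one possible design, not MERA code:

```python
import random
from collections import Counter

def shuffled_runs(question: str, options: list[str], correct_idx: int,
                  generate, n_runs: int = 5, base_seed: int = 0):
    """Ask the same question n_runs times with shuffled answer letters.

    `generate(prompt) -> str` is a stand-in for the model call and must
    return one of the option letters ("A", "B", ...). Each run shuffles
    the options with its own seed and tracks where the correct answer
    landed, so positional bias (e.g. a preference for "C") averages out.
    Returns the majority-voted correctness over all runs.
    """
    letters = "ABCD"
    verdicts = []
    for run in range(n_runs):
        order = list(range(len(options)))
        random.Random(base_seed + run).shuffle(order)
        prompt = question + "\n" + "\n".join(
            f"{letters[pos]}. {options[orig]}" for pos, orig in enumerate(order))
        correct_letter = letters[order.index(correct_idx)]
        verdicts.append(generate(prompt) == correct_letter)
    return Counter(verdicts).most_common(1)[0][0]
```

A model with a fixed letter bias would only be counted correct on the runs where the right option happens to land on its preferred letter, which is exactly the effect the shuffling is meant to expose.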

@LSinev (Collaborator) commented Jun 18, 2024

Thank you for your proposal.

> regenerate on the same prompt several times

Greedy generation is the default, so one needs to change the seed every run. Should the seeds be the same (declared publicly in the code) across all tasks and all models?
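One conceivable way to make the seeds public and identical across models (a sketch of the question above, not anything MERA defines) is to derive each run's seed deterministically from a published base seed and the task name:

```python
import hashlib

def run_seed(task: str, run_idx: int, base: int = 1234) -> int:
    """Derive a reproducible per-(task, run) seed from a public base seed.

    Everyone running the benchmark gets the same seed sequence for a
    given task, independent of the model under evaluation. The base
    value and the hashing scheme are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{base}:{task}:{run_idx}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Five publicly reproducible seeds for one task:
seeds = [run_seed("rummlu", i) for i in range(5)]
```

Whether the base seed should vary per task, or be shared benchmark-wide, is exactly the open question.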

> shuffle the correct answer

Some dynamic dataset creation/modification? Would you like to propose a PR with an example solution? The public_test splits have targets provided, so they may be used for a proof of concept. Please provide some way to determine which answer was correct on the server side (you may want to take a look at the packing code and scoring examples).

> regenerating n times will slow down n times

Some combinations of models and hardware take more than 24 hours to run the full MERA suite; "several times" more compute power will be needed (and time is money). If, for some reason (hardware, OOM due to shared compute resources, or bugs in the code), one generation fails in the middle of a task, the whole run has to be done again.

As an addition to this idea, to save some running time, batch_size > 1 may be used when several seeds are involved, at least for tasks and models that do not use a big context window.

In manual mode, this concept can be realized with the available means. Once several MERA-based papers appear (creating a trend in this way of using MERA), this proposal may become the "standard" way to get leaderboard scores.
