No targets in rummlu and other benchmarks #23
Running rummlu and other benchmark tasks in this tree: https://github.com/ai-forever/MERA/tree/update/new_harness_codebase gives me outputs="" and targets="".

We observe inconsistent task performance across different datasets: leaderboard tasks return a performance metric of zero, while non-leaderboard tasks perform as expected.
We checked all MERA tasks and we observe the same inconsistent task performance across different datasets: leaderboard tasks return a performance metric of zero, while non-leaderboard tasks perform as expected. Leaderboard tasks -> run, but with zeros. Non-leaderboard tasks -> work fine.
This is expected behaviour, as no targets are provided in those splits. It seems you both get expected behaviour, just like in #3. See the proper way to run the benchmark with the provided shell script https://github.com/ai-forever/MERA/blob/update/new_harness_codebase/scripts/run_benchmark.sh and the instructions https://github.com/ai-forever/MERA/blob/update/new_harness_codebase/MODEL_SCORING.md#run-full-benchmark-with-bash-script. This new code gives you the ability to use splits with targets provided (not using option …)
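To illustrate why zeros are expected here, a minimal toy sketch (not MERA code, assuming an exact-match style metric): any comparison against an empty, withheld reference scores zero regardless of what the model produced.

```python
# Toy illustration (not MERA code): exact-match scoring against empty targets.
def exact_match(predictions, targets):
    """Fraction of predictions that exactly equal a non-empty reference."""
    assert len(predictions) == len(targets)
    hits = sum(
        pred.strip() == tgt.strip() and tgt.strip() != ""
        for pred, tgt in zip(predictions, targets)
    )
    return hits / len(targets) if targets else 0.0

preds = ["B", "C", "A"]
# Closed leaderboard split: targets are withheld, so every reference is "".
print(exact_match(preds, ["", "", ""]))     # 0.0 -- expected behaviour, not a bug
# Split with provided targets: a normal score comes out.
print(exact_match(preds, ["B", "C", "D"]))  # 0.666...
```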
Does this mean the dataset published on HF is now considered to be trainscore?
As you may see, the dataset at HF has several splits. The split with no targets is used for official benchmarking through the MERA website. What has changed now, inside this branch, is just that we added tasks which use the splits with provided targets; they are not supposed to be used for the leaderboard with its closed test set. You can see that nothing has changed at HF, and the dataset contents can be viewed there too (screenshots of the multiq dataset). And here one can see how adding …
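For anyone who wants to verify this themselves, a sketch of inspecting the splits with the datasets library; the dataset path and config name below are assumptions and may need adjusting to the actual HF repository layout:

```python
# Sketch (hypothetical dataset path/config; the field name "outputs" follows
# the MERA schema discussed in this thread): list splits and whether they
# actually ship targets.
from datasets import load_dataset

ds = load_dataset("ai-forever/MERA", "multiq")  # hypothetical path/config
for split_name, split in ds.items():
    sample = split[0]
    # The closed leaderboard split ships the answer field as an empty string.
    has_targets = bool(str(sample.get("outputs", "")).strip())
    print(f"{split_name}: {len(split)} rows, targets present: {has_targets}")
```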
Thank you!
Thank you for your responses! I would like one more clarification: what is the difference between gen and non-gen tasks in the context of metrics, and what is the purpose of having both? For example: bps_gen_trainscore vs. bps_trainscore.
There is also room for a discussion of whether originally multiple_choice tasks may be "converted" to _gen variants without changing the name of the task, since the scoring methodology changes.
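For context, a schematic contrast (not MERA's actual implementation; model_loglikelihood and model_generate are placeholder hooks) of the two scoring styles usually meant by non-gen (loglikelihood multiple choice) and _gen (free-form generation) tasks:

```python
# Schematic only: how the two task flavours typically score an example.
def score_multiple_choice(model_loglikelihood, prompt, options, gold_idx):
    # Non-gen style: rank the fixed answer options by their log-probability
    # under the model and check whether the gold option ranks first.
    scores = [model_loglikelihood(prompt, opt) for opt in options]
    best = max(range(len(options)), key=lambda i: scores[i])
    return int(best == gold_idx)

def score_generative(model_generate, prompt, gold_answer):
    # _gen style: let the model produce free text and compare it to the
    # target, e.g. by exact match after light normalisation.
    prediction = model_generate(prompt).strip().lower()
    return int(prediction == gold_answer.strip().lower())
```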
It's great to have both of them for users, but which one will be used for the leaderboard on the website after moving to version 0.4.0? The difference in results can be huge, especially on 0-shot tasks. At the very least, it would help to get somewhat closer to the private metrics before submitting, simply by using trainscore.
This starts discussions in different communities. On one hand …
I would like to propose a solution: use the generative ones, but regenerate on the same prompt several times. At the same time, it is important to shuffle the position of the correct answer, because some models are biased towards certain options; for example, I saw that gpt-3.5-turbo prefers option C (see the paper). Yes, this approach has the disadvantage that regenerating n times will slow evaluation down n times, but it will bring the metric values closer to a more objective, robust value.
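A sketch of this proposal (illustrative only; model_generate is a placeholder hook, not part of MERA): regenerate several times, permuting the answer options on each run so that positional bias averages out.

```python
# Illustrative sketch of "regenerate n times with shuffled option order".
import random

def evaluate_with_shuffles(model_generate, question, options, gold_text,
                           n_runs=4, seed=0):
    rng = random.Random(seed)
    letters = [chr(ord("A") + i) for i in range(len(options))]
    correct = 0
    for _ in range(n_runs):
        order = list(range(len(options)))
        rng.shuffle(order)  # new option order every run
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {options[j]}" for i, j in enumerate(order)
        ) + "\nAnswer:"
        # Letter under which the gold option appears in this permutation.
        gold_letter = letters[order.index(options.index(gold_text))]
        prediction = model_generate(prompt).strip().upper()[:1]
        correct += int(prediction == gold_letter)
    return correct / n_runs  # accuracy averaged over permutations
```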
Thank you for your proposal.
Greedy generation is the default, so one would need to change the seed on every run. Should it be the same seeds (declared publicly in the code) across all tasks and across all models?
It would also need some dynamic dataset creation/modification. Would you like to propose some sort of PR with an example solution?
Some combinations of models and hardware take more than 24 hours to run on the full MERA suite, so "several times" more compute power will be needed (and "time is money"). If, for some reason (hardware at least, or maybe an OOM on a shared compute resource, or bugs in the code), one generation fails in the middle of a task, the whole run has to be done again. As an addition to this idea, to save some running time, batch_size > 1 may be used when several seeds are involved, at least for tasks and models that do not need a big context window. In manual mode, this concept can already be fulfilled with the available means. Once several MERA-based papers are available (which would create a trend in this way of using MERA), this proposal may find its way to becoming the "standard" way to get the leaderboard scores.
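A sketch of the "fixed, publicly declared seeds" idea (a hypothetical driver, not part of the MERA codebase): run the same task once per seed and report the mean and spread, so single-run noise becomes visible at the cost of several full runs.

```python
# Hypothetical driver: aggregate one task's score over publicly fixed seeds.
import statistics

PUBLIC_SEEDS = [1234, 2345, 3456]  # would need to be fixed in code for comparability

def run_with_seeds(run_task, task_name, seeds=PUBLIC_SEEDS):
    """run_task(task_name, seed) -> float is a placeholder for one evaluation run."""
    scores = [run_task(task_name, seed) for seed in seeds]
    return {
        "task": task_name,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "per_seed": dict(zip(seeds, scores)),
    }
```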