Context
Benchmarks created on Polaris can contain any number of tasks and any number of test sets. This gives rise to four classes of benchmarks that Polaris must support when evaluating user predictions. These classes are as follows:
Single task, single set
Multi-task, single set
Single task, multi-set
Multi-task, multi-set
Currently, the evaluation methods in the Polaris library expect a different structure for the submitted predictions depending on the benchmark's category. The expectations break down as follows (see the sketch after the list):
Single task, single set -> [values]
Multi-task, single set -> {task_name_1: [values], task_name_2: [values], ...}
Single task, multi-set and multi-task, multi-set -> {test_set_1: {task_name: [values], ...}, ...}
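For concreteness, here is a minimal sketch of those three structures in Python. The test-set names, task names, and values are made up for illustration and are not tied to any real benchmark:

```python
# Single task, single set: a bare list of predicted values.
preds_single_task_single_set = [0.12, 0.87, 0.45]

# Multi-task, single set: a dict keyed by task name.
preds_multi_task_single_set = {
    "logP": [0.12, 0.87, 0.45],
    "solubility": [1.3, 0.2, 0.9],
}

# Single task, multi-set and multi-task, multi-set: a dict keyed by
# test-set name, whose values are dicts keyed by task name.
preds_multi_set = {
    "test_iid": {"logP": [0.12, 0.87, 0.45]},
    "test_ood": {"logP": [0.33, 0.51, 0.08]},
}
```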
Description
This configuration is not only confusing but also unnecessary. We should standardize the expected prediction structure so that it is the same irrespective of the benchmark type. Across all benchmark types, predictions should take the following form:
{test_name: {task_name: [values], ...}, ...}
This is the structure that is currently supported for single task, multi-set and multi-task, multi-set benchmarks.
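For illustration, here is what predictions would look like under the unified structure, again with made-up names and values. Note that even the single task, single set case now nests its values under a test-set name and a task name:

```python
# Single task, single set under the unified structure.
preds_unified_simple = {
    "test": {"my_task": [0.12, 0.87, 0.45]},
}

# Multi-task, multi-set under the same structure, just with more keys.
preds_unified_full = {
    "test_iid": {"logP": [0.12, 0.87], "solubility": [1.3, 0.2]},
    "test_ood": {"logP": [0.33, 0.51], "solubility": [0.7, 1.1]},
}
```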
Acceptance Criteria
Evaluation methods are updated to expect the aforementioned single structure regardless of the benchmark type (see the sketch after this list)
Benchmark tests are updated and pass with the new prediction structure
The docstring for the evaluate method in the BenchmarkSpecification class is updated accordingly
Evaluation methods continue to produce reliable and accurate results after the above changes are applied
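As a rough illustration of the first criterion, the sketch below shows how an evaluation loop can treat every benchmark identically once predictions always arrive in the unified structure. This is a hypothetical, simplified function, not the actual evaluate method on BenchmarkSpecification; metric configuration, input validation, and result aggregation are omitted.

```python
from typing import Callable, Dict, List

def evaluate_predictions(
    predictions: Dict[str, Dict[str, List[float]]],
    targets: Dict[str, Dict[str, List[float]]],
    metric_fn: Callable[[List[float], List[float]], float],
) -> Dict[str, Dict[str, float]]:
    """Score predictions given in the unified {test_set: {task: [values]}} form."""
    scores: Dict[str, Dict[str, float]] = {}
    # The same two nested loops cover single/multi-task and single/multi-set
    # benchmarks, since every case now uses the same nesting.
    for test_set_name, task_predictions in predictions.items():
        scores[test_set_name] = {}
        for task_name, y_pred in task_predictions.items():
            y_true = targets[test_set_name][task_name]
            scores[test_set_name][task_name] = metric_fn(y_true, y_pred)
    return scores
```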