Context
Benchmarks created on Polaris can contain any number of tasks and any number of test sets. This gives rise to four classes of benchmarks that Polaris must support when evaluating user predictions. These classes are as follows:
Single task, single set
Multi-task, single set
Single task, multi-set
Multi-task, multi-set
Currently, the evaluation methods in the Polaris library expect a different structure for the submitted predictions depending on the benchmark's category. The expectations break down as follows (see the sketch after the list):
Single task, single set -> [values]
Multi-task, single set -> {task_name_1: [values], task_name_2: [values], ...}
Single task, multi-set and multi-task, multi-set -> {test_set_1: {task_name: [values], ...}, ...}
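For concreteness, here is a minimal sketch of those three structures in Python. The test-set names, task names, and values are made up for illustration and are not tied to any real benchmark:

```python
# Single task, single set: a bare list of predicted values.
preds_single_task_single_set = [0.12, 0.87, 0.45]

# Multi-task, single set: a dict keyed by task name.
preds_multi_task_single_set = {
    "logP": [0.12, 0.87, 0.45],
    "solubility": [1.3, 0.2, 0.9],
}

# Single task, multi-set and multi-task, multi-set: a dict keyed by
# test-set name, whose values are dicts keyed by task name.
preds_multi_set = {
    "test_iid": {"logP": [0.12, 0.87, 0.45]},
    "test_ood": {"logP": [0.33, 0.51, 0.08]},
}
```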
Description
This configuration is not only confusing but also unnecessary. We should standardize the expected prediction structure so that it is the same irrespective of the benchmark type. Across all benchmark types, predictions should take the following form:
{test_name: {task_name: [values], ...}, ...}
This is the structure that is currently supported for single task, multi-set and multi-task, multi-set benchmarks.
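For illustration, here is what predictions would look like under the unified structure, again with made-up names and values. Note that even the single task, single set case now nests its values under a test-set name and a task name:

```python
# Single task, single set under the unified structure.
preds_unified_simple = {
    "test": {"my_task": [0.12, 0.87, 0.45]},
}

# Multi-task, multi-set under the same structure, just with more keys.
preds_unified_full = {
    "test_iid": {"logP": [0.12, 0.87], "solubility": [1.3, 0.2]},
    "test_ood": {"logP": [0.33, 0.51], "solubility": [0.7, 1.1]},
}
```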
Acceptance Criteria
Evaluation methods are updated to expect the aforementioned single structure regardless of the benchmark type (see the sketch after this list)
Benchmark tests are updated and pass with the new prediction structure
The docstring for the evaluate method in the BenchmarkSpecification class is updated accordingly
Evaluation methods continue to produce reliable and accurate results after the above changes are applied
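As a rough illustration of the first criterion, the sketch below shows how an evaluation loop can treat every benchmark identically once predictions always arrive in the unified structure. This is a hypothetical, simplified function, not the actual evaluate method on BenchmarkSpecification; metric configuration, input validation, and result aggregation are omitted.

```python
from typing import Callable, Dict, List

def evaluate_predictions(
    predictions: Dict[str, Dict[str, List[float]]],
    targets: Dict[str, Dict[str, List[float]]],
    metric_fn: Callable[[List[float], List[float]], float],
) -> Dict[str, Dict[str, float]]:
    """Score predictions given in the unified {test_set: {task: [values]}} form."""
    scores: Dict[str, Dict[str, float]] = {}
    # The same two nested loops cover single/multi-task and single/multi-set
    # benchmarks, since every case now uses the same nesting.
    for test_set_name, task_predictions in predictions.items():
        scores[test_set_name] = {}
        for task_name, y_pred in task_predictions.items():
            y_true = targets[test_set_name][task_name]
            scores[test_set_name][task_name] = metric_fn(y_true, y_pred)
    return scores
```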