Standardize the expected type of predictions despite differences in the number of tasks and test sets in a benchmark/competition #178

Open
Andrewq11 opened this issue Aug 20, 2024 · 1 comment · May be fixed by #187
@Andrewq11 (Contributor)

Context

Benchmarks created on Polaris can contain any number of tasks and any number of test sets. This gives rise to 4 classes of benchmarks which Polaris must support during the evaluation of user predictions. These classes are as follows:

  • Single task, single set
  • Multi-task, single set
  • Single task, multi-set
  • Multi-task, multi-set

Currently, evaluation methods in the Polaris library expect a different structure for the submitted predictions depending on the category of a benchmark. The following is a breakdown of those expectations, made concrete in the sketch after this list:

  • Single task, single set -> [values]
  • Multi-task, single set -> {task_name_1: [values], task_name_2: [values], ...}
  • Single task, multi-set and multi-task, multi-set -> {test_set_1: {task_name: [values], ...}, ...}
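
To make the differences concrete, the sketch below writes out each legacy shape as a plain Python literal. The task and test set names (`LOG_SOLUBILITY`, `LOG_VDSS`, `test_1`, `test_2`) are hypothetical, not drawn from a real Polaris benchmark.

```python
# Single task, single set: a bare array of values.
preds_single_task_single_set = [0.1, 0.5, 0.9]

# Multi-task, single set: a dict keyed by task name.
preds_multi_task_single_set = {
    "LOG_SOLUBILITY": [0.1, 0.5, 0.9],
    "LOG_VDSS": [1.2, 0.4, 0.7],
}

# Single task, multi-set (and multi-task, multi-set): a dict keyed by
# test set name, whose values are dicts keyed by task name.
preds_multi_set = {
    "test_1": {"LOG_SOLUBILITY": [0.1, 0.5, 0.9]},
    "test_2": {"LOG_SOLUBILITY": [0.3, 0.8]},
}
```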

Description

This configuration is not only confusing but also unnecessary. We should standardize the expected structure of predictions to be the same irrespective of the benchmark type. The following should be the expected structure of the predictions across all benchmark types:

  • {test_set_name: {task_name: [values], ...}, ...}

This is the structure that is currently supported for single task, multi-set and multi-task, multi-set benchmarks.
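
Concretely, under this proposal even the simplest benchmark class would use the fully nested form. A minimal sketch, again with hypothetical names:

```python
# Single task, single set benchmark under the proposed standard:
# one outer key per test set, one inner key per task.
predictions = {
    "test": {
        "LOG_SOLUBILITY": [0.1, 0.5, 0.9],
    },
}
```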

Acceptance Criteria

  • Evaluation methods are updated to expect the aforementioned single structure regardless of the benchmark type
  • Benchmark tests are updated and pass with the new prediction structure
  • The docstring for the evaluate method in the BenchmarkSpecification class is updated accordingly
  • Evaluation methods continue to produce reliable and accurate results after the above changes are applied
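
One way to satisfy these criteria without breaking existing submissions overnight would be a small normalization shim at the entry point of the evaluation methods. The helper below is a hypothetical sketch, not part of the Polaris API; it assumes test set names and task names are disjoint so the two legacy dict shapes can be told apart.

```python
from collections.abc import Mapping


def normalize_predictions(predictions, test_set_names, task_names):
    """Coerce any legacy prediction shape into the standard
    {test_set_name: {task_name: [values]}} nesting.

    Hypothetical sketch, not part of the Polaris API.
    """
    # Bare array of values: single task, single set.
    if not isinstance(predictions, Mapping):
        return {test_set_names[0]: {task_names[0]: predictions}}

    # Dict keyed by task name: multi-task, single set.
    if set(predictions) <= set(task_names):
        return {test_set_names[0]: dict(predictions)}

    # Already in the standard nested form.
    return {name: dict(tasks) for name, tasks in predictions.items()}


# Example: a bare list is wrapped into the nested form.
assert normalize_predictions([0.1, 0.5], ["test"], ["LOG_SOLUBILITY"]) == {
    "test": {"LOG_SOLUBILITY": [0.1, 0.5]}
}
```

Evaluation could call such a shim on entry and, during a deprecation window, emit a warning whenever a legacy shape is detected.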
@cwognum (Collaborator) commented Aug 20, 2024

Seems to overlap (at least partially) with #169

@kirahowe kirahowe self-assigned this Aug 20, 2024
@kirahowe kirahowe linked a pull request Sep 1, 2024 that will close this issue