Add new benchmark MAIR #1425

Open
wants to merge 11 commits into
base: main

@sunnweiwei sunnweiwei commented Nov 10, 2024

Fixes #1426

  • Added MAIR (https://arxiv.org/abs/2410.10127, EMNLP 2024), a diverse benchmark for instructed IR.
  • The data class is defined in mteb/tasks/MAIR/eng/MAIR.py, generating 126 data classes for the 126 tasks in MAIR on the fly.
  • In benchmarks/benchmarks.py, the benchmark configuration has been added.
  • Tested several models, and the results are consistent with those of the original repo: https://github.com/sunnweiwei/mair.
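Generating one class per task "on the fly" can be sketched roughly as follows. This is a minimal illustration, not the code in mteb/tasks/MAIR/eng/MAIR.py: `TaskInfo`, `BaseRetrievalTask`, `make_task_class`, the task names, and the repo id are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    """Minimal stand-in for a task's metadata (hypothetical, not MTEB's metadata class)."""
    name: str
    dataset_path: str


class BaseRetrievalTask:
    """Hypothetical base class; the PR subclasses an MTEB retrieval task instead."""
    info: TaskInfo

    def describe(self) -> str:
        return f"{self.info.name} <- {self.info.dataset_path}"


def make_task_class(task_name: str, repo: str) -> type:
    """Create one subclass per task with type(), instead of writing each class by hand."""
    info = TaskInfo(name=f"MAIR-{task_name}", dataset_path=f"{repo}/{task_name}")
    return type(f"MAIR_{task_name}", (BaseRetrievalTask,), {"info": info})


# Placeholder names; the real benchmark enumerates 126 tasks.
TASK_NAMES = ["TaskA", "TaskB", "TaskC"]
TASK_CLASSES = {
    name: make_task_class(name, "example-org/example-repo") for name in TASK_NAMES
}

for cls in TASK_CLASSES.values():
    print(cls().describe())
```

The trade-off of this pattern is that the generated classes do not appear literally in the source file, so anything that discovers tasks by scanning the module needs them registered explicitly (here, via the `TASK_CLASSES` dictionary).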

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

The added data comes from https://arxiv.org/abs/2410.10127, which introduces a benchmark for instructable information retrieval. It contains 126 real-world retrieval tasks across 6 domains, with manually annotated instructions. The data has been sampled to reduce evaluation costs.

  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@shizhl-code

Following the above process, I am currently opening a pull request in https://github.com/embeddings-benchmark/results to submit our evaluation results for the newly added MAIR benchmark.
However, I am not very clear about the format of the result file.

@Samoed
Collaborator

Samoed commented Nov 10, 2024

When you run your tasks, MTEB generates a folder with the results of your runs, and you can submit that folder.

return
self.corpus, self.queries, self.relevant_docs = {}, {}, {}
queries_path = self.metadata_dict["dataset"]["path"]
docs_path = self.metadata_dict["dataset"]["path"].replace("-Queries", "-Docs")
Collaborator

Can you place queries and docs in the same repo?

Author

@sunnweiwei sunnweiwei Nov 10, 2024

Thanks for the feedback. To keep Q/D in one repo, I could create a separate repo for each task.

But is that necessary? I think having two repos, one for Q and one for D, would be easier to manage than having over a hundred repos, one per task.

Collaborator

You can create different splits for queries and documents in the same repo.
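To make the two layouts concrete, here is a rough sketch of how the dataset locations could be resolved in each case. It is only an illustration: `Source`, `two_repo_sources`, `one_repo_sources`, the repo ids, and the split names "queries" and "docs" are assumptions, not names taken from the PR.

```python
from typing import Dict, NamedTuple


class Source(NamedTuple):
    path: str   # dataset repository id
    name: str   # configuration (subset) name, assumed to be one per task
    split: str  # split name inside that configuration (assumed)


def two_repo_sources(queries_path: str, task: str) -> Dict[str, Source]:
    # Mirrors the reviewed snippet: the docs repo id is derived from the queries repo id.
    docs_path = queries_path.replace("-Queries", "-Docs")
    return {
        "queries": Source(queries_path, task, "queries"),
        "docs": Source(docs_path, task, "docs"),
    }


def one_repo_sources(repo_path: str, task: str) -> Dict[str, Source]:
    # Suggested alternative: a single repo, with queries and docs as two splits.
    return {
        "queries": Source(repo_path, task, "queries"),
        "docs": Source(repo_path, task, "docs"),
    }


print(two_repo_sources("example-org/Example-Queries", "TaskA"))
print(one_repo_sources("example-org/Example", "TaskA"))
```

Each `Source` lines up with the `path`, `name`, and `split` arguments of `datasets.load_dataset`, so the single-repo variant removes the string replacement on the repo name without changing how the data is loaded.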

Author

Thanks, I see.

One issue is that the data in MAIR has a two-level structure, task → subtasks, since some tasks contain multiple subtasks (e.g., IFEval, SWE-Bench). It’s tricky to maintain this structure in a single repository without flattening it.

Also, I think people may not want to download all the data if they’re only interested in evaluating a few specific tasks. So, if we still need to put Q and D in a single repo, the best way might be to create 126 separate repos (one per task).

Or do you have any other suggestions?
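One conceivable way to keep the task → subtasks structure inside a single repository is to encode both levels in the configuration name, so that only the needed configurations are downloaded. The sketch below is not something proposed in this thread: the `__` separator, the helper names, and the subtask names are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

SEPARATOR = "__"  # assumed separator; must not occur in task or subtask names


def to_config_name(task: str, subtask: Optional[str] = None) -> str:
    """Flatten (task, subtask) into a single configuration name."""
    for part in (task, subtask or ""):
        if SEPARATOR in part:
            raise ValueError(f"{part!r} must not contain {SEPARATOR!r}")
    return task if subtask is None else f"{task}{SEPARATOR}{subtask}"


def from_config_name(config: str) -> Tuple[str, Optional[str]]:
    """Recover (task, subtask) from a flattened configuration name."""
    task, sep, subtask = config.partition(SEPARATOR)
    return task, (subtask if sep else None)


def configs_for(task: str, all_configs: List[str]) -> List[str]:
    """Select only the configurations that belong to one top-level task."""
    return [c for c in all_configs if from_config_name(c)[0] == task]


# Placeholder subtask names; the real subtasks are not listed in this thread.
configs = [
    to_config_name("IFEval", "sub1"),
    to_config_name("IFEval", "sub2"),
    to_config_name("SingleTask"),
]
print(configs_for("IFEval", configs))
```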

@shizhl-code

Another question: our benchmark has two settings, i.e., evaluating the model with and without instructions. Should I store the results with instructions and without instructions in two separate files?
I would appreciate any feedback!

@Samoed
Collaborator

Samoed commented Nov 10, 2024

If you have results for both instruct and non-instruct, it might be better to create separate tasks, though @orionw might have a clearer perspective on this.

@orionw
Contributor

orionw commented Nov 10, 2024

+1 to adding a duplicate task if you have a specific instruction you want them to use for each. Otherwise models can define their own instructions and in that case you could just submit results to the same task but with a different prompt in the meta info.

If you’re adding an instruction variant (and once #1359 is in), you’d just need to add a version of those tasks with all the same attributes plus a config/attribute called “self.instruction” in (query-id -> instruction_text) format.
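As a rough illustration of the (query-id -> instruction_text) mapping, the sketch below pairs each query with its instruction. How MTEB actually consumes “self.instruction” after #1359 is not shown in this thread; `build_instructed_queries`, the template, and the sample data are hypothetical.

```python
from typing import Dict


def build_instructed_queries(
    queries: Dict[str, str],
    instructions: Dict[str, str],
    template: str = "{instruction} {query}",
) -> Dict[str, str]:
    """Combine each query with its instruction; queries without one are left unchanged."""
    combined = {}
    for query_id, text in queries.items():
        instruction = instructions.get(query_id)
        if instruction:
            combined[query_id] = template.format(instruction=instruction, query=text)
        else:
            combined[query_id] = text
    return combined


# Made-up example data, keyed by query id.
queries = {"q1": "fix the failing parser test", "q2": "papers on dense retrieval"}
instructions = {"q1": "Retrieve the source file that needs to be edited."}

print(build_instructed_queries(queries, instructions))
```

Passing an empty `instructions` dictionary yields the no-instruction setting with the same queries, which is one way the two variants of a task could share everything except this mapping.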

@sunnweiwei
Author

Hi. If we have duplicate tasks with different instructions, will they appear in separate tables on the leaderboard? For example, would there be one called (XXX with instruction) and another called (XXX without instruction)?

@orionw
Contributor

orionw commented Nov 10, 2024

@sunnweiwei they would appear as different datasets, yes. So you could have one leaderboard with them and one without, if desired, or put them all together in one benchmark.

Does that answer the question? Or do you mean more than one instruction per dataset?

@sunnweiwei
Author

Thanks for the answer! I was thinking of putting them into one table for benchmarking purposes, maybe adding a column to indicate whether instructions were used. Then people could compare models with and without instructions in the same table. Good to know we can do this.

@Muennighoff
Contributor

Would be great to get this in @sunnweiwei in case you're still working on it; I think it'll be very useful to the community!
