Required data format for corpus.jsonl, queries.jsonl, test.tsv #584
-
Hello, I want to use mteb to evaluate my local embedding model, which was trained on local data. In addition, I want to use a self-created dataset for the evaluation. When I run my code I get the following exception:
I wonder whether this might be due to the formatting of my corpus.jsonl, queries.jsonl and test.tsv files. Unfortunately, I wasn't able to find a definition of how those files have to be formatted. In the metadata the dataset path is specified as ./data/embedding-finetuning. I tried to follow these instructions: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md Thank you so much! Here is my entire script:
from mteb import MTEB
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval
from sentence_transformers import SentenceTransformer
from mteb.abstasks.TaskMetadata import TaskMetadata
class LocalDataRetrieval(AbsTaskRetrieval):
    metadata = TaskMetadata(
        name="LocalDataR",
        description="Description of the task",  # Provide a descriptive string here
        reference=None,
        type="Retrieval",
        category="s2s",  # Specify the category here
        eval_splits=["test"],
        eval_langs=["deu-Latn"],
        main_score="map",
        dataset={"path": "./data/embedding-finetuning", "revision": None},
        date=("2012-01-01", "2020-01-01"),  # Provide start and end dates here
        form=None,
        domains=None,
        task_subtypes=None,
        license=None,
        socioeconomic_status=None,
        annotations_creators=None,
        dialect=None,
        text_creation=None,
        n_samples={"test": 19599},
        avg_character_length={"test": 69.0},
        bibtex_citation=None,
    )
# testing the task with a model:
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
evaluation = MTEB(tasks=["LocalDataR"])
evaluation.run(model)  # add output folder
Replies: 1 comment
-
Hello,
The issue is that a new task created by inheriting from an AbsTask uses the default `load_data()` function, whose default behavior expects a HF Dataset as input (see the default behavior here). All you need to do is override `load_data()` with your own methods that open the JSON and TSV files and push them into a dictionary; you can find examples of this overriding in some Retrieval tasks.
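As a rough sketch of what such an override could look like, assuming BEIR-style files (corpus.jsonl lines with `_id`/`title`/`text`, queries.jsonl lines with `_id`/`text`, and test.tsv rows of `query-id <TAB> corpus-id <TAB> score` with a header row). The attribute names `self.corpus`, `self.queries`, and `self.relevant_docs`, and the `data_dir` parameter, are assumptions here; check them against the retrieval tasks in your installed mteb version:

```python
import csv
import json
import os


def load_data(self, data_dir="./data/embedding-finetuning", **kwargs):
    """Sketch of a load_data() override for local BEIR-style files."""
    if self.data_loaded:
        return

    corpus, queries, qrels = {}, {}, {}

    # corpus.jsonl: one JSON object per line, e.g. {"_id": ..., "title": ..., "text": ...}
    with open(os.path.join(data_dir, "corpus.jsonl"), encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            corpus[doc["_id"]] = {"title": doc.get("title", ""), "text": doc["text"]}

    # queries.jsonl: one JSON object per line, e.g. {"_id": ..., "text": ...}
    with open(os.path.join(data_dir, "queries.jsonl"), encoding="utf-8") as f:
        for line in f:
            query = json.loads(line)
            queries[query["_id"]] = query["text"]

    # test.tsv: tab-separated qrels, "query-id <TAB> corpus-id <TAB> score"
    with open(os.path.join(data_dir, "test.tsv"), encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the "query-id  corpus-id  score" header row
        for qid, did, score in reader:
            qrels.setdefault(qid, {})[did] = int(score)

    # One entry per eval split, matching eval_splits=["test"] in the metadata.
    self.corpus = {"test": corpus}
    self.queries = {"test": queries}
    self.relevant_docs = {"test": qrels}
    self.data_loaded = True
```

Attach this as a method on your `LocalDataRetrieval` class and adapt the file names and key names to whatever your files actually contain.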