BasqueBench is a benchmark for evaluating language models on tasks in Basque. That is, it evaluates a language model's ability to understand and generate Basque text. BasqueBench combines pre-existing, open datasets with datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|---|---|---|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| PIQA_eu | Question Answering | https://huggingface.co/datasets/HiTZ/PIQA-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
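As a quick sanity check, the new datasets can be inspected directly from the Hugging Face Hub. The snippet below is a minimal sketch using the `datasets` library; split names, field names, and whether a configuration name is required may differ from one dataset to another.

```python
from datasets import load_dataset

# Load one of the new BasqueBench datasets from the Hugging Face Hub.
# Repository ids come from the table above; splits and fields may vary per dataset.
mgsm_eu = load_dataset("HiTZ/MGSM-eu")

print(mgsm_eu)                  # available splits and example counts
first_split = next(iter(mgsm_eu))
print(mgsm_eu[first_split][0])  # peek at a single example
```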
The remaining datasets included in BasqueBench have been made public in previous publications.
Paper for BasqueBench coming soon.
The benchmark defines the following task groups:

basque_bench
: All tasks included in BasqueBench.

flores_eu
: All FLORES translation tasks from or to Basque.
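Either group name can be passed to the harness like any other task. The snippet below is a minimal sketch using the lm-evaluation-harness Python API; the model name is a placeholder, and arguments such as batch size or device should be adapted to your setup (exact arguments may vary by harness version).

```python
import lm_eval

# Evaluate a Hugging Face model on the full BasqueBench group.
# "my-org/my-basque-model" is a placeholder checkpoint name.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=my-org/my-basque-model",
    tasks=["basque_bench"],   # or ["flores_eu"] for the translation subset only
    batch_size=8,
)

print(results["results"])     # per-task metrics
```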
The following tasks evaluate models on the BasqueBench datasets using various scoring methods; a short sketch for listing them programmatically is given after the list.
- belebele_eus_Latn
- eus_exams_eu
- eus_proficiency
- eus_reading
- eus_trivia
- flores_eu
  - flores_eu-ca
  - flores_eu-de
  - flores_eu-en
  - flores_eu-es
  - flores_eu-fr
  - flores_eu-gl
  - flores_eu-it
  - flores_eu-pt
  - flores_ca-eu
  - flores_de-eu
  - flores_en-eu
  - flores_es-eu
  - flores_fr-eu
  - flores_gl-eu
  - flores_it-eu
  - flores_pt-eu
- mgsm_direct_eu
- mgsm_native_cot_eu
- piqa_eu
- qnlieu
- wnli_eu
- xcopa_eu
- xnli_eu
- xnli_eu_native
- xstorycloze_eu
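The registered task names can also be enumerated programmatically, which is handy for selecting subsets such as the FLORES translation directions involving Basque. The snippet below is a sketch assuming the harness's `TaskManager` API; the exact attribute for listing task names may vary across versions.

```python
from lm_eval.tasks import TaskManager

task_manager = TaskManager()

# Select the FLORES translation directions involving Basque from the
# harness's registered task names.
flores_eu_tasks = sorted(
    name
    for name in task_manager.all_tasks
    if name.startswith("flores_") and "eu" in name.split("_", 1)[1].split("-")
)
print(flores_eu_tasks)
```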
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
belebele_eus_Latn
: Belebele Basque

qnlieu
: From BasqueGLUE
For adding novel benchmarks/datasets to the library:

- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation?
    - Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?