SpanishBench is a benchmark for evaluating language models on Spanish tasks; that is, it assesses a model's ability to understand and generate Spanish text. SpanishBench combines pre-existing, open datasets. All the details of SpanishBench will be published in an upcoming paper.
The new evaluation datasets included in SpanishBench are:
Task | Category | Homepage |
---|---|---|
COPA-es | Commonsense Reasoning | https://huggingface.co/datasets/BSC-LT/COPA-es |
OpenBookQA_es | Question Answering | https://huggingface.co/datasets/BSC-LT/openbookqa-es |
The datasets included in SpanishBench that have been made public in previous publications are listed in the task overview below.

Paper for SpanishBench coming soon.
SpanishBench defines the following task groups:

- spanish_bench: All tasks included in SpanishBench.
- flores_es: All FLORES translation tasks from or to Spanish.
- phrases_es: Two Phrases_va tasks for language adaptation between Spanish and Valencian.
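As an illustration, the whole benchmark (or any of these groups) can be run through the lm-evaluation-harness Python API. The sketch below assumes lm_eval is installed; the model name is only a placeholder, and argument names may differ slightly between harness versions.

```python
# Minimal sketch: run the spanish_bench group with the lm-evaluation-harness
# Python API. The model and batch size below are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=BSC-LT/salamandra-2b",  # placeholder; any causal LM works
    tasks=["spanish_bench"],                       # or ["flores_es"], ["phrases_es"]
    batch_size=8,
)

# Per-task scores are keyed by task name in results["results"].
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```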
The following tasks evaluate the SpanishBench datasets using various scoring methods.
- belebele_spa_Latn
- copa_es
- escola
- flores_es
  - flores_es-ca
  - flores_es-de
  - flores_es-en
  - flores_es-eu
  - flores_es-fr
  - flores_es-gl
  - flores_es-it
  - flores_es-pt
  - flores_ca-es
  - flores_de-es
  - flores_en-es
  - flores_eu-es
  - flores_fr-es
  - flores_gl-es
  - flores_it-es
  - flores_pt-es
- mgsm_direct_es_spanish_bench (the spanish_bench suffix is due to an existing open issue in the original task)
- openbookqa_es
- paws_es_spanish_bench (the spanish_bench suffix is due to an existing open issue in the original task)
- phrases_es
- wnli_es
- xlsum_es
- xnli_es_spanish_bench (the spanish_bench suffix is due to an existing open issue in the original task)
- xquad_es
- xstorycloze_es
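To check which of these task names are registered in a local installation (for example, the FLORES translation directions), one option is to query the harness task registry. This is a sketch that assumes the TaskManager API available in recent lm-eval versions; attribute names may differ across releases.

```python
# Sketch: list the registered task names relevant to SpanishBench,
# e.g. to confirm which flores_* translation directions are available locally.
from lm_eval.tasks import TaskManager

tm = TaskManager()
flores_directions = [
    t for t in tm.all_tasks
    if t.startswith("flores_es") or t.endswith("-es")
]
print("\n".join(sorted(flores_directions)))
```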
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- belebele_spa_Latn: Belebele Spanish
- mgsm_direct_es: MGSM Spanish (fixed an existing open issue in the original task)
- paws_es: PAWS-X Spanish (fixed an existing open issue in the original task)
- xnli_es: XNLI Spanish (fixed an existing open issue in the original task)
- xstorycloze_es: XStoryCloze Spanish
For adding novel benchmarks/datasets to the library:

- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation?
    - Yes, the original implementation was contributed by the author of the benchmark.
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?