AfricaNLP-Public-Datasets

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

Datasets per task (Randomly ordered)

Machine Translation

JW300: A parallel text dataset of 417 languages, including 101 African languages.
TANZIL: Translations of the Quran into 45 languages, including African languages such as Amharic, Hausa, Somali, and Swahili.
MENYO-20k: A Yorùbá-English multi-domain parallel text dataset.
FFR: A Fon-French parallel text dataset.
Hausa Corpus: A Hausa-English parallel text dataset.
CCAligned: A parallel text dataset for English and 137 languages, including 30 African languages.
ParaCrawl: A parallel text dataset for 41 languages, including Somali and Swahili.
WikiMatrix: A parallel text dataset for 85 languages, including Swahili, Malagasy, and Egyptian Arabic.
Ethiopian MT datasets: A parallel text dataset for English paired with 7 Ethiopian languages.
English-Luganda: An English-Luganda parallel text dataset.
French-Fon and French-Ewe: A parallel text dataset for French paired with Fon and Ewe.
Amharic-English: An Amharic-English parallel text dataset.
Tigrinya-English: A Tigrinya-English parallel text dataset (Free registration required).
Lingala-French: A Lingala-French parallel text dataset (Free registration required).
Congolese Swahili-French (Min, Small, Medium): Congolese Swahili-French parallel text datasets (Free registration required).
Swahili-French: A synthetic Swahili-French parallel text dataset (Free registration required).
English-Hausa (Min, Small): English-Hausa parallel text datasets (Free registration required).
English-Swahili: An English-Swahili parallel text dataset (Free registration required).
English-Kanuri: An English-Kanuri parallel text dataset (Free registration required).
English-Akuapem Twi: An English-Akuapem Twi parallel text dataset.
FLORES-101: A parallel text dataset for 101 languages, including 18 African languages.
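
Many parallel text datasets are distributed as two line-aligned files, one sentence per line, where line N of the source file translates line N of the target file. The sketch below shows one way to pair such lines in Python. It assumes that layout, which is not confirmed for every dataset listed above; the `read_parallel` helper and the sample sentences are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_parallel(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sequences.

    Pairs where either side is empty after stripping whitespace are skipped.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


# Hypothetical in-memory sample standing in for two aligned files
# (for example, an English file and a Swahili file opened with encoding="utf-8").
english = ["Hello.\n", "\n", "Thank you.\n"]
swahili = ["Habari.\n", "\n", "Asante.\n"]

pairs = list(read_parallel(english, swahili))
print(pairs)
```

Note that `zip` stops at the shorter input, so a length mismatch between the two files is silently truncated. Comparing line counts before pairing is a reasonable sanity check for a real corpus.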

Text Classification

KINNEWS and KIRNEWS: News classification datasets for Kinyarwanda (KINNEWS) and Kirundi (KIRNEWS).
Setswana and Sepedi: News classification datasets for Setswana and Sepedi.
Swahili News: A news classification dataset for Swahili.
Amharic News Text classification: A news text classification dataset for Amharic.
VOA Hausa and BBC Yoruba news classification: News title classification datasets for Hausa and Yoruba.

Sentiment Analysis

TUNIZI: A Tunisian Arabizi sentiment analysis dataset.

Text Summarization

Amharic Summarization: A dataset for Amharic abstractive text summarization.

Named Entity Recognition

MasakhaNER: A dataset for Named Entity Recognition in 10 African languages.
WikiANN: A dataset for Named Entity Recognition in 282 languages, including several African languages.
Yoruba GV NER: A Yoruba Named Entity Recognition dataset.
Hausa VOA NER: A Hausa Named Entity Recognition dataset.

Automated Speech Recognition (ASR)

ALFFA: An ASR dataset for Amharic, Hausa, Swahili, and Wolof.
AMMI ASR dataset: An ASR dataset for 19 languages, including 16 African languages.
CommonVoice: An ongoing ASR dataset project for 60 languages (as of May 2021), including Kinyarwanda, Kabyle, and Luganda.
Fon: An ASR dataset for Fon.
Swahili: A Swahili speech dataset (Free registration required).
Congolese Swahili: A Congolese Swahili speech dataset (Free registration required).

Speech Translation

Mboshi: A Mboshi-French parallel speech dataset.
IWSLT 2021 Speech Translation: Speech translation datasets for Swahili-English and Congolese Swahili-French.

Monolingual Data

Swahili Language Modeling: A Swahili dataset for language modeling and additional datasets for Swahili Syllabic Alphabet and Swahili Word Analogy.
OSCAR: A multilingual dataset for 166 languages, including Amharic, Somali, Yoruba, Egyptian Arabic, Malagasy, Swahili, and Afrikaans.

Contributions

This is a growing list of NLP datasets for African languages. If I have missed a publicly available dataset, please open a pull request or email me at [email protected] to add it.

OmondiKevin/africanlp-public-datasets
