Pan-African

6 verified Datasets resources with Pan-African coverage.

AfriQA

Datasets

Cross-lingual open-retrieval question-answering dataset with human-translated QA pairs for 10 African languages (incl. Hausa, Igbo, Yoruba), totaling 12,159 examples across train/validation/test splits. From the Masakhane initiative.

Docs live
Language / NLP
Verified Jun 2026
Free / CC-BY-SA 4.0

AfriSpeech-200

Datasets

Pan-African accented English speech corpus of ~200 hours covering 120 African accents from 13 countries and 2,463 speakers across clinical and general domains, with per-accent configs. Released by Intron Health.

Docs live
Speech
Verified Jun 2026
Free / CC-BY-NC-SA 4.0

AfriSpeech-Dialog

Datasets

Conversational African-accented speech corpus (~6 hours) of 50 two-speaker dialogues across 11 accents (Hausa, Yoruba, Igbo, Swahili, Sesotho and others) from Nigeria, Kenya and South Africa, for ASR and speaker diarization. By Intron Health.

Docs live
Speech
Verified Jun 2026
Free / CC-BY-NC-SA 4.0

MasakhaNER 2.0

Datasets

Largest high-quality named-entity-recognition corpus for 20 African languages (incl. Nigerian Pidgin, Hausa, Igbo, Yoruba) with PER/ORG/LOC/DATE tags over news-domain text, totaling ~152,786 rows. Built by the Masakhane community.

Docs live
Language / NLP
Verified Jun 2026
Free / CC-BY-NC 4.0

MasakhaNEWS

Datasets

News-topic-classification dataset for 16 widely spoken African languages (incl. Hausa, Igbo, Yoruba, Nigerian Pidgin), ~31,088 rows in CSV/Parquet with train/val/test splits across seven topic categories. Built by the Masakhane community.

Docs live
Language / NLP
Verified Jun 2026
Free / CC-BY-NC 4.0

MasakhaPOS

Datasets

Part-of-speech tagging dataset for 20 African languages (incl. Nigerian Pidgin, Hausa, Igbo, Yoruba) using Universal Dependencies tags, with per-language train/validation/test splits. Built by the Masakhane community.

Docs live
Language / NLP
Verified Jun 2026
Free / CC-BY-NC 4.0