Pan-African
6 verified dataset resources for building with Pan-African languages and accents.
AfriQA
Cross-lingual open-retrieval question-answering dataset with human-translated QA pairs for 10 African languages (incl. Hausa, Igbo, Yoruba), totaling 12,159 examples across train/validation/test splits. From the Masakhane initiative.
AfriSpeech-200
Pan-African accented English speech corpus of ~200 hours covering 120 African accents from 13 countries and 2,463 speakers across clinical and general domains, with per-accent configs. Released by Intron Health.
AfriSpeech-Dialog
Conversational African-accented speech corpus (~6 hours) of 50 two-speaker dialogues across 11 accents (Hausa, Yoruba, Igbo, Swahili, Sesotho and others) from Nigeria, Kenya and South Africa, for ASR and speaker diarization. By Intron Health.
MasakhaNER 2.0
Named-entity-recognition corpus for 20 African languages (incl. Nigerian Pidgin, Hausa, Igbo, Yoruba), described as the largest high-quality corpus of its kind, with PER/ORG/LOC/DATE tags over news-domain text, totaling ~152,786 rows. Built by the Masakhane community.
MasakhaNEWS
News-topic-classification dataset for 16 widely spoken African languages (incl. Hausa, Igbo, Yoruba, Nigerian Pidgin), with ~31,088 rows in CSV/Parquet, labeled with seven topic categories and divided into train/validation/test splits. Built by the Masakhane community.
MasakhaPOS
Part-of-speech tagging dataset for 20 African languages (incl. Nigerian Pidgin, Hausa, Igbo, Yoruba) using Universal Dependencies tags, with per-language train/validation/test splits. Built by the Masakhane community.
