
AfriQA vs Kencorpus Kenyan Language Corpus

A verified, side-by-side comparison. Both records are status-checked by Findra, so you are comparing what each actually offers today, not a stale listing.

| Field | AfriQA | Kencorpus Kenyan Language Corpus |
| --- | --- | --- |
| Category | Datasets | Datasets |
| Type | Language / NLP | Language / NLP |
| Country | 🌍 Pan-African | 🏳️ Kenya |
| Docs status | Docs live | Docs live |
| Licensing / Pricing | Free / CC-BY-SA 4.0 | Free / open |
| Verified | Verified | Unverified |
| Last verified | 5 Jul 2026 | 5 Jul 2026 |
| Tags | nlp, african-languages, question-answering, cross-lingual, open-retrieval | nlp, corpus, swahili, dholuo, luhya |
Summary

AfriQA: A cross-lingual open-retrieval question-answering dataset from the Masakhane initiative, with human-translated QA pairs for 10 African languages (including Hausa, Igbo and Yoruba). It totals 12,159 examples across train, validation and test splits.

Kencorpus Kenyan Language Corpus: A text and speech corpus for three Kenyan languages (Swahili, Dholuo and Luhya), containing 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). It also ships derived NLP sets: POS-tagged Dholuo and Luhya text, 7,537 Swahili question-answer pairs, and 13,400 translated sentences. Released in 2022 and downloadable from Harvard Dataverse.