Kencorpus Kenyan Language Corpus

Language / NLP

A text and speech corpus for three Kenyan languages: Swahili, Dholuo and Luhya. It contains 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). It also ships derived NLP datasets: POS-tagged Dholuo and Luhya text, 7,537 Swahili question-answer pairs and 13,400 translated sentences. Released in 2022 and downloadable from Harvard Dataverse.

Category
Datasets
Pricing
Free / open
Country
Kenya
Last verified
5 Jul 2026

Tags

nlp
corpus
swahili
dholuo
luhya
