Kencorpus Kenyan Language Corpus vs MasakhaNER 2.0

A verified, side-by-side comparison. Both records are status-checked by Findra, so you are comparing what each actually offers today, not a stale listing.

Tags

Kencorpus Kenyan Language Corpus: nlp, corpus, swahili, dholuo, luhya

MasakhaNER 2.0: nlp, ner, named-entity-recognition, african-languages, token-classification

Links

Kencorpus Kenyan Language Corpus: Website, Docs

MasakhaNER 2.0: Website, Docs, GitHub

Summary

Kencorpus Kenyan Language Corpus: Text and speech corpus for three Kenyan languages (Swahili, Dholuo and Luhya), containing 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). It also ships derived NLP sets: POS-tagged Dholuo and Luhya text, 7,537 Swahili question-answer pairs and 13,400 translated sentences. Released in 2022 and downloadable from Harvard Dataverse.

MasakhaNER 2.0: Largest high-quality named-entity-recognition corpus for 20 African languages (including Nigerian Pidgin, Hausa, Igbo and Yoruba), with PER/ORG/LOC/DATE tags over news-domain text, totaling ~152,786 rows. Built by the Masakhane community.
