Corpus Manager

Build and manage bilingual and monolingual corpora from high-quality African language datasets. Track licenses and ensure compliance for your language processing needs.

Corpus Configuration

Configure your corpus-building parameters

Available Data Sources

Select corpus sources to include

High-quality African language texts

CC-BY-4.0
90%
Languages: tw, yo, ha, sw, ig
~50,000 entries

Large collection of parallel texts

Various (CC-BY, CC0)
80%
Languages: en, tw, yo, ha, sw, ig
~100,000 entries

Parallel corpus from Jehovah's Witnesses publications

Research Use Only
85%
Languages: en, tw, yo, ha, sw, ig, ee, gaa
~75,000 entries

News articles and social media content

CC-BY-SA-4.0
70%
Languages: tw, yo, ha, sw, ig, ee
~200,000 entries
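
Selecting sources by language and license could be sketched as below. This is a minimal illustration and not the tool's actual implementation: the `Source` class, its field names, and `select_sources` are hypothetical. The page does not label the percentage shown for each source, so it is stored here as a plain `percent` value, and entry counts are the approximate figures listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Source:
    """One data-source card (hypothetical structure, fields taken from the listing)."""
    description: str
    license: str
    percent: int                 # unlabeled percentage shown on the page
    languages: Tuple[str, ...]   # language codes as listed
    approx_entries: int          # approximate entry count


SOURCES = [
    Source("High-quality African language texts", "CC-BY-4.0", 90,
           ("tw", "yo", "ha", "sw", "ig"), 50_000),
    Source("Large collection of parallel texts", "Various (CC-BY, CC0)", 80,
           ("en", "tw", "yo", "ha", "sw", "ig"), 100_000),
    Source("Parallel corpus from Jehovah's Witnesses publications",
           "Research Use Only", 85,
           ("en", "tw", "yo", "ha", "sw", "ig", "ee", "gaa"), 75_000),
    Source("News articles and social media content", "CC-BY-SA-4.0", 70,
           ("tw", "yo", "ha", "sw", "ig", "ee"), 200_000),
]


def select_sources(sources: List[Source], language: str,
                   allow_research_only: bool = False) -> List[Source]:
    """Keep sources that cover `language`; skip research-only ones unless allowed."""
    return [
        s for s in sources
        if language in s.languages
        and (allow_research_only or s.license != "Research Use Only")
    ]


if __name__ == "__main__":
    for s in select_sources(SOURCES, "ee"):
        print(s.description, s.license, s.approx_entries)
```

Excluding the "Research Use Only" source by default is a design choice for this sketch: a corpus built without it can be redistributed under the remaining licenses' terms, while passing `allow_research_only=True` makes the restricted source an explicit opt-in.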