Corpus Manager

Build and manage bilingual and monolingual corpora from high-quality African language datasets. Preview results, download in multiple formats, and track license compliance.

Corpus Configuration

Configure your corpus-building parameters

Available Data Sources

Select corpus sources to include

High-quality African language texts

CC-BY-4.0
90%
Languages: tw, yo, ha, sw, ig
~50,000 entries
✓ Commercial
Attribution required

Large collection of parallel texts

Various (CC-BY, CC0)
80%
Languages: en, tw, yo, ha, sw, ig
~100,000 entries
✓ Commercial
Attribution required

Parallel corpus from Jehovah's Witnesses publications

Research Use Only
85%
Languages: en, tw, yo, ha, sw, ig, ee, gaa
~75,000 entries
✗ Non-commercial
Attribution required

News articles and social media content

CC-BY-SA-4.0
70%
Languages: tw, yo, ha, sw, ig, ee
~200,000 entries
✓ Commercial
Attribution required
Share-alike
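
The license flags on the sources above can drive source selection before a build. The following is a minimal sketch in Python, not the tool's actual implementation: the `Source` class, its field names, and the `select_sources` helper are hypothetical names introduced for illustration, and the records simply restate the descriptions, licenses, language codes, and approximate entry counts listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Source:
    """Hypothetical record mirroring one data-source card."""
    description: str
    license: str
    languages: Tuple[str, ...]
    approx_entries: int
    commercial_use: bool
    share_alike: bool = False


# Values copied from the source cards above; entry counts are approximate.
SOURCES = [
    Source("High-quality African language texts", "CC-BY-4.0",
           ("tw", "yo", "ha", "sw", "ig"), 50_000, True),
    Source("Large collection of parallel texts", "Various (CC-BY, CC0)",
           ("en", "tw", "yo", "ha", "sw", "ig"), 100_000, True),
    Source("Parallel corpus from Jehovah's Witnesses publications",
           "Research Use Only",
           ("en", "tw", "yo", "ha", "sw", "ig", "ee", "gaa"), 75_000, False),
    Source("News articles and social media content", "CC-BY-SA-4.0",
           ("tw", "yo", "ha", "sw", "ig", "ee"), 200_000, True,
           share_alike=True),
]


def select_sources(sources: List[Source], language: str,
                   commercial_only: bool = False,
                   allow_share_alike: bool = True) -> List[Source]:
    """Keep sources that cover `language` and satisfy the license constraints."""
    return [
        s for s in sources
        if language in s.languages
        and (s.commercial_use or not commercial_only)
        and (allow_share_alike or not s.share_alike)
    ]
```

For example, `select_sources(SOURCES, "ee", commercial_only=True)` keeps only the CC-BY-SA-4.0 news and social media source, because the other source covering `ee` is restricted to research use. Passing `allow_share_alike=False` as well leaves no source for `ee`.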

No corpus built yet

Configure your parameters and click "Build Corpus" to get started.