# F5-TTS Yorùbá — First Zero-Shot Voice Cloning TTS for Yorùbá

A fine-tune of F5-TTS v1 Base (335M parameters) for the Yorùbá language, with full tonal diacritic support.
## Highlights
- Zero-shot voice cloning: Clone any voice from a 5-10 second reference clip
- Tonal language support: Diacritic-driven pitch generation (ẹ́, ọ̀, ṣ, etc.)
- First of its kind: First DiT-based / flow-matching TTS model for any African language
- Character-level tokenization: No phonemizer needed — raw Yorùbá text with diacritics in, speech out
- Fast inference: ~350ms per utterance at NFE=16 on L4 GPU (RTF ≈ 0.15)
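The speed figures above imply a length for the benchmark utterance. RTF (real-time factor) is synthesis time divided by audio duration, so a quick back-of-envelope check (plain arithmetic, no F5-TTS needed):

```python
# RTF = synthesis time / audio duration, so ~350 ms of compute at
# RTF ≈ 0.15 corresponds to roughly 350 / 0.15 ≈ 2333 ms of audio,
# i.e. the benchmark utterance is about 2.3 s long.
gen_time_ms = 350
rtf = 0.15
audio_ms = gen_time_ms / rtf
print(round(audio_ms))  # prints 2333
```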
## Training Details

| Setting | Value |
|---|---|
| Base model | F5-TTS v1 Base (flow-matching DiT) |
| Parameters | 335M |
| Training data | ~13.4 hours (BibleTTS + SLR86 + WAXAL) |
| Steps | 150,000 |
| GPUs | 2× A100-40GB |
| Effective batch | 38,400 frames/step |
| Learning rate | 7.5e-5 |
| Tokenizer | Character-level (no pinyin) |
| Vocab | 2,562 chars (base + Yorùbá diacritics) |
## Usage

```python
import f5_tts.model.utils as f5_utils

# CRITICAL: bypass pinyin conversion BEFORE any other F5-TTS import
f5_utils.convert_char_to_pinyin = lambda texts, polyphone=True: texts

from f5_tts.api import F5TTS

f5tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_150000.pt",
    vocab_file="vocab.txt",
    device="cuda",
)

# ref_file: 5-10 s WAV of any voice (add ~1 s trailing silence for best results)
# ref_text: transcript of what the reference says (Yorùbá or English)
# gen_text: Yorùbá text with full diacritics
wav, sr, _ = f5tts.infer(
    ref_file="reference.wav",
    ref_text="text spoken in reference",
    gen_text="ẹ kú àárọ̀, báwo ni àwọn ọmọ yín ṣe wà?",
    speed=1.0,
    nfe_step=16,
    file_wave="output.wav",
)
```
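The pinyin bypass works because F5-TTS normally maps input text to pinyin for Mandarin; replacing that function with an identity passthrough lets raw Yorùbá codepoints reach the character-level tokenizer. A minimal sketch of what the patch does (pure Python, no F5-TTS dependency):

```python
# Sketch: the bypass replaces pinyin conversion with an identity
# function, so diacritics pass through to the tokenizer untouched.
def bypass(texts, polyphone=True):
    return texts

batch = ["báwo ni àwọn ọmọ yín ṣe wà?"]
assert bypass(batch) == batch        # text is returned unchanged
assert list(batch[0])[0] == "b"      # the model then consumes raw characters
```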
## Tips for Best Results
- Use full diacritics: "oko" (hoe), "okó" (husband), "okò" (vehicle) are different words. Diacritics drive pitch.
- Reference audio: Use clean 5-10 second clips with ~1 second of trailing silence. Avoid background music/noise.
- Reference text: Must accurately match what is spoken in the reference audio.
- NFE steps: 16 recommended for best quality/speed tradeoff. Use 8 for faster inference with slight quality reduction.
- Speed: 1.0 recommended. Lower values (e.g. 0.85) may prevent truncation on long text but can cause slight shakiness.
- Text cleanup: Strip any bracket tags (e.g. [breath], [snap]) from the reference text if present.
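The bracket-tag cleanup and a quick tone-mark sanity check can be sketched as below. The helper names are hypothetical, not part of the F5-TTS API:

```python
import re
import unicodedata

def strip_bracket_tags(text: str) -> str:
    """Remove tags like [breath] or [snap] and collapse leftover spaces."""
    cleaned = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

def has_tone_marks(text: str) -> bool:
    """True if the text carries any grave (low) or acute (high) tone marks."""
    marks = {"\u0300", "\u0301"}  # combining grave / acute accents
    return any(ch in marks for ch in unicodedata.normalize("NFD", text))
```

Running `has_tone_marks` on input text before synthesis is a cheap way to catch diacritic-stripped text, which would otherwise produce wrong pitch contours.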
## Training Data
| Source | Samples | Hours | Notes |
|---|---|---|---|
| BibleTTS Yorùbá | 7,560 | ~8h | Studio quality, single speaker |
| SLR86 | 3,583 | ~3h | Crowdsourced, male + female |
| WAXAL TTS | 1,492 | ~3h | Diacritics restored via Gemini |
## Limitations
- Tonal minimal pair differentiation is moderate (not perfect for all tone contrasts)
- Occasional brief audio artifacts (~5% of generations) — regenerating typically produces a clean output
- Reference audio may bleed slightly into the start of generated audio — trimming the first 200-400ms helps
- Requires full Yorùbá diacritics in input text for best results
- Reference audio quality directly affects output quality
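The leading-bleed workaround above is a simple slice. A minimal sketch, assuming the returned waveform is a NumPy array (the `trim_leading` helper and the 24 kHz sample rate are illustrative assumptions, not part of the API):

```python
import numpy as np

def trim_leading(wav: np.ndarray, sr: int, ms: int = 300) -> np.ndarray:
    """Drop the first `ms` milliseconds to cut reference-audio bleed."""
    return wav[int(sr * ms / 1000):]

# e.g. one second of audio at an assumed 24 kHz, trimming 300 ms
wav = np.zeros(24000, dtype=np.float32)
trimmed = trim_leading(wav, sr=24000, ms=300)  # 16800 samples remain
```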
## What's Next
- Phase 2 tonal fine-tuning (oversampled minimal pairs)
- Hausa and Igbo models using the same pipeline
- Nigerian Pidgin (non-tonal, simpler)
- Edge-optimized smaller model for on-device inference
## License
CC-BY-NC-4.0
This model is free for research and non-commercial use. For commercial licensing, contact us.
The base F5-TTS pretrained model is licensed under CC-BY-NC due to the Emilia training dataset.
## Citation

```bibtex
@misc{naijaml-f5tts-yoruba-2026,
  title={F5-TTS Yorùbá: First Zero-Shot Voice Cloning TTS for Yorùbá},
  author={NaijaML},
  year={2026},
  url={https://huggingface.co/naijaml/f5-tts-yoruba}
}
```
## Acknowledgments
Built on F5-TTS by Yushen Chen et al. Training data from BibleTTS, SLR86, and Google WAXAL.