# F5-TTS Yorùbá — First Zero-Shot Voice Cloning TTS for Yorùbá

A fine-tune of F5-TTS v1 Base (335M parameters) for the Yorùbá language, with full tonal diacritic support.
## Highlights
- Zero-shot voice cloning: Clone any voice from a 5-10 second reference clip
- Tonal language support: Diacritic-driven pitch generation (ẹ́, ọ̀, ṣ, etc.)
- First of its kind: First DiT-based / flow-matching TTS model for any African language
- Character-level tokenization: No phonemizer needed — raw Yorùbá text with diacritics in, speech out
- Fast inference: ~350ms per utterance at NFE=16 on L4 GPU (RTF ≈ 0.15)
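The speed figures above imply a length for the benchmark utterance. RTF (real-time factor) is synthesis time divided by audio duration, so a quick back-of-envelope check (plain arithmetic, no F5-TTS needed):

```python
# RTF = synthesis time / audio duration, so ~350 ms of compute at
# RTF ≈ 0.15 corresponds to roughly 350 / 0.15 ≈ 2333 ms of audio,
# i.e. the benchmark utterance is about 2.3 s long.
gen_time_ms = 350
rtf = 0.15
audio_ms = gen_time_ms / rtf
print(round(audio_ms))  # prints 2333
```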
## Training Details

| Setting | Value |
|---|---|
| Base model | F5-TTS v1 Base (flow-matching DiT) |
| Parameters | 335M |
| Training data | ~13.4 hours (BibleTTS + SLR86 + WAXAL) |
| Steps | 150,000 |
| GPUs | 2× A100-40GB |
| Effective batch | 38,400 frames/step |
| Learning rate | 7.5e-5 |
| Tokenizer | Character-level (no pinyin) |
| Vocab | 2,562 chars (base + Yorùbá diacritics) |
## Usage

```python
import f5_tts.model.utils as f5_utils

# CRITICAL: bypass pinyin conversion BEFORE any other F5-TTS import
f5_utils.convert_char_to_pinyin = lambda texts, polyphone=True: texts

from f5_tts.api import F5TTS

f5tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_150000.pt",
    vocab_file="vocab.txt",
    device="cuda",
)

# ref_file: 5-10 s WAV of any voice (add ~1 s trailing silence for best results)
# ref_text: transcript of what the reference says (Yorùbá or English)
# gen_text: Yorùbá text with full diacritics
wav, sr, _ = f5tts.infer(
    ref_file="reference.wav",
    ref_text="text spoken in reference",
    gen_text="ẹ kú àárọ̀, báwo ni àwọn ọmọ yín ṣe wà?",
    speed=1.0,
    nfe_step=16,
    file_wave="output.wav",
)
```
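The pinyin bypass works because F5-TTS normally maps input text to pinyin for Mandarin; replacing that function with an identity passthrough lets raw Yorùbá codepoints reach the character-level tokenizer. A minimal sketch of what the patch does (pure Python, no F5-TTS dependency):

```python
# Sketch: the bypass replaces pinyin conversion with an identity
# function, so diacritics pass through to the tokenizer untouched.
def bypass(texts, polyphone=True):
    return texts

batch = ["báwo ni àwọn ọmọ yín ṣe wà?"]
assert bypass(batch) == batch        # text is returned unchanged
assert list(batch[0])[0] == "b"      # the model then consumes raw characters
```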
## Tips for Best Results
- Use full diacritics: "oko" (hoe), "okó" (husband), "okò" (vehicle) are different words. Diacritics drive pitch.
- Reference audio: Use clean 5-10 second clips with ~1 second of trailing silence. Avoid background music/noise.
- Reference text: Must accurately match what is spoken in the reference audio.
- NFE steps: 16 recommended for best quality/speed tradeoff. Use 8 for faster inference with slight quality reduction.
- Speed: 1.0 recommended. Lower values (e.g. 0.85) may prevent truncation on long text but can cause slight shakiness.
- Text cleanup: Strip any bracket tags (e.g. [breath], [snap]) from the reference text if present.
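The bracket-tag cleanup and a quick tone-mark sanity check can be sketched as below. The helper names are hypothetical, not part of the F5-TTS API:

```python
import re
import unicodedata

def strip_bracket_tags(text: str) -> str:
    """Remove tags like [breath] or [snap] and collapse leftover spaces."""
    cleaned = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

def has_tone_marks(text: str) -> bool:
    """True if the text carries any grave (low) or acute (high) tone marks."""
    marks = {"\u0300", "\u0301"}  # combining grave / acute accents
    return any(ch in marks for ch in unicodedata.normalize("NFD", text))
```

Running `has_tone_marks` on input text before synthesis is a cheap way to catch diacritic-stripped text, which would otherwise produce wrong pitch contours.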
## Training Data
| Source | Samples | Hours | Notes |
|---|---|---|---|
| BibleTTS Yorùbá | 7,560 | ~8h | Studio quality, single speaker |
| SLR86 | 3,583 | ~3h | Crowdsourced, male + female |
| WAXAL TTS | 1,492 | ~3h | Diacritics restored via Gemini |
## Limitations
- Tonal minimal pair differentiation is moderate (not perfect for all tone contrasts)
- Occasional brief audio artifacts (~5% of generations) — regenerating typically produces a clean output
- Reference audio may bleed slightly into the start of generated audio — trimming the first 200-400ms helps
- Requires full Yorùbá diacritics in input text for best results
- Reference audio quality directly affects output quality
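The leading-bleed workaround above is a simple slice. A minimal sketch, assuming the returned waveform is a NumPy array (the `trim_leading` helper and the 24 kHz sample rate are illustrative assumptions, not part of the API):

```python
import numpy as np

def trim_leading(wav: np.ndarray, sr: int, ms: int = 300) -> np.ndarray:
    """Drop the first `ms` milliseconds to cut reference-audio bleed."""
    return wav[int(sr * ms / 1000):]

# e.g. one second of audio at an assumed 24 kHz, trimming 300 ms
wav = np.zeros(24000, dtype=np.float32)
trimmed = trim_leading(wav, sr=24000, ms=300)  # 16800 samples remain
```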
## What's Next
- Phase 2 tonal fine-tuning (oversampled minimal pairs)
- Hausa and Igbo models using the same pipeline
- Nigerian Pidgin (non-tonal, simpler)
- Edge-optimized smaller model for on-device inference
## License
CC-BY-NC-4.0
This model is free for research and non-commercial use. For commercial licensing, contact us.
The base F5-TTS pretrained model is licensed under CC-BY-NC due to the Emilia training dataset.
## Citation

```bibtex
@misc{naijaml-f5tts-yoruba-2026,
  title={F5-TTS Yorùbá: First Zero-Shot Voice Cloning TTS for Yorùbá},
  author={NaijaML},
  year={2026},
  url={https://huggingface.co/naijaml/f5-tts-yoruba}
}
```
## Acknowledgments
Built on F5-TTS by Yushen Chen et al. Training data from BibleTTS, SLR86, and Google WAXAL.