F5-TTS Yorùbá — First Zero-Shot Voice Cloning TTS for Yorùbá

A fine-tune of F5-TTS v1 Base (335M parameters) for the Yorùbá language, with full support for tonal diacritics.

Highlights

  • Zero-shot voice cloning: Clone any voice from a 5-10 second reference clip
  • Tonal language support: Diacritic-driven pitch generation (ẹ́, ọ̀, ṣ, etc.)
  • First of its kind: To our knowledge, the first DiT-based / flow-matching TTS model for any African language
  • Character-level tokenization: No phonemizer needed — raw Yorùbá text with diacritics in, speech out
  • Fast inference: ~350ms per utterance at NFE=16 on L4 GPU (RTF ≈ 0.15)
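Character-level tokenization means the model's input pipeline is little more than "one token per Unicode code point." A toy sketch of the idea (the vocabulary below is an illustration, not the model's actual vocab.txt, which also contains combining diacritic marks):

```python
# Character-level tokenization for Yorùbá text: each Unicode code point
# maps to an ID; no phonemizer or grapheme-to-phoneme step is involved.
# Note that a mark like the grave on "ọ̀" may be a separate combining
# code point, which is why the real vocabulary includes combining marks.
text = "ẹ kú àárọ̀"  # "good morning"

chars = list(text)  # one token per Unicode code point
vocab = {c: i for i, c in enumerate(sorted(set(chars)))}  # toy vocab
token_ids = [vocab[c] for c in chars]
```

Joining the tokens back together recovers the original string exactly, which is the property that lets raw diacritic-marked text flow straight into the model.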

Training Details

Base model       F5-TTS v1 Base (flow-matching DiT)
Parameters       335M
Training data    ~13.4 hours (BibleTTS + SLR86 + WAXAL)
Steps            150,000
GPUs             2× A100-40GB
Effective batch  38,400 frames/step
Learning rate    7.5e-5
Tokenizer        Character-level (no pinyin)
Vocab            2,562 chars (base + Yorùbá diacritics)
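A back-of-envelope check relates the frame-based batch size to audio hours seen during training. This assumes F5-TTS's default mel settings (24 kHz audio, hop length 256, i.e. 93.75 mel frames per second), which are an assumption here, not stated in the table:

```python
# How much audio does 38,400 frames/step over 150,000 steps amount to?
# Assumes 24 kHz audio with hop length 256 -> 93.75 mel frames/second.
FRAMES_PER_SEC = 24_000 / 256          # 93.75
frames_per_step = 38_400               # effective batch from the table
steps = 150_000
dataset_hours = 13.4

audio_sec_per_step = frames_per_step / FRAMES_PER_SEC   # ~409.6 s/step
total_hours = audio_sec_per_step * steps / 3600          # ~17,000 h
epochs = total_hours / dataset_hours                     # ~1,270 passes
```

Under those assumptions, 150k steps correspond to roughly 1,300 passes over the 13.4-hour dataset, which is in the regime where fine-tunes of this size typically converge.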

Usage

# CRITICAL: patch out pinyin conversion BEFORE importing the F5TTS API,
# so the character-level tokenizer receives raw Yorùbá text unchanged.
import f5_tts.model.utils as f5_utils
f5_utils.convert_char_to_pinyin = lambda texts, polyphone=True: texts

from f5_tts.api import F5TTS

f5tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model_150000.pt",
    vocab_file="vocab.txt",
    device="cuda",
)

# ref_file: 5-10s WAV of any voice (add ~1s trailing silence for best results)
# ref_text: what the reference says (Yorùbá or English)
# gen_text: Yorùbá text with full diacritics
wav, sr, _ = f5tts.infer(
    ref_file="reference.wav",
    ref_text="text spoken in reference",
    gen_text="ẹ kú àárọ̀, báwo ni àwọn ọmọ yín ṣe wà?",
    speed=1.0,
    nfe_step=16,
    file_wave="output.wav",
)
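Because the tokenizer operates on raw characters, the same visible diacritic can be encoded two ways in Unicode (a precomposed letter, or a base letter plus combining marks), and a mismatch with the vocabulary yields unknown tokens. A small, hypothetical preprocessing step; which normalization form the model's vocab.txt expects is an assumption you should verify against the file:

```python
import unicodedata

def normalize_yoruba(text: str, form: str = "NFC") -> str:
    """Normalize Unicode so a letter like 'ọ̀' is encoded consistently.

    NFC composes base letters with diacritics wherever a precomposed code
    point exists; NFD fully decomposes them. Which form matches the
    model's vocab.txt is an assumption -- inspect the file to confirm.
    """
    return unicodedata.normalize(form, text)

nfd = "o\u0323\u0300"        # 'ọ̀' as base 'o' + underdot + grave (3 code points)
nfc = normalize_yoruba(nfd)  # composes to 'ọ' (U+1ECD) + grave (2 code points)
```

Both strings render identically, but char-level tokenization sees three tokens in one case and two in the other, so normalizing ref_text and gen_text consistently avoids silent vocabulary misses.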

Tips for Best Results

  • Use full diacritics: oko (farm), ọkọ (husband), ọkọ́ (hoe), and ọkọ̀ (vehicle) are different words distinguished only by dots and tone marks. Diacritics drive pitch.
  • Reference audio: Use clean 5-10 second clips with ~1 second of trailing silence. Avoid background music/noise.
  • Reference text: Must accurately match what is spoken in the reference audio.
  • NFE steps: 16 recommended for best quality/speed tradeoff. Use 8 for faster inference with slight quality reduction.
  • Speed: 1.0 recommended. Lower values (e.g. 0.85) may prevent truncation on long text but can cause slight shakiness.
  • Text cleanup: Strip any bracket tags (e.g. [breath], [snap]) from reference text if present.
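The reference-preparation tips above can be bundled into one helper. This is a hypothetical utility, not part of the F5-TTS API; it assumes ref_wav is a mono float array at sample rate sr:

```python
import re
import numpy as np

def prepare_reference(ref_text: str, ref_wav: np.ndarray, sr: int,
                      trailing_silence_s: float = 1.0):
    """Strip bracket tags ([breath], [snap], ...) from the reference text
    and append ~1 s of trailing silence to the reference audio, per the
    tips above. Hypothetical helper, not part of the F5-TTS API."""
    # Remove bracketed tags, then collapse any leftover double spaces.
    clean_text = re.sub(r"\[[^\]]*\]", "", ref_text)
    clean_text = re.sub(r"\s+", " ", clean_text).strip()

    silence = np.zeros(int(trailing_silence_s * sr), dtype=ref_wav.dtype)
    return clean_text, np.concatenate([ref_wav, silence])

# 1 s dummy reference at 24 kHz -> text cleaned, audio padded to 2 s.
text, wav = prepare_reference("ẹ kú àárọ̀ [breath] o",
                              np.zeros(24_000, dtype=np.float32), 24_000)
```

The cleaned text and padded array can then be passed as ref_text and ref_file (after writing the array to a WAV) in the usage example above.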

Training Data

Source            Samples   Hours   Notes
BibleTTS Yorùbá   7,560     ~8h     Studio quality, single speaker
SLR86             3,583     ~3h     Crowdsourced, male + female speakers
WAXAL TTS         1,492     ~3h     Diacritics restored via Gemini

Limitations

  • Tonal minimal pair differentiation is moderate (not perfect for all tone contrasts)
  • Occasional brief audio artifacts (~5% of generations) — regenerating typically produces a clean output
  • Reference audio may bleed slightly into the start of generated audio — trimming the first 200-400ms helps
  • Requires full Yorùbá diacritics in input text for best results
  • Reference audio quality directly affects output quality

What's Next

  • Phase 2 tonal fine-tuning (oversampled minimal pairs)
  • Hausa and Igbo models using the same pipeline
  • Nigerian Pidgin (non-tonal, simpler)
  • Edge-optimized smaller model for on-device inference

License

CC-BY-NC-4.0

This model is free for research and non-commercial use. For commercial licensing, contact us.

The base F5-TTS pretrained model is licensed under CC-BY-NC due to the Emilia training dataset.

Citation

@misc{naijaml-f5tts-yoruba-2026,
  title={F5-TTS Yorùbá: First Zero-Shot Voice Cloning TTS for Yorùbá},
  author={NaijaML},
  year={2026},
  url={https://huggingface.co/naijaml/f5-tts-yoruba}
}

Acknowledgments

Built on F5-TTS by Yushen Chen et al. Training data from BibleTTS, SLR86, and Google WAXAL.
