MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Paper: arXiv:2502.18924
This is a complete MegaTTS3 model with WaveVAE support for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.
Key Features:
🔥 One-Click Windows Installer - Automated setup with GPU detection
Advanced users can follow the manual installation instructions instead.
```sh
# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Higher-quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
```
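For synthesizing many utterances, the CLI above can be driven from Python. A minimal sketch, assuming `tts/infer_cli.py` and the flags shown in this README; the helper name, texts, and output paths are illustrative:

```python
import shlex
import subprocess

def build_infer_cmd(input_wav, text, output_dir, p_w=None, t_w=None):
    """Assemble an infer_cli.py command line (flag names from this README)."""
    cmd = [
        "python", "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", text,
        "--output_dir", output_dir,
    ]
    if p_w is not None:
        cmd += ["--p_w", str(p_w)]
    if t_w is not None:
        cmd += ["--t_w", str(t_w)]
    return cmd

if __name__ == "__main__":
    texts = ["Hello there.", "A second test sentence."]  # illustrative inputs
    for i, line in enumerate(texts):
        cmd = build_infer_cmd("reference.wav", line, f"./output/{i:03d}")
        print(shlex.join(cmd))
        # Uncomment to actually synthesize (requires the MegaTTS3 repo + weights):
        # subprocess.run(cmd, check=True)
```

The `subprocess.run` call is left commented out so the sketch stays runnable without the model installed.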
- `--p_w` (Intelligibility): 1.0-5.0, higher = clearer speech
- `--t_w` (Similarity): 0.0-10.0, higher = more similar to the reference

If you use this model, please cite the original research:
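When tuning these two knobs, it can help to sweep a small grid of (p_w, t_w) pairs and compare the outputs by ear. A hedged sketch; only the flag names and valid ranges come from this README, while the helper name and grid values are illustrative:

```python
import itertools

# Ranges documented above: p_w in [1.0, 5.0], t_w in [0.0, 10.0].
P_W_RANGE = (1.0, 5.0)
T_W_RANGE = (0.0, 10.0)

def sweep_flags(p_ws, t_ws):
    """Yield (label, extra_flags) for each in-range (p_w, t_w) combination."""
    for p_w, t_w in itertools.product(p_ws, t_ws):
        if not (P_W_RANGE[0] <= p_w <= P_W_RANGE[1]):
            raise ValueError(f"p_w {p_w} outside {P_W_RANGE}")
        if not (T_W_RANGE[0] <= t_w <= T_W_RANGE[1]):
            raise ValueError(f"t_w {t_w} outside {T_W_RANGE}")
        label = f"p{p_w}_t{t_w}"  # e.g. use as an output subdirectory name
        yield label, ["--p_w", str(p_w), "--t_w", str(t_w)]

if __name__ == "__main__":
    for label, flags in sweep_flags([1.5, 2.0], [2.0, 3.0]):
        print(label, " ".join(flags))
```

Each `extra_flags` list can be appended to the `infer_cli.py` invocation shown above, with `label` distinguishing the output directories.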
```bibtex
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
```
High-quality voice cloning for research and creative applications. Please use responsibly.