MagpieTTS
MagpieTTS is NVIDIA's state-of-the-art neural Text-to-Speech (TTS) model that generates high-quality, natural-sounding speech from text using audio or text context for voice cloning and style transfer. Built on the NeMo Framework, MagpieTTS leverages transformer-based architectures with neural audio codecs to produce expressive, controllable speech synthesis.
Table of Contents
- Key Features
- Model Architecture
- NVIDIA NeMo
- How to Use This Model
- Inference Parameters
- Evaluation Metrics
- Advanced Features
- Legacy Checkpoint Support
- Datasets
- References
Key Features
MagpieTTS offers several key advantages for speech synthesis:
| Feature | Description |
|---|---|
| Zero-Shot Voice Cloning | Clone any voice with just a few seconds of reference audio |
| Text Context Support | Use text descriptions to control speaker style and characteristics |
| High-Quality Audio | Generates 24kHz audio using neural audio codecs |
| Classifier-Free Guidance | Enhanced generation quality with CFG support |
| Frame Stacking | Accelerated inference through multi-frame processing |
| Preference Optimization | DPO and GRPO support for improved output quality |
| Attention Prior | Monotonic alignment enforcement for robust synthesis |
| Local Transformer | Fast codebook prediction with MaskGit or autoregressive decoding |
Model Architecture
MagpieTTS is built on a transformer encoder-decoder architecture that processes text and audio context to generate neural audio codec tokens. The model learns to synthesize speech by predicting discrete audio tokens from a neural audio codec, conditioned on text transcripts and speaker context.
Core Components
1. Encoder
The encoder processes the input transcript and produces contextualized representations:
- Tokenization: Text is converted to IPA phonemes using the IPA tokenizer with grapheme-to-phoneme (G2P) conversion, or using a BPE character tokenizer for multilingual support
- Architecture: Transformer encoder with causal self-attention
- Output: Text representations used as conditioning for the decoder via cross-attention
# Default encoder configuration
encoder:
n_layers: 6
d_model: 768
d_ffn: 3072
sa_n_heads: 12
is_causal: true
2. Context Processing
MagpieTTS supports two types of context for voice cloning and style transfer:
| Context Type | Description | Processing |
|---|---|---|
| Audio Context | Reference speech audio (typically 5 seconds) | Encoded via audio codec → embedded → fed to decoder |
| Text Context | Textual description of speaker/style | Tokenized via ByT5 → embedded → fed to decoder |
The context is prepended to the decoder input, allowing the model to attend to speaker characteristics during generation.
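As an illustration only (the tensor names and shapes here are assumptions, not the model's actual API), prepending the context amounts to concatenating embeddings along the time axis before the decoder:
# Illustrative sketch of context prepending (hypothetical tensor names and shapes)
import torch

batch, d_model = 2, 768
context_emb = torch.randn(batch, 105, d_model)  # ~5 s of audio context at 21 frames/s
audio_emb = torch.randn(batch, 40, d_model)     # embeddings of previously generated audio tokens

# The decoder self-attends causally over [context, audio] along the time axis
decoder_input = torch.cat([context_emb, audio_emb], dim=1)  # shape: (batch, 145, d_model)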
3. Decoder
The decoder is the core generation component that autoregressively predicts audio codec tokens:
- Architecture: Transformer decoder with causal self-attention and cross-attention to encoder output
- Input: Context embeddings + previously generated audio token embeddings
- Cross-Attention: Attends to text encoder output for content alignment
- Output: Logits for all codebooks at each timestep
# Default decoder configuration
decoder:
n_layers: 12
d_model: 768
d_ffn: 3072
sa_n_heads: 12
xa_n_heads: 1 # Cross-attention heads
has_xattn: true
is_causal: true
4. Local Transformer (Optional)
For models using multiple codebooks, the Local Transformer refines per-frame predictions:
| Mode | Description |
|---|---|
| Autoregressive | Predicts codebooks sequentially within each frame |
| MaskGit | Parallel prediction with iterative refinement |
The Local Transformer takes the decoder's hidden state and generates tokens for all codebooks at that timestep.
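For intuition, here is a minimal sketch of MaskGit-style decoding for a single frame. This is not the NeMo implementation; the local_transformer callable, mask token, and confidence schedule are assumptions.
# Minimal MaskGit-style sketch for one frame (hypothetical interface, not NeMo code)
import torch

def maskgit_decode_frame(local_transformer, decoder_hidden, n_codebooks=8, n_steps=3, mask_id=0):
    tokens = torch.full((n_codebooks,), mask_id, dtype=torch.long)  # start fully masked
    masked = torch.ones(n_codebooks, dtype=torch.bool)
    for step in range(n_steps):
        logits = local_transformer(decoder_hidden, tokens)  # (n_codebooks, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Keep the most confident predictions so far; re-mask the rest for the next pass
        n_keep = max(1, round(n_codebooks * (step + 1) / n_steps))
        keep = conf.argsort(descending=True)[:n_keep]
        tokens[keep] = pred[keep]
        masked[keep] = False
        tokens[masked] = mask_id
    return tokens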
5. Audio Codec
MagpieTTS uses a neural audio codec to convert between waveforms and discrete tokens:
- Encoding: Raw audio → Discrete tokens (multiple codebooks)
- Decoding: Discrete tokens → High-quality 24kHz waveform
- Frame Rate: Typically 21 Hz (frames per second)
- Codebooks: Multiple parallel codebooks (e.g., 8) for high-fidelity reconstruction
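To make these numbers concrete, here is a quick back-of-the-envelope calculation with the typical values above (actual values depend on the codec checkpoint):
# Rough token-count arithmetic for the typical codec settings listed above
frame_rate_hz = 21    # frames per second (typical)
n_codebooks = 8       # parallel codebooks (example value)
duration_s = 5.0      # e.g. a 5-second context clip

frames = int(duration_s * frame_rate_hz)  # 105 decoder timesteps
tokens = frames * n_codebooks             # 840 discrete codec tokens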
Generation Process
During inference, MagpieTTS generates audio through the following steps:
1. ENCODE TEXT
   └── Transcript → IPA Tokens → Text Encoder → Text Representations
2. PREPARE CONTEXT
   ├── Context Audio → Audio Codec → Context Embeddings
   └── OR: Context Text → ByT5 Tokens → Context Embeddings
3. AUTOREGRESSIVE DECODING
   └── For each timestep t:
       ├── Input: [Context Embeddings, Audio Tokens 0..t-1]
       ├── Decoder: Cross-attend to text, self-attend causally
       ├── Output: Hidden state for timestep t
       └── Local Transformer: Hidden → Codebook tokens
4. DECODE AUDIO
   └── All Tokens → Audio Codec Decoder → Waveform (24kHz)
Attention Mechanisms
MagpieTTS employs multiple attention mechanisms for robust generation:
| Mechanism | Purpose |
|---|---|
| Causal Self-Attention | Autoregressive generation in decoder |
| Cross-Attention | Align audio generation with text content |
| Attention Prior | Beta-binomial prior for monotonic alignment |
| Alignment Encoder | Learned alignment for attention guidance |
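The beta-binomial prior biases cross-attention toward a roughly diagonal, monotonic text-audio alignment. Below is a minimal sketch of how such a prior can be constructed; the exact parameterization used by MagpieTTS may differ.
# Sketch of a beta-binomial alignment prior (general technique; exact parameters are an assumption)
import numpy as np
from scipy.stats import betabinom

def beta_binomial_prior(text_len, audio_len, scaling=1.0):
    """Return an (audio_len, text_len) matrix whose mass is concentrated near the diagonal."""
    k = np.arange(text_len)
    prior = np.zeros((audio_len, text_len))
    for t in range(1, audio_len + 1):
        a, b = scaling * t, scaling * (audio_len + 1 - t)
        prior[t - 1] = betabinom(text_len - 1, a, b).pmf(k)
    return prior

prior = beta_binomial_prior(text_len=40, audio_len=100)  # the peak moves left to right over time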
Training Objectives
The model is trained with multiple loss functions:
| Loss | Description | Scale |
|---|---|---|
| Codebook Loss | Cross-entropy on predicted audio tokens | 1.0 |
| Alignment Loss | Forward-sum loss for monotonic attention | 0.002 |
| Local Transformer Loss | Auxiliary loss for codebook prediction | 1.0 |
| Alignment Encoder Loss | Auxiliary loss for learned alignment | 1.0 |
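Conceptually, the total objective is the weighted sum of these terms. A schematic helper using the scales from the table (the actual training code may combine them differently):
# Schematic weighted sum of the training losses (scales taken from the table above)
import torch

def total_loss(codebook_loss, alignment_loss, local_transformer_loss, alignment_encoder_loss):
    return (
        1.0 * codebook_loss
        + 0.002 * alignment_loss
        + 1.0 * local_transformer_loss
        + 1.0 * alignment_encoder_loss
    )

loss = total_loss(*[torch.tensor(1.0) for _ in range(4)])  # dummy values for illustration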
Supported Model Types
MagpieTTS supports multiple architecture variants optimized for different use cases:
| Model Type | Description | Best For |
|---|---|---|
| `decoder_context_tts` | Text → Encoder; context audio/text + target audio → Decoder. Fixed-size context (5 seconds). | Standard voice cloning |
| `decoder_ce` | Same as above, with an additional Context Encoder network between the context and the decoder input. | Enhanced context processing |
NVIDIA NeMo
To train, fine-tune, or perform inference with MagpieTTS, you need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.
# Install system dependencies
apt-get update && apt-get install -y libsndfile1 ffmpeg
# Install NeMo with TTS support
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[tts]
How to Use This Model
Quick Start: Inference
MagpieTTS can be loaded and used for inference in multiple ways.
Method 1: Using the Inference Script
The recommended way to run inference is using the provided inference script:
# Inference from a .nemo checkpoint
python examples/tts/magpietts_inference.py \
--nemo_files /path/to/model.nemo \
--datasets libritts_test_clean \
--out_dir /path/to/output \
--codecmodel_path /path/to/codec.nemo \
--temperature 0.6 \
--topk 80
# Inference from hparams + checkpoint with evaluation
python examples/tts/magpietts_inference.py \
--hparams_files /path/to/hparams.yaml \
--checkpoint_files /path/to/model.ckpt \
--datasets libritts_test_clean,vctk \
--out_dir /path/to/output \
--codecmodel_path /path/to/codec.nemo \
--run_evaluation \
--num_repeats 3 \
--use_cfg \
--cfg_scale 2.5
Method 2: Python API
from nemo.collections.tts.models import MagpieTTSModel
# Load from .nemo file
model = MagpieTTSModel.restore_from("/path/to/magpietts.nemo")
model.eval()
model.cuda()
# Prepare your batch (see MagpieTTSDataset for data format)
# batch = {...}
# Run inference
predicted_audio, predicted_audio_lens, _, _, rtf_metrics, cross_attn_maps, _ = model.infer_batch(
batch,
max_decoder_steps=500,
temperature=0.6,
topk=80,
use_cfg=True,
cfg_scale=2.5,
)
# Save audio
import soundfile as sf
audio_np = predicted_audio[0].cpu().numpy()[:predicted_audio_lens[0]]
sf.write("output.wav", audio_np, model.sample_rate)
Training
Train MagpieTTS using Hydra configuration:
python examples/tts/magpietts.py \
--config-name=magpietts \
max_epochs=100 \
batch_size=16 \
model.codecmodel_path=/path/to/codec.nemo \
exp_manager.exp_dir=/path/to/experiments \
+train_ds_meta.libritts.manifest_path=/path/to/train_manifest.json \
+train_ds_meta.libritts.audio_dir=/path/to/audio \
+val_ds_meta.libritts_val.manifest_path=/path/to/val_manifest.json \
+val_ds_meta.libritts_val.audio_dir=/path/to/audio
Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.6 | Sampling temperature for token generation. Lower = more deterministic. |
| `topk` | 80 | Top-k sampling parameter. Limits the vocabulary to the k most likely tokens. |
| `max_decoder_steps` | 440 | Maximum number of decoder steps (frames to generate). |
| `use_cfg` | False | Enable Classifier-Free Guidance for improved quality. |
| `cfg_scale` | 2.5 | CFG scale factor. Higher = stronger guidance. |
| `apply_attention_prior` | False | Apply the monotonic attention prior for alignment. |
| `attention_prior_epsilon` | 0.1 | Epsilon value for the attention prior. |
| `use_local_transformer` | False | Use the Local Transformer for codebook prediction. |
| `maskgit_n_steps` | 3 | Number of MaskGit refinement steps (if using MaskGit). |
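To see how temperature and topk interact during sampling, here is a generic temperature-scaled top-k sampling sketch (a standard technique shown for illustration, not MagpieTTS's internal code):
# Generic temperature + top-k sampling over one set of logits (standard technique, for illustration)
import torch

def sample_top_k(logits, temperature=0.6, topk=80):
    logits = logits / temperature            # lower temperature -> sharper distribution
    values, indices = logits.topk(topk)      # restrict to the k most likely tokens
    probs = torch.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return indices[choice]

token = sample_top_k(torch.randn(2048))      # e.g. a 2048-entry codebook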
EOS Detection Methods
| Method | Description |
|---|---|
| `argmax_any` | Stop when any codebook predicts EOS via argmax |
| `argmax_or_multinomial_any` | Stop when any codebook predicts EOS (default) |
| `argmax_all` | Stop when all codebooks predict EOS via argmax |
| `argmax_zero_cb` | Stop when codebook 0 predicts EOS via argmax |
Evaluation Metrics
MagpieTTS evaluation includes multiple quality metrics:
| Metric | Description |
|---|---|
| CER | Character Error Rate - measures transcription accuracy |
| WER | Word Error Rate - measures transcription accuracy at word level |
| SSIM | Speaker Similarity - cosine similarity between speaker embeddings |
| UTMOSv2 | Mean Opinion Score prediction for audio quality |
| RTF | Real-Time Factor - inference speed metric |
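For reference, CER and WER are both edit-distance ratios computed over the ASR transcription of the generated audio; below is a minimal standalone sketch of the metrics themselves (the evaluation script handles the ASR step):
# Minimal CER/WER sketch via Levenshtein distance (the evaluation pipeline transcribes audio with ASR first)
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the previous row's value at column j-1
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(1, len(ref))

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(1, len(ref.split()))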
Run evaluation with the inference script:
python examples/tts/magpietts_inference.py \
--nemo_files /path/to/model.nemo \
--datasets libritts_test_clean \
--out_dir /path/to/output \
--codecmodel_path /path/to/codec.nemo \
--run_evaluation \
--num_repeats 3 \
--cer_target 0.1 \
--ssim_target 0.8
Advanced Features
Frame Stacking
Frame stacking accelerates inference by having the base decoder process multiple consecutive audio frames in a single forward pass.
Overview
In this two-stage approach:
- The base decoder processes multiple frames at once, producing a single latent representation for each group (stack) of frames
- The Local Transformer then generates the individual frames × codebooks tokens within each stack
Configuration
Enable frame stacking by setting frame_stacking_factor > 1 in your YAML config:
model:
frame_stacking_factor: 2 # Process 2 frames per decoder step
local_transformer_type: "autoregressive" # Required with frame stacking
local_transformer_n_layers: 1
local_transformer_n_heads: 1
local_transformer_hidden_dim: 256
Speed Benefits
The Local Transformer is much faster than the base decoder due to:
- Fewer parameters: The LT decoder is lightweight compared to the base decoder
- Shorter sequences: The LT decoder only attends to the current frame stack and latent, not the entire sequence
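As a rough illustration of the effect on step count, using the typical 21 Hz frame rate mentioned earlier:
# Rough illustration: frame stacking reduces the number of base-decoder forward passes
import math

frame_rate_hz = 21
duration_s = 10.0
frame_stacking_factor = 2

frames = math.ceil(duration_s * frame_rate_hz)             # 210 codec frames
decoder_steps = math.ceil(frames / frame_stacking_factor)  # 105 base-decoder steps instead of 210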
Current Limitations
- Online code extraction combined with frame-stacking is not yet implemented
- Alignment encoder with frame-stacking is not yet tested
- CTC loss with frame-stacking is not yet implemented
Preference Optimization (DPO/GRPO)
MagpieTTS supports both offline (DPO) and online (GRPO) preference optimization for improved output quality.
Offline Preference Alignment (DPO/RPO)
Step 1: Create text-context pairs
python scripts/magpietts/dpo/create_text_contextpairs.py \
--challenging_texts /path/to/challenging_texts.txt \
--regular_texts_for_audiocontext /path/to/regular_texts.txt \
--regular_texts_for_textcontext /path/to/text_context_texts.txt \
--audio_contexts /path/to/audio_contexts.json \
--text_contexts /path/to/text_contexts.txt \
--output_manifest /path/to/text_context_pairs.json \
--nsamples_perpair 6
Step 2: Generate audios for each pair
python examples/tts/magpietts.py \
--config-name=magpietts_po_inference \
mode=test \
batch_size=64 \
+init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
exp_manager.exp_dir=/path/to/po_exp \
+test_ds_meta.textcontextpairs.manifest_path=/path/to/text_context_pairs.json \
+test_ds_meta.textcontextpairs.audio_dir="/" \
model.codecmodel_path=/path/to/codec.nemo
Step 3: Create chosen-rejected pairs
python scripts/magpietts/dpo/create_preference_pairs.py \
--input_manifest /path/to/text_context_pairs.json \
--generated_audio_dir /path/to/po_exp/audio \
--group_size 6 \
--cer_threshold 0.01 \
--val_size 256
Step 4: DPO Finetuning
python examples/tts/magpietts.py \
batch_size=4 \
+init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
+mode="dpo_train" \
max_epochs=10 \
+model.dpo_beta=0.01 \
model.train_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
model.validation_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
+train_ds_meta.dpopreftrain.manifest_path=/path/to/dpo_train_manifest.json \
model.optim.lr=2e-7
Online Preference Optimization (GRPO)
GRPO generates samples online and optimizes based on computed rewards (CER, SSIM, PESQ).
python examples/tts/magpietts.py \
batch_size=2 \
+init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
+mode="onlinepo_train" \
+model.num_generations_per_item=12 \
+model.reference_free=true \
+model.cer_reward_weight=0.33 \
+model.ssim_reward_weight=0.33 \
+model.pesq_reward_weight=0.33 \
+model.use_pesq=true \
+model.loss_type="grpo" \
+model.scale_rewards=true \
+model.inference_temperature=0.8 \
+model.use_kv_cache_during_online_po=true \
model.optim.lr=1e-7 \
trainer.precision=32 \
+trainer.gradient_clip_val=2.5
Key GRPO Parameters:
| Parameter | Default | Description |
|---|---|---|
| `num_generations_per_item` | 12 | Number of samples generated per batch item |
| `reference_free` | True | Skip the KL divergence loss term |
| `cer_reward_weight` | 0.33 | Weight for the CER reward |
| `ssim_reward_weight` | 0.33 | Weight for the speaker similarity reward |
| `pesq_reward_weight` | 0.33 | Weight for the PESQ audio quality reward |
| `loss_type` | "grpo" | Use "grpo" or "dr_grpo" |
| `scale_rewards` | True | Normalize advantages by standard deviation |
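Schematically, the per-sample reward combines the three signals with the weights above. How each metric is mapped to a reward (e.g. using 1 - CER so that lower error scores higher) is an assumption in this sketch, as are the helper's argument names:
# Schematic reward combination with the default weights (reward shaping here is an assumption)
def combined_reward(cer, ssim, pesq_normalized, cer_w=0.33, ssim_w=0.33, pesq_w=0.33):
    # Lower CER is better, so reward its complement; SSIM and normalized PESQ are "higher is better"
    return cer_w * (1.0 - cer) + ssim_w * ssim + pesq_w * pesq_normalized

reward = combined_reward(cer=0.05, ssim=0.85, pesq_normalized=0.7)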
Legacy Checkpoint Support
MagpieTTS has evolved its embedding table layout over time. For checkpoints created before April 2025, you may need to enable legacy mode.
Using Legacy Checkpoints
With the Inference Script
python examples/tts/magpietts_inference.py \
--hparams_files /path/to/hparams.yaml \
--checkpoint_files /path/to/old_checkpoint.ckpt \
--codecmodel_path /path/to/codec.nemo \
--legacy_codebooks \
--legacy_text_conditioning \
...
With Hydra Command Line
# For decoder_context_tts models
python examples/tts/magpietts.py \
... \
+model.forced_num_all_tokens_per_codebook=2048 \
+model.forced_audio_eos_id=2047 \
+model.forced_audio_bos_id=2046 \
+model.forced_context_audio_eos_id=2045 \
+model.forced_context_audio_bos_id=2044
# For other model types
python examples/tts/magpietts.py \
... \
+model.forced_num_all_tokens_per_codebook=2048 \
+model.forced_audio_eos_id=2047 \
+model.forced_audio_bos_id=2046 \
+model.forced_context_audio_eos_id=2047 \
+model.forced_context_audio_bos_id=2046
Embedding Table Layouts
| Version | Layout | Notes |
|---|---|---|
| Legacy (pre-April 2025) | Special tokens at indices 2044-2047 | Use --legacy_codebooks |
| Current | Special tokens immediately after codec tokens (2016-2023) | Automatic |
Datasets
Supported Evaluation Datasets
| Dataset | Description |
|---|---|
| `libritts_test_clean` | LibriTTS test-clean subset |
| `libritts_seen` | LibriTTS seen-speakers evaluation |
| `vctk` | VCTK multi-speaker dataset |
| `riva_hard_digits` | RIVA challenging digits |
| `riva_hard_letters` | RIVA challenging letters |
| `riva_hard_money` | RIVA challenging monetary values |
| `riva_hard_short` | RIVA challenging short utterances |
Manifest Format
MagpieTTS expects JSONL manifest files with the following format:
{
"audio_filepath": "path/to/target_audio.wav",
"text": "The transcript text",
"duration": 5.2,
"context_audio_filepath": "path/to/context_audio.wav",
"context_audio_duration": 5.0
}
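A small helper for producing such a manifest (one JSON object per line) might look like the following; the paths and durations are placeholders:
# Write a JSONL manifest in the format shown above (paths and durations are placeholders)
import json

entries = [
    {
        "audio_filepath": "path/to/target_audio.wav",
        "text": "The transcript text",
        "duration": 5.2,
        "context_audio_filepath": "path/to/context_audio.wav",
        "context_audio_duration": 5.0,
    },
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")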
References
- NVIDIA NeMo Framework: https://github.com/NVIDIA/NeMo
- Fast Conformer with Linearly Scalable Attention: arXiv:2305.05084
- Attention Is All You Need: arXiv:1706.03762
- Direct Preference Optimization (DPO): arXiv:2305.18290
- Group Relative Policy Optimization (GRPO): arXiv:2402.03300
Discover More from NVIDIA
For documentation, deployment guides, enterprise-ready APIs, and the latest open models, including Nemotron and other cutting-edge speech, translation, and generative AI, visit the NVIDIA Developer Portal.
License: Apache 2.0 | Copyright: © 2025 NVIDIA Corporation