
MagpieTTS


MagpieTTS is NVIDIA's state-of-the-art neural Text-to-Speech (TTS) model that generates high-quality, natural-sounding speech from text using audio or text context for voice cloning and style transfer. Built on the NeMo Framework, MagpieTTS leverages transformer-based architectures with neural audio codecs to produce expressive, controllable speech synthesis.


Key Features

MagpieTTS offers several key advantages for speech synthesis:

| Feature | Description |
|---|---|
| Zero-Shot Voice Cloning | Clone any voice with just a few seconds of reference audio |
| Text Context Support | Use text descriptions to control speaker style and characteristics |
| High-Quality Audio | Generates 24kHz audio using neural audio codecs |
| Classifier-Free Guidance | Enhanced generation quality with CFG support |
| Frame Stacking | Accelerated inference through multi-frame processing |
| Preference Optimization | DPO and GRPO support for improved output quality |
| Attention Prior | Monotonic alignment enforcement for robust synthesis |
| Local Transformer | Fast codebook prediction with MaskGit or autoregressive decoding |

Model Architecture

MagpieTTS is built on a transformer encoder-decoder architecture that processes text and audio context to generate neural audio codec tokens. The model learns to synthesize speech by predicting discrete audio tokens from a neural audio codec, conditioned on text transcripts and speaker context.

Core Components

1. Encoder

The encoder processes the input transcript and produces contextualized representations:

  • Tokenization: Text is converted to IPA phonemes using the IPA tokenizer with grapheme-to-phoneme (G2P) conversion, or using a BPE character tokenizer for multilingual support
  • Architecture: Transformer encoder with causal self-attention
  • Output: Text representations used as conditioning for the decoder via cross-attention
# Default encoder configuration
encoder:
  n_layers: 6
  d_model: 768
  d_ffn: 3072
  sa_n_heads: 12
  is_causal: true

2. Context Processing

MagpieTTS supports two types of context for voice cloning and style transfer:

| Context Type | Description | Processing |
|---|---|---|
| Audio Context | Reference speech audio (typically 5 seconds) | Encoded via audio codec → embedded → fed to decoder |
| Text Context | Textual description of speaker/style | Tokenized via ByT5 → embedded → fed to decoder |

The context is prepended to the decoder input, allowing the model to attend to speaker characteristics during generation.
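
Conceptually, the context simply occupies the first positions of the decoder's input sequence. A minimal sketch of this prepending step is shown below; the function name and tensor shapes are illustrative assumptions, not the actual NeMo module interface.

# Illustrative sketch of prepending context embeddings to the decoder input.
# `context_emb` comes from the audio-codec path or the ByT5 text path; names and
# shapes are assumptions for illustration only.
import torch

def build_decoder_input(context_emb: torch.Tensor, audio_token_emb: torch.Tensor) -> torch.Tensor:
    """
    context_emb:     [batch, context_len, d_model]   (audio or text context)
    audio_token_emb: [batch, t, d_model]             (embeddings of audio tokens 0..t-1)
    Returns:         [batch, context_len + t, d_model]
    """
    # The decoder self-attends causally over this combined sequence, so every
    # generated frame can look back at the speaker/style context.
    return torch.cat([context_emb, audio_token_emb], dim=1)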

3. Decoder

The decoder is the core generation component that autoregressively predicts audio codec tokens:

  • Architecture: Transformer decoder with causal self-attention and cross-attention to encoder output
  • Input: Context embeddings + previously generated audio token embeddings
  • Cross-Attention: Attends to text encoder output for content alignment
  • Output: Logits for all codebooks at each timestep
# Default decoder configuration  
decoder:
  n_layers: 12
  d_model: 768
  d_ffn: 3072
  sa_n_heads: 12
  xa_n_heads: 1       # Cross-attention heads
  has_xattn: true
  is_causal: true

4. Local Transformer (Optional)

For models using multiple codebooks, the Local Transformer refines per-frame predictions:

| Mode | Description |
|---|---|
| Autoregressive | Predicts codebooks sequentially within each frame |
| MaskGit | Parallel prediction with iterative refinement |

The Local Transformer takes the decoder's hidden state and generates tokens for all codebooks at that timestep.
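
As a rough illustration of the MaskGit mode, the sketch below iteratively unmasks the most confident codebook predictions within a frame. The local_transformer callable and mask_id are hypothetical placeholders, not the NeMo implementation.

# MaskGit-style refinement sketch for one frame: start fully masked, then keep the
# most confident codebook predictions each step and re-predict the rest.
# `local_transformer` and `mask_id` are hypothetical placeholders.
import math
import torch

def maskgit_decode_frame(local_transformer, decoder_hidden, n_codebooks=8, n_steps=3, mask_id=0):
    tokens = torch.full((n_codebooks,), mask_id, dtype=torch.long)
    masked = torch.ones(n_codebooks, dtype=torch.bool)
    per_step = math.ceil(n_codebooks / n_steps)
    for _ in range(n_steps):
        logits = local_transformer(decoder_hidden, tokens)       # [n_codebooks, vocab_size]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, float("-inf"))          # only still-masked slots compete
        n_unmask = min(per_step, int(masked.sum()))
        keep = conf.topk(n_unmask).indices                       # most confident masked positions
        tokens[keep] = pred[keep]
        masked[keep] = False
    return tokens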

5. Audio Codec

MagpieTTS uses a neural audio codec to convert between waveforms and discrete tokens:

  • Encoding: Raw audio → Discrete tokens (multiple codebooks)
  • Decoding: Discrete tokens → High-quality 24kHz waveform
  • Frame Rate: Typically 21 Hz (frames per second)
  • Codebooks: Multiple parallel codebooks (e.g., 8) for high-fidelity reconstruction
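
A back-of-envelope count of codec tokens per utterance, using the nominal 21 Hz frame rate and 8 codebooks mentioned above:

# Rough token-count arithmetic using the nominal 21 Hz frame rate and 8 codebooks.
frame_rate_hz = 21
n_codebooks = 8
duration_s = 5.0                               # e.g., a 5-second utterance

n_frames = round(duration_s * frame_rate_hz)   # ~105 decoder timesteps
n_tokens = n_frames * n_codebooks              # ~840 discrete codec tokens
print(n_frames, n_tokens)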

Generation Process

During inference, MagpieTTS generates audio through the following steps:

1. ENCODE TEXT
   └── Transcript → IPA Tokens → Text Encoder → Text Representations

2. PREPARE CONTEXT
   └── Context Audio → Audio Codec → Context Embeddings
   └── OR: Context Text → ByT5 Tokens → Context Embeddings

3. AUTOREGRESSIVE DECODING
   └── For each timestep t:
       ├── Input: [Context Embeddings, Audio Tokens 0..t-1]
       ├── Decoder: Cross-attend to text, self-attend causally
       ├── Output: Hidden state for timestep t
       └── Local Transformer: Hidden → Codebook tokens

4. DECODE AUDIO
   └── All Tokens → Audio Codec Decoder → Waveform (24kHz)

Attention Mechanisms

MagpieTTS employs multiple attention mechanisms for robust generation:

| Mechanism | Purpose |
|---|---|
| Causal Self-Attention | Autoregressive generation in decoder |
| Cross-Attention | Align audio generation with text content |
| Attention Prior | Beta-binomial prior for monotonic alignment |
| Alignment Encoder | Learned alignment for attention guidance |
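
The beta-binomial prior places a soft, near-diagonal band over the text positions for each audio frame. The sketch below follows the standard construction for monotonic TTS alignment; the scaling factor is an illustrative assumption, not the exact MagpieTTS value.

# Beta-binomial prior over (audio_len, text_len) that softly encourages monotonic
# text-audio alignment. `scaling` is illustrative, not the MagpieTTS default.
import numpy as np
from scipy.stats import betabinom

def beta_binomial_prior(text_len: int, audio_len: int, scaling: float = 1.0) -> np.ndarray:
    prior = np.zeros((audio_len, text_len))
    k = np.arange(text_len)
    for t in range(1, audio_len + 1):
        a, b = scaling * t, scaling * (audio_len + 1 - t)
        prior[t - 1] = betabinom.pmf(k, text_len - 1, a, b)   # peak shifts forward with t
    return prior   # typically combined with the cross-attention scores in log space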

Training Objectives

The model is trained with multiple loss functions:

| Loss | Description | Scale |
|---|---|---|
| Codebook Loss | Cross-entropy on predicted audio tokens | 1.0 |
| Alignment Loss | Forward-sum loss for monotonic attention | 0.002 |
| Local Transformer Loss | Auxiliary loss for codebook prediction | 1.0 |
| Alignment Encoder Loss | Auxiliary loss for learned alignment | 1.0 |
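
The total objective is a weighted sum of these terms using the scales above; a schematic combination (the individual loss values are placeholders for those computed by the model):

# Schematic weighted sum of the MagpieTTS training losses using the scales listed above.
def total_loss(codebook_loss, alignment_loss, local_transformer_loss, alignment_encoder_loss):
    return (
        1.0 * codebook_loss              # cross-entropy on predicted audio tokens
        + 0.002 * alignment_loss         # forward-sum loss for monotonic attention
        + 1.0 * local_transformer_loss   # auxiliary codebook prediction loss
        + 1.0 * alignment_encoder_loss   # auxiliary learned-alignment loss
    )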

Supported Model Types

MagpieTTS supports multiple architecture variants optimized for different use cases:

| Model Type | Description | Best For |
|---|---|---|
| decoder_context_tts | Text → Encoder; context audio/text + target audio → Decoder. Fixed-size context (5 seconds). | Standard voice cloning |
| decoder_ce | Same as above, with an additional Context Encoder network between context and decoder input. | Enhanced context processing |

NVIDIA NeMo

To train, fine-tune, or perform inference with MagpieTTS, you need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

# Install system dependencies
apt-get update && apt-get install -y libsndfile1 ffmpeg

# Install NeMo with TTS support
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[tts]

How to Use This Model

Quick Start: Inference

MagpieTTS can be loaded and used for inference in multiple ways.

Method 1: Using the Inference Script

The recommended way to run inference is using the provided inference script:

# Inference from a .nemo checkpoint
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/model.nemo \
    --datasets libritts_test_clean \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --temperature 0.6 \
    --topk 80

# Inference from hparams + checkpoint with evaluation
python examples/tts/magpietts_inference.py \
    --hparams_files /path/to/hparams.yaml \
    --checkpoint_files /path/to/model.ckpt \
    --datasets libritts_test_clean,vctk \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --run_evaluation \
    --num_repeats 3 \
    --use_cfg \
    --cfg_scale 2.5

Method 2: Python API

from nemo.collections.tts.models import MagpieTTSModel

# Load from .nemo file
model = MagpieTTSModel.restore_from("/path/to/magpietts.nemo")
model.eval()
model.cuda()

# Prepare your batch (see MagpieTTSDataset for data format)
# batch = {...}

# Run inference
predicted_audio, predicted_audio_lens, _, _, rtf_metrics, cross_attn_maps, _ = model.infer_batch(
    batch,
    max_decoder_steps=500,
    temperature=0.6,
    topk=80,
    use_cfg=True,
    cfg_scale=2.5,
)

# Save audio
import soundfile as sf
audio_np = predicted_audio[0].cpu().numpy()[:predicted_audio_lens[0]]
sf.write("output.wav", audio_np, model.sample_rate)

Training

Train MagpieTTS using Hydra configuration:

python examples/tts/magpietts.py \
    --config-name=magpietts \
    max_epochs=100 \
    batch_size=16 \
    model.codecmodel_path=/path/to/codec.nemo \
    exp_manager.exp_dir=/path/to/experiments \
    +train_ds_meta.libritts.manifest_path=/path/to/train_manifest.json \
    +train_ds_meta.libritts.audio_dir=/path/to/audio \
    +val_ds_meta.libritts_val.manifest_path=/path/to/val_manifest.json \
    +val_ds_meta.libritts_val.audio_dir=/path/to/audio

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Sampling temperature for token generation. Lower = more deterministic. |
| topk | 80 | Top-k sampling parameter. Limits vocabulary to the k most likely tokens. |
| max_decoder_steps | 440 | Maximum number of decoder steps (frames to generate). |
| use_cfg | False | Enable Classifier-Free Guidance for improved quality. |
| cfg_scale | 2.5 | CFG scale factor. Higher = stronger guidance. |
| apply_attention_prior | False | Apply monotonic attention prior for alignment. |
| attention_prior_epsilon | 0.1 | Epsilon value for the attention prior. |
| use_local_transformer | False | Use the local transformer for codebook prediction. |
| maskgit_n_steps | 3 | Number of MaskGit refinement steps (if using MaskGit). |
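
With CFG enabled, each decoding step combines conditional and unconditional predictions scaled by cfg_scale. A common formulation is sketched below; whether MagpieTTS applies exactly this combination internally is an assumption.

# Common classifier-free guidance combination of conditional and unconditional logits.
# The exact formulation used inside MagpieTTS may differ.
import torch

def cfg_combine(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float = 2.5) -> torch.Tensor:
    # cfg_scale > 1 pushes the distribution further toward the conditioned prediction.
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)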

EOS Detection Methods

| Method | Description |
|---|---|
| argmax_any | Stop when any codebook predicts EOS via argmax |
| argmax_or_multinomial_any | Stop when any codebook predicts EOS (default) |
| argmax_all | Stop when all codebooks predict EOS via argmax |
| argmax_zero_cb | Stop when codebook 0 predicts EOS via argmax |
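
For example, the argmax_any rule reduces to checking whether any codebook's argmax is the EOS token; the eos_id below is a placeholder whose value depends on the embedding table layout.

# Sketch of the `argmax_any` EOS check: stop as soon as any codebook's argmax is EOS.
# `eos_id` is a placeholder; the actual id depends on the embedding table layout.
import torch

def should_stop_argmax_any(frame_logits: torch.Tensor, eos_id: int) -> bool:
    """frame_logits: [n_codebooks, vocab_size] for the current frame."""
    return bool((frame_logits.argmax(dim=-1) == eos_id).any())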

Evaluation Metrics

MagpieTTS evaluation includes multiple quality metrics:

| Metric | Description |
|---|---|
| CER | Character Error Rate - measures transcription accuracy |
| WER | Word Error Rate - measures transcription accuracy at word level |
| SSIM | Speaker Similarity - cosine similarity between speaker embeddings |
| UTMOSv2 | Mean Opinion Score prediction for audio quality |
| RTF | Real-Time Factor - inference speed metric |
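
CER and WER compare ASR transcripts of the generated audio against the input text; for instance, they can be computed with the jiwer package (the evaluation script's internal implementation may differ).

# Example CER/WER computation from an ASR transcript of the synthesized audio using jiwer.
# The inference script's own evaluation code may use a different implementation.
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown fax"   # ASR transcript of the generated audio

print(f"WER={jiwer.wer(reference, hypothesis):.3f} CER={jiwer.cer(reference, hypothesis):.3f}")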

Run evaluation with the inference script:

python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/model.nemo \
    --datasets libritts_test_clean \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --run_evaluation \
    --num_repeats 3 \
    --cer_target 0.1 \
    --ssim_target 0.8

Advanced Features

Frame Stacking

Frame stacking accelerates inference by having the base decoder process multiple consecutive audio frames in a single forward pass.

Overview

In this two-stage approach:

  1. The base decoder processes multiple frames at once, producing a single latent representation for each group (stack) of frames
  2. The Local Transformer then generates the individual tokens within each stack (frames × codebooks)

Configuration

Enable frame stacking by setting frame_stacking_factor > 1 in your YAML config:

model:
  frame_stacking_factor: 2  # Process 2 frames per decoder step
  local_transformer_type: "autoregressive"  # Required with frame stacking
  local_transformer_n_layers: 1
  local_transformer_n_heads: 1
  local_transformer_hidden_dim: 256
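
The reduction in base-decoder forward passes scales with the stacking factor; a quick sanity check using the nominal 21 Hz codec frame rate (a typical value noted earlier, not a guaranteed constant):

# Rough effect of frame stacking on base-decoder steps for a 10-second utterance.
import math

frame_rate_hz = 21
duration_s = 10.0
frame_stacking_factor = 2

n_frames = round(duration_s * frame_rate_hz)                    # 210 codec frames
n_decoder_steps = math.ceil(n_frames / frame_stacking_factor)   # 105 base-decoder forward passes
print(n_frames, n_decoder_steps)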

Speed Benefits

The Local Transformer is much faster than the base decoder due to:

  • Fewer parameters: The LT decoder is lightweight compared to the base decoder
  • Shorter sequences: The LT decoder only attends to the current frame stack and latent, not the entire sequence

Current Limitations

  • Online code extraction combined with frame-stacking is not yet implemented
  • Alignment encoder with frame-stacking is not yet tested
  • CTC loss with frame-stacking is not yet implemented

Preference Optimization (DPO/GRPO)

MagpieTTS supports both offline (DPO) and online (GRPO) preference optimization for improved output quality.

Offline Preference Alignment (DPO/RPO)

Step 1: Create text-context pairs

python scripts/magpietts/dpo/create_text_contextpairs.py \
    --challenging_texts /path/to/challenging_texts.txt \
    --regular_texts_for_audiocontext /path/to/regular_texts.txt \
    --regular_texts_for_textcontext /path/to/text_context_texts.txt \
    --audio_contexts /path/to/audio_contexts.json \
    --text_contexts /path/to/text_contexts.txt \
    --output_manifest /path/to/text_context_pairs.json \
    --nsamples_perpair 6

Step 2: Generate audios for each pair

python examples/tts/magpietts.py \
    --config-name=magpietts_po_inference \
    mode=test \
    batch_size=64 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    exp_manager.exp_dir=/path/to/po_exp \
    +test_ds_meta.textcontextpairs.manifest_path=/path/to/text_context_pairs.json \
    +test_ds_meta.textcontextpairs.audio_dir="/" \
    model.codecmodel_path=/path/to/codec.nemo

Step 3: Create chosen-rejected pairs

python scripts/magpietts/dpo/create_preference_pairs.py \
    --input_manifest /path/to/text_context_pairs.json \
    --generated_audio_dir /path/to/po_exp/audio \
    --group_size 6 \
    --cer_threshold 0.01 \
    --val_size 256

Step 4: DPO Finetuning

python examples/tts/magpietts.py \
    batch_size=4 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    +mode="dpo_train" \
    max_epochs=10 \
    +model.dpo_beta=0.01 \
    model.train_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
    model.validation_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
    +train_ds_meta.dpopreftrain.manifest_path=/path/to/dpo_train_manifest.json \
    model.optim.lr=2e-7

Online Preference Optimization (GRPO)

GRPO generates samples online and optimizes based on computed rewards (CER, SSIM, PESQ).

python examples/tts/magpietts.py \
    batch_size=2 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    +mode="onlinepo_train" \
    +model.num_generations_per_item=12 \
    +model.reference_free=true \
    +model.cer_reward_weight=0.33 \
    +model.ssim_reward_weight=0.33 \
    +model.pesq_reward_weight=0.33 \
    +model.use_pesq=true \
    +model.loss_type="grpo" \
    +model.scale_rewards=true \
    +model.inference_temperature=0.8 \
    +model.use_kv_cache_during_online_po=true \
    model.optim.lr=1e-7 \
    trainer.precision=32 \
    +trainer.gradient_clip_val=2.5

Key GRPO Parameters:

| Parameter | Default | Description |
|---|---|---|
| num_generations_per_item | 12 | Number of samples generated per batch item |
| reference_free | True | Skip KL divergence loss term |
| cer_reward_weight | 0.33 | Weight for CER reward |
| ssim_reward_weight | 0.33 | Weight for speaker similarity reward |
| pesq_reward_weight | 0.33 | Weight for PESQ audio quality reward |
| loss_type | "grpo" | Use "grpo" or "dr_grpo" |
| scale_rewards | True | Normalize advantages by standard deviation |
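
The per-sample reward is a weighted combination of the individual rewards, and advantages are normalized within each group of generations for the same item. The sketch below is schematic, using the weights above; it is not the exact NeMo implementation.

# Schematic GRPO-style reward combination and group-normalized advantages.
# Reward values and shapes are illustrative; not the exact NeMo implementation.
import torch

def combined_reward(cer_reward, ssim_reward, pesq_reward,
                    cer_w=0.33, ssim_w=0.33, pesq_w=0.33):
    return cer_w * cer_reward + ssim_w * ssim_reward + pesq_w * pesq_reward

def group_advantages(rewards: torch.Tensor, scale_rewards: bool = True) -> torch.Tensor:
    """rewards: [num_generations_per_item] rewards for one batch item."""
    adv = rewards - rewards.mean()
    if scale_rewards:                    # normalize by std deviation when scale_rewards=True
        adv = adv / (rewards.std() + 1e-8)
    return adv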

Legacy Checkpoint Support

MagpieTTS has evolved its embedding table layout over time. For checkpoints created before April 2025, you may need to enable legacy mode.

Using Legacy Checkpoints

With the Inference Script

python examples/tts/magpietts_inference.py \
    --hparams_files /path/to/hparams.yaml \
    --checkpoint_files /path/to/old_checkpoint.ckpt \
    --codecmodel_path /path/to/codec.nemo \
    --legacy_codebooks \
    --legacy_text_conditioning \
    ...

With Hydra Command Line

# For decoder_context_tts models
python examples/tts/magpietts.py \
    ... \
    +model.forced_num_all_tokens_per_codebook=2048 \
    +model.forced_audio_eos_id=2047 \
    +model.forced_audio_bos_id=2046 \
    +model.forced_context_audio_eos_id=2045 \
    +model.forced_context_audio_bos_id=2044

# For other model types
python examples/tts/magpietts.py \
    ... \
    +model.forced_num_all_tokens_per_codebook=2048 \
    +model.forced_audio_eos_id=2047 \
    +model.forced_audio_bos_id=2046 \
    +model.forced_context_audio_eos_id=2047 \
    +model.forced_context_audio_bos_id=2046

Embedding Table Layouts

| Version | Layout | Notes |
|---|---|---|
| Legacy (pre-April 2025) | Special tokens at indices 2044-2047 | Use --legacy_codebooks |
| Current | Special tokens immediately after codec tokens (2016-2023) | Automatic |

Datasets

Supported Evaluation Datasets

| Dataset | Description |
|---|---|
| libritts_test_clean | LibriTTS test-clean subset |
| libritts_seen | LibriTTS seen speakers evaluation |
| vctk | VCTK multi-speaker dataset |
| riva_hard_digits | RIVA challenging digits |
| riva_hard_letters | RIVA challenging letters |
| riva_hard_money | RIVA challenging monetary values |
| riva_hard_short | RIVA challenging short utterances |

Manifest Format

MagpieTTS expects JSONL manifest files with the following format:

{
  "audio_filepath": "path/to/target_audio.wav",
  "text": "The transcript text",
  "duration": 5.2,
  "context_audio_filepath": "path/to/context_audio.wav",
  "context_audio_duration": 5.0
}
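
Each record occupies one line of the manifest file. A small helper for writing such a JSONL manifest is sketched below; the paths and durations are illustrative placeholders.

# Write a JSONL manifest where each line is one JSON record in the format above.
import json

records = [
    {
        "audio_filepath": "path/to/target_audio.wav",
        "text": "The transcript text",
        "duration": 5.2,
        "context_audio_filepath": "path/to/context_audio.wav",
        "context_audio_duration": 5.0,
    },
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")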

References

  1. NVIDIA NeMo Framework: https://github.com/NVIDIA/NeMo
  2. Fast Conformer with Linearly Scalable Attention: arXiv:2305.05084
  3. Attention Is All You Need: arXiv:1706.03762
  4. Direct Preference Optimization (DPO): arXiv:2305.18290
  5. Group Relative Policy Optimization (GRPO): arXiv:2402.03300

Discover More from NVIDIA

For documentation, deployment guides, enterprise-ready APIs, and the latest open models, including Nemotron and other cutting-edge speech, translation, and generative AI, visit the NVIDIA Developer Portal.


License: Apache 2.0 | Copyright © 2025 NVIDIA Corporation