
MagpieTTS


MagpieTTS is NVIDIA's state-of-the-art neural Text-to-Speech (TTS) model that generates high-quality, natural-sounding speech from text using audio or text context for voice cloning and style transfer. Built on the NeMo Framework, MagpieTTS leverages transformer-based architectures with neural audio codecs to produce expressive, controllable speech synthesis.


Key Features

MagpieTTS offers several key advantages for speech synthesis:

| Feature | Description |
|---|---|
| Zero-Shot Voice Cloning | Clone any voice with just a few seconds of reference audio |
| Text Context Support | Use text descriptions to control speaker style and characteristics |
| High-Quality Audio | Generates 24kHz audio using neural audio codecs |
| Classifier-Free Guidance | Enhanced generation quality with CFG support |
| Frame Stacking | Accelerated inference through multi-frame processing |
| Preference Optimization | DPO and GRPO support for improved output quality |
| Attention Prior | Monotonic alignment enforcement for robust synthesis |
| Local Transformer | Fast codebook prediction with MaskGit or autoregressive decoding |

Model Architecture

MagpieTTS is built on a transformer encoder-decoder architecture that processes text and audio context to generate neural audio codec tokens. The model learns to synthesize speech by predicting discrete audio tokens from a neural audio codec, conditioned on text transcripts and speaker context.

Core Components

1. Encoder

The encoder processes the input transcript and produces contextualized representations:

  • Tokenization: Text is converted to IPA phonemes using the IPA tokenizer with grapheme-to-phoneme (G2P) conversion, or using a BPE character tokenizer for multilingual support
  • Architecture: Transformer encoder with causal self-attention
  • Output: Text representations used as conditioning for the decoder via cross-attention
# Default encoder configuration
encoder:
  n_layers: 6
  d_model: 768
  d_ffn: 3072
  sa_n_heads: 12
  is_causal: true

2. Context Processing

MagpieTTS supports two types of context for voice cloning and style transfer:

| Context Type | Description | Processing |
|---|---|---|
| Audio Context | Reference speech audio (typically 5 seconds) | Encoded via audio codec → embedded → fed to decoder |
| Text Context | Textual description of speaker/style | Tokenized via ByT5 → embedded → fed to decoder |

The context is prepended to the decoder input, allowing the model to attend to speaker characteristics during generation.
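
Conceptually, the context simply occupies the first positions of the decoder's input sequence. A minimal sketch of this prepending step is shown below; the function name and tensor shapes are illustrative assumptions, not the actual NeMo module interface.

# Illustrative sketch of prepending context embeddings to the decoder input.
# `context_emb` comes from the audio-codec path or the ByT5 text path; names and
# shapes are assumptions for illustration only.
import torch

def build_decoder_input(context_emb: torch.Tensor, audio_token_emb: torch.Tensor) -> torch.Tensor:
    """
    context_emb:     [batch, context_len, d_model]   (audio or text context)
    audio_token_emb: [batch, t, d_model]             (embeddings of audio tokens 0..t-1)
    Returns:         [batch, context_len + t, d_model]
    """
    # The decoder self-attends causally over this combined sequence, so every
    # generated frame can look back at the speaker/style context.
    return torch.cat([context_emb, audio_token_emb], dim=1)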

3. Decoder

The decoder is the core generation component that autoregressively predicts audio codec tokens:

  • Architecture: Transformer decoder with causal self-attention and cross-attention to encoder output
  • Input: Context embeddings + previously generated audio token embeddings
  • Cross-Attention: Attends to text encoder output for content alignment
  • Output: Logits for all codebooks at each timestep
# Default decoder configuration  
decoder:
  n_layers: 12
  d_model: 768
  d_ffn: 3072
  sa_n_heads: 12
  xa_n_heads: 1       # Cross-attention heads
  has_xattn: true
  is_causal: true

4. Local Transformer (Optional)

For models using multiple codebooks, the Local Transformer refines per-frame predictions:

| Mode | Description |
|---|---|
| Autoregressive | Predicts codebooks sequentially within each frame |
| MaskGit | Parallel prediction with iterative refinement |

The Local Transformer takes the decoder's hidden state and generates tokens for all codebooks at that timestep.
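
As a rough illustration of the MaskGit mode, the sketch below iteratively unmasks the most confident codebook predictions within a frame. The local_transformer callable and mask_id are hypothetical placeholders, not the NeMo implementation.

# MaskGit-style refinement sketch for one frame: start fully masked, then keep the
# most confident codebook predictions each step and re-predict the rest.
# `local_transformer` and `mask_id` are hypothetical placeholders.
import math
import torch

def maskgit_decode_frame(local_transformer, decoder_hidden, n_codebooks=8, n_steps=3, mask_id=0):
    tokens = torch.full((n_codebooks,), mask_id, dtype=torch.long)
    masked = torch.ones(n_codebooks, dtype=torch.bool)
    per_step = math.ceil(n_codebooks / n_steps)
    for _ in range(n_steps):
        logits = local_transformer(decoder_hidden, tokens)       # [n_codebooks, vocab_size]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, float("-inf"))          # only still-masked slots compete
        n_unmask = min(per_step, int(masked.sum()))
        keep = conf.topk(n_unmask).indices                       # most confident masked positions
        tokens[keep] = pred[keep]
        masked[keep] = False
    return tokens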

5. Audio Codec

MagpieTTS uses a neural audio codec to convert between waveforms and discrete tokens:

  • Encoding: Raw audio → Discrete tokens (multiple codebooks)
  • Decoding: Discrete tokens → High-quality 24kHz waveform
  • Frame Rate: Typically 21 Hz (frames per second)
  • Codebooks: Multiple parallel codebooks (e.g., 8) for high-fidelity reconstruction
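
A back-of-envelope count of codec tokens per utterance, using the nominal 21 Hz frame rate and 8 codebooks mentioned above:

# Rough token-count arithmetic using the nominal 21 Hz frame rate and 8 codebooks.
frame_rate_hz = 21
n_codebooks = 8
duration_s = 5.0                               # e.g., a 5-second utterance

n_frames = round(duration_s * frame_rate_hz)   # ~105 decoder timesteps
n_tokens = n_frames * n_codebooks              # ~840 discrete codec tokens
print(n_frames, n_tokens)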

Generation Process

During inference, MagpieTTS generates audio through the following steps:

1. ENCODE TEXT
   └── Transcript → IPA Tokens → Text Encoder → Text Representations

2. PREPARE CONTEXT
   └── Context Audio → Audio Codec → Context Embeddings
   └── OR: Context Text → ByT5 Tokens → Context Embeddings

3. AUTOREGRESSIVE DECODING
   └── For each timestep t:
       ├── Input: [Context Embeddings, Audio Tokens 0..t-1]
       ├── Decoder: Cross-attend to text, self-attend causally
       ├── Output: Hidden state for timestep t
       └── Local Transformer: Hidden → Codebook tokens

4. DECODE AUDIO
   └── All Tokens → Audio Codec Decoder → Waveform (24kHz)

Attention Mechanisms

MagpieTTS employs multiple attention mechanisms for robust generation:

| Mechanism | Purpose |
|---|---|
| Causal Self-Attention | Autoregressive generation in decoder |
| Cross-Attention | Align audio generation with text content |
| Attention Prior | Beta-binomial prior for monotonic alignment |
| Alignment Encoder | Learned alignment for attention guidance |
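
The beta-binomial prior places a soft, near-diagonal band over the text positions for each audio frame. The sketch below follows the standard construction for monotonic TTS alignment; the scaling factor is an illustrative assumption, not the exact MagpieTTS value.

# Beta-binomial prior over (audio_len, text_len) that softly encourages monotonic
# text-audio alignment. `scaling` is illustrative, not the MagpieTTS default.
import numpy as np
from scipy.stats import betabinom

def beta_binomial_prior(text_len: int, audio_len: int, scaling: float = 1.0) -> np.ndarray:
    prior = np.zeros((audio_len, text_len))
    k = np.arange(text_len)
    for t in range(1, audio_len + 1):
        a, b = scaling * t, scaling * (audio_len + 1 - t)
        prior[t - 1] = betabinom.pmf(k, text_len - 1, a, b)   # peak shifts forward with t
    return prior   # typically combined with the cross-attention scores in log space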

Training Objectives

The model is trained with multiple loss functions:

| Loss | Description | Scale |
|---|---|---|
| Codebook Loss | Cross-entropy on predicted audio tokens | 1.0 |
| Alignment Loss | Forward-sum loss for monotonic attention | 0.002 |
| Local Transformer Loss | Auxiliary loss for codebook prediction | 1.0 |
| Alignment Encoder Loss | Auxiliary loss for learned alignment | 1.0 |
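
The total objective is a weighted sum of these terms using the scales above; a schematic combination (the individual loss values are placeholders for those computed by the model):

# Schematic weighted sum of the MagpieTTS training losses using the scales listed above.
def total_loss(codebook_loss, alignment_loss, local_transformer_loss, alignment_encoder_loss):
    return (
        1.0 * codebook_loss              # cross-entropy on predicted audio tokens
        + 0.002 * alignment_loss         # forward-sum loss for monotonic attention
        + 1.0 * local_transformer_loss   # auxiliary codebook prediction loss
        + 1.0 * alignment_encoder_loss   # auxiliary learned-alignment loss
    )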

Supported Model Types

MagpieTTS supports multiple architecture variants optimized for different use cases:

| Model Type | Description | Best For |
|---|---|---|
| decoder_context_tts | Text → Encoder; context audio/text + target audio → Decoder. Fixed-size context (5 seconds). | Standard voice cloning |
| decoder_ce | Same as above, with an additional Context Encoder network between context and decoder input. | Enhanced context processing |

NVIDIA NeMo

To train, fine-tune, or perform inference with MagpieTTS, you need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

# Install system dependencies
apt-get update && apt-get install -y libsndfile1 ffmpeg

# Install NeMo with TTS support
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[tts]

How to Use This Model

Quick Start: Inference

MagpieTTS can be loaded and used for inference in multiple ways.

Method 1: Using the Inference Script

The recommended way to run inference is using the provided inference script:

# Inference from a .nemo checkpoint
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/model.nemo \
    --datasets libritts_test_clean \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --temperature 0.6 \
    --topk 80

# Inference from hparams + checkpoint with evaluation
python examples/tts/magpietts_inference.py \
    --hparams_files /path/to/hparams.yaml \
    --checkpoint_files /path/to/model.ckpt \
    --datasets libritts_test_clean,vctk \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --run_evaluation \
    --num_repeats 3 \
    --use_cfg \
    --cfg_scale 2.5

Method 2: Python API

from nemo.collections.tts.models import MagpieTTSModel

# Load from .nemo file
model = MagpieTTSModel.restore_from("/path/to/magpietts.nemo")
model.eval()
model.cuda()

# Prepare your batch (see MagpieTTSDataset for data format)
# batch = {...}

# Run inference
predicted_audio, predicted_audio_lens, _, _, rtf_metrics, cross_attn_maps, _ = model.infer_batch(
    batch,
    max_decoder_steps=500,
    temperature=0.6,
    topk=80,
    use_cfg=True,
    cfg_scale=2.5,
)

# Save audio
import soundfile as sf
audio_np = predicted_audio[0].cpu().numpy()[:predicted_audio_lens[0]]
sf.write("output.wav", audio_np, model.sample_rate)

Training

Train MagpieTTS using Hydra configuration:

python examples/tts/magpietts.py \
    --config-name=magpietts \
    max_epochs=100 \
    batch_size=16 \
    model.codecmodel_path=/path/to/codec.nemo \
    exp_manager.exp_dir=/path/to/experiments \
    +train_ds_meta.libritts.manifest_path=/path/to/train_manifest.json \
    +train_ds_meta.libritts.audio_dir=/path/to/audio \
    +val_ds_meta.libritts_val.manifest_path=/path/to/val_manifest.json \
    +val_ds_meta.libritts_val.audio_dir=/path/to/audio

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Sampling temperature for token generation. Lower = more deterministic. |
| topk | 80 | Top-k sampling parameter. Limits vocabulary to the k most likely tokens. |
| max_decoder_steps | 440 | Maximum number of decoder steps (frames to generate). |
| use_cfg | False | Enable Classifier-Free Guidance for improved quality. |
| cfg_scale | 2.5 | CFG scale factor. Higher = stronger guidance. |
| apply_attention_prior | False | Apply monotonic attention prior for alignment. |
| attention_prior_epsilon | 0.1 | Epsilon value for the attention prior. |
| use_local_transformer | False | Use the local transformer for codebook prediction. |
| maskgit_n_steps | 3 | Number of MaskGit refinement steps (if using MaskGit). |
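
With CFG enabled, each decoding step combines conditional and unconditional predictions scaled by cfg_scale. A common formulation is sketched below; whether MagpieTTS applies exactly this combination internally is an assumption.

# Common classifier-free guidance combination of conditional and unconditional logits.
# The exact formulation used inside MagpieTTS may differ.
import torch

def cfg_combine(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float = 2.5) -> torch.Tensor:
    # cfg_scale > 1 pushes the distribution further toward the conditioned prediction.
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)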

EOS Detection Methods

| Method | Description |
|---|---|
| argmax_any | Stop when any codebook predicts EOS via argmax |
| argmax_or_multinomial_any | Stop when any codebook predicts EOS (default) |
| argmax_all | Stop when all codebooks predict EOS via argmax |
| argmax_zero_cb | Stop when codebook 0 predicts EOS via argmax |
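
For example, the argmax_any rule reduces to checking whether any codebook's argmax is the EOS token; the eos_id below is a placeholder whose value depends on the embedding table layout.

# Sketch of the `argmax_any` EOS check: stop as soon as any codebook's argmax is EOS.
# `eos_id` is a placeholder; the actual id depends on the embedding table layout.
import torch

def should_stop_argmax_any(frame_logits: torch.Tensor, eos_id: int) -> bool:
    """frame_logits: [n_codebooks, vocab_size] for the current frame."""
    return bool((frame_logits.argmax(dim=-1) == eos_id).any())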

Evaluation Metrics

MagpieTTS evaluation includes multiple quality metrics:

| Metric | Description |
|---|---|
| CER | Character Error Rate - measures transcription accuracy |
| WER | Word Error Rate - measures transcription accuracy at word level |
| SSIM | Speaker Similarity - cosine similarity between speaker embeddings |
| UTMOSv2 | Mean Opinion Score prediction for audio quality |
| RTF | Real-Time Factor - inference speed metric |
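
CER and WER compare ASR transcripts of the generated audio against the input text; for instance, they can be computed with the jiwer package (the evaluation script's internal implementation may differ).

# Example CER/WER computation from an ASR transcript of the synthesized audio using jiwer.
# The inference script's own evaluation code may use a different implementation.
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown fax"   # ASR transcript of the generated audio

print(f"WER={jiwer.wer(reference, hypothesis):.3f} CER={jiwer.cer(reference, hypothesis):.3f}")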

Run evaluation with the inference script:

python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/model.nemo \
    --datasets libritts_test_clean \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --run_evaluation \
    --num_repeats 3 \
    --cer_target 0.1 \
    --ssim_target 0.8

Advanced Features

Frame Stacking

Frame stacking accelerates inference by having the base decoder process multiple consecutive audio frames in a single forward pass.

Overview

In this two-stage approach:

  1. The base decoder processes multiple frames at once, producing a single latent representation for each group (stack) of frames
  2. The Local Transformer then generates the individual tokens within each stack (frames × codebooks)

Configuration

Enable frame stacking by setting frame_stacking_factor > 1 in your YAML config:

model:
  frame_stacking_factor: 2  # Process 2 frames per decoder step
  local_transformer_type: "autoregressive"  # Required with frame stacking
  local_transformer_n_layers: 1
  local_transformer_n_heads: 1
  local_transformer_hidden_dim: 256
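
The reduction in base-decoder forward passes scales with the stacking factor; a quick sanity check using the nominal 21 Hz codec frame rate (a typical value noted earlier, not a guaranteed constant):

# Rough effect of frame stacking on base-decoder steps for a 10-second utterance.
import math

frame_rate_hz = 21
duration_s = 10.0
frame_stacking_factor = 2

n_frames = round(duration_s * frame_rate_hz)                    # 210 codec frames
n_decoder_steps = math.ceil(n_frames / frame_stacking_factor)   # 105 base-decoder forward passes
print(n_frames, n_decoder_steps)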

Speed Benefits

The Local Transformer is much faster than the base decoder due to:

  • Fewer parameters: The LT decoder is lightweight compared to the base decoder
  • Shorter sequences: The LT decoder only attends to the current frame stack and latent, not the entire sequence

Current Limitations

  • Online code extraction combined with frame-stacking is not yet implemented
  • Alignment encoder with frame-stacking is not yet tested
  • CTC loss with frame-stacking is not yet implemented

Preference Optimization (DPO/GRPO)

MagpieTTS supports both offline (DPO) and online (GRPO) preference optimization for improved output quality.

Offline Preference Alignment (DPO/RPO)

Step 1: Create text-context pairs

python scripts/magpietts/dpo/create_text_contextpairs.py \
    --challenging_texts /path/to/challenging_texts.txt \
    --regular_texts_for_audiocontext /path/to/regular_texts.txt \
    --regular_texts_for_textcontext /path/to/text_context_texts.txt \
    --audio_contexts /path/to/audio_contexts.json \
    --text_contexts /path/to/text_contexts.txt \
    --output_manifest /path/to/text_context_pairs.json \
    --nsamples_perpair 6

Step 2: Generate audios for each pair

python examples/tts/magpietts.py \
    --config-name=magpietts_po_inference \
    mode=test \
    batch_size=64 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    exp_manager.exp_dir=/path/to/po_exp \
    +test_ds_meta.textcontextpairs.manifest_path=/path/to/text_context_pairs.json \
    +test_ds_meta.textcontextpairs.audio_dir="/" \
    model.codecmodel_path=/path/to/codec.nemo

Step 3: Create chosen-rejected pairs

python scripts/magpietts/dpo/create_preference_pairs.py \
    --input_manifest /path/to/text_context_pairs.json \
    --generated_audio_dir /path/to/po_exp/audio \
    --group_size 6 \
    --cer_threshold 0.01 \
    --val_size 256

Step 4: DPO Finetuning

python examples/tts/magpietts.py \
    batch_size=4 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    +mode="dpo_train" \
    max_epochs=10 \
    +model.dpo_beta=0.01 \
    model.train_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
    model.validation_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
    +train_ds_meta.dpopreftrain.manifest_path=/path/to/dpo_train_manifest.json \
    model.optim.lr=2e-7

Online Preference Optimization (GRPO)

GRPO generates samples online and optimizes based on computed rewards (CER, SSIM, PESQ).

python examples/tts/magpietts.py \
    batch_size=2 \
    +init_from_ptl_ckpt=/path/to/base_checkpoint.ckpt \
    +mode="onlinepo_train" \
    +model.num_generations_per_item=12 \
    +model.reference_free=true \
    +model.cer_reward_weight=0.33 \
    +model.ssim_reward_weight=0.33 \
    +model.pesq_reward_weight=0.33 \
    +model.use_pesq=true \
    +model.loss_type="grpo" \
    +model.scale_rewards=true \
    +model.inference_temperature=0.8 \
    +model.use_kv_cache_during_online_po=true \
    model.optim.lr=1e-7 \
    trainer.precision=32 \
    +trainer.gradient_clip_val=2.5

Key GRPO Parameters:

| Parameter | Default | Description |
|---|---|---|
| num_generations_per_item | 12 | Number of samples generated per batch item |
| reference_free | True | Skip KL divergence loss term |
| cer_reward_weight | 0.33 | Weight for CER reward |
| ssim_reward_weight | 0.33 | Weight for speaker similarity reward |
| pesq_reward_weight | 0.33 | Weight for PESQ audio quality reward |
| loss_type | "grpo" | Use "grpo" or "dr_grpo" |
| scale_rewards | True | Normalize advantages by standard deviation |
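
The per-sample reward is a weighted combination of the individual rewards, and advantages are normalized within each group of generations for the same item. The sketch below is schematic, using the weights above; it is not the exact NeMo implementation.

# Schematic GRPO-style reward combination and group-normalized advantages.
# Reward values and shapes are illustrative; not the exact NeMo implementation.
import torch

def combined_reward(cer_reward, ssim_reward, pesq_reward,
                    cer_w=0.33, ssim_w=0.33, pesq_w=0.33):
    return cer_w * cer_reward + ssim_w * ssim_reward + pesq_w * pesq_reward

def group_advantages(rewards: torch.Tensor, scale_rewards: bool = True) -> torch.Tensor:
    """rewards: [num_generations_per_item] rewards for one batch item."""
    adv = rewards - rewards.mean()
    if scale_rewards:                    # normalize by std deviation when scale_rewards=True
        adv = adv / (rewards.std() + 1e-8)
    return adv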

Legacy Checkpoint Support

MagpieTTS has evolved its embedding table layout over time. For checkpoints created before April 2025, you may need to enable legacy mode.

Using Legacy Checkpoints

With the Inference Script

python examples/tts/magpietts_inference.py \
    --hparams_files /path/to/hparams.yaml \
    --checkpoint_files /path/to/old_checkpoint.ckpt \
    --codecmodel_path /path/to/codec.nemo \
    --legacy_codebooks \
    --legacy_text_conditioning \
    ...

With Hydra Command Line

# For decoder_context_tts models
python examples/tts/magpietts.py \
    ... \
    +model.forced_num_all_tokens_per_codebook=2048 \
    +model.forced_audio_eos_id=2047 \
    +model.forced_audio_bos_id=2046 \
    +model.forced_context_audio_eos_id=2045 \
    +model.forced_context_audio_bos_id=2044

# For other model types
python examples/tts/magpietts.py \
    ... \
    +model.forced_num_all_tokens_per_codebook=2048 \
    +model.forced_audio_eos_id=2047 \
    +model.forced_audio_bos_id=2046 \
    +model.forced_context_audio_eos_id=2047 \
    +model.forced_context_audio_bos_id=2046

Embedding Table Layouts

| Version | Layout | Notes |
|---|---|---|
| Legacy (pre-April 2025) | Special tokens at indices 2044-2047 | Use --legacy_codebooks |
| Current | Special tokens immediately after codec tokens (2016-2023) | Automatic |

Datasets

Supported Evaluation Datasets

| Dataset | Description |
|---|---|
| libritts_test_clean | LibriTTS test-clean subset |
| libritts_seen | LibriTTS seen speakers evaluation |
| vctk | VCTK multi-speaker dataset |
| riva_hard_digits | RIVA challenging digits |
| riva_hard_letters | RIVA challenging letters |
| riva_hard_money | RIVA challenging monetary values |
| riva_hard_short | RIVA challenging short utterances |

Manifest Format

MagpieTTS expects JSONL manifest files with the following format:

{
  "audio_filepath": "path/to/target_audio.wav",
  "text": "The transcript text",
  "duration": 5.2,
  "context_audio_filepath": "path/to/context_audio.wav",
  "context_audio_duration": 5.0
}
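
Each record occupies one line of the manifest file. A small helper for writing such a JSONL manifest is sketched below; the paths and durations are illustrative placeholders.

# Write a JSONL manifest where each line is one JSON record in the format above.
import json

records = [
    {
        "audio_filepath": "path/to/target_audio.wav",
        "text": "The transcript text",
        "duration": 5.2,
        "context_audio_filepath": "path/to/context_audio.wav",
        "context_audio_duration": 5.0,
    },
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")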

References

  1. NVIDIA NeMo Framework: https://github.com/NVIDIA/NeMo
  2. Fast Conformer with Linearly Scalable Attention: arXiv:2305.05084
  3. Attention Is All You Need: arXiv:1706.03762
  4. Direct Preference Optimization (DPO): arXiv:2305.18290
  5. Group Relative Policy Optimization (GRPO): arXiv:2402.03300

Discover More from NVIDIA

For documentation, deployment guides, enterprise-ready APIs, and the latest open models, including Nemotron and other cutting-edge speech, translation, and generative AI, visit the NVIDIA Developer Portal.


License: Apache 2.0 | Copyright © 2025 NVIDIA Corporation