Add codebase analysis documentation and update gitignore

- Added Rust build artifact patterns to .gitignore
- Included codebase exploration and analysis documents
- SOURCE_FILE_LISTING.txt: Complete Python source inventory
- DIRECTORY_STRUCTURE.txt: Project structure overview
- CODEBASE_ANALYSIS.md: Architecture and component analysis
- EXPLORATION_SUMMARY.md: Conversion planning notes

Files changed (5) hide show

.gitignore +4 -0
CODEBASE_ANALYSIS.md +594 -0
DIRECTORY_STRUCTURE.txt +224 -0
EXPLORATION_SUMMARY.md +283 -0
SOURCE_FILE_LISTING.txt +513 -0

.gitignore CHANGED Viewed

@@ -15,3 +15,7 @@ build/
 .venv
 checkpoints/*
 __MACOSX

 .venv
 checkpoints/*
 __MACOSX
+# Rust build artifacts
+/target/
+**/*.rs.bk

CODEBASE_ANALYSIS.md ADDED Viewed

	@@ -0,0 +1,594 @@

+# IndexTTS-Rust Comprehensive Codebase Analysis
+## Executive Summary
+**IndexTTS** is an **industrial-level, controllable, and efficient zero-shot Text-To-Speech (TTS) system** currently implemented in **Python** using PyTorch. The project is being converted to Rust (as indicated by the branch name `claude/convert-to-rust-01USgPYEqMyp5KXjjFNVwztU`).
+**Key Statistics:**
+- **Total Python Files:** 194
+- **Total Lines of Code:** ~25,000+ (not counting dependencies)
+- **Current Version:** IndexTTS 1.5 (latest with stability improvements, especially for English)
+- **No Rust code exists yet** - this is a fresh conversion project
+---
+## 1. PROJECT STRUCTURE
+### Root Directory Layout
+```
+IndexTTS-Rust/
+├── indextts/              # Main package (194 .py files)
+│   ├── gpt/               # GPT-based model implementation
+│   ├── BigVGAN/           # Vocoder for audio synthesis
+│   ├── s2mel/             # Semantic-to-Mel spectrogram conversion
+│   ├── utils/             # Text processing, feature extraction, utilities
+│   └── vqvae/             # Vector Quantized VAE components
+├── examples/              # Sample audio files and test cases
+├── tests/                 # Test files for regression testing
+├── tools/                 # Utility scripts and i18n support
+├── webui.py               # Gradio-based web interface (18KB)
+├── cli.py                 # Command-line interface
+├── requirements.txt       # Python dependencies
+└── archive/               # Historical documentation
+```
+---
+## 2. CURRENT IMPLEMENTATION (PYTHON)
+### Programming Language & Framework
+- **Language:** Python 3.x
+- **Deep Learning Framework:** PyTorch (primary dependency)
+- **Model Format:** HuggingFace compatible (.safetensors)
+### Key Dependencies (requirements.txt)
+| Dependency | Version | Purpose |
+|-----------|---------|---------|
+| torch | (implicit) | Deep learning framework |
+| transformers | 4.52.1 | HuggingFace transformers library |
+| librosa | 0.10.2.post1 | Audio processing |
+| numpy | 1.26.2 | Numerical computing |
+| accelerate | 1.8.1 | Distributed training/inference |
+| deepspeed | 0.17.1 | Inference optimization |
+| torchaudio | (implicit) | Audio I/O |
+| safetensors | 0.5.2 | Model serialization |
+| gradio | (latest) | Web UI framework |
+| modelscope | 1.27.0 | Model hub integration |
+| jieba | 0.42.1 | Chinese text tokenization |
+| g2p-en | 2.1.0 | English phoneme conversion |
+| sentencepiece | (latest) | BPE tokenization |
+| descript-audiotools | 0.7.2 | Audio manipulation |
+| cn2an | 0.5.22 | Chinese number normalization |
+| WeTextProcessing / wetext | (conditional) | Text normalization (Linux/macOS) |
+---
+## 3. MAIN FUNCTIONALITY - THE TTS PIPELINE
+### What IndexTTS Does
+**IndexTTS is a zero-shot multi-lingual TTS system that:**
+1. **Takes text input** (Chinese, English, or mixed)
+2. **Takes a voice reference audio** (speaker prompt)
+3. **Generates high-quality speech** in the speaker's voice
+4. **Supports multiple control mechanisms:**
+   - Pinyin-based pronunciation control (for Chinese)
+   - Pause control via punctuation
+   - Emotion vector manipulation (8 dimensions)
+   - Emotion text guidance via Qwen model
+   - Style reference audio
+### Core TTS Pipeline (infer_v2.py - 739 lines)
+```
+Input Text
+    ↓
+Text Normalization (TextNormalizer)
+    ├─ Chinese-specific normalization
+    ├─ English-specific normalization
+    ├─ Pinyin tone extraction/preservation
+    └─ Name entity handling
+    ↓
+Text Tokenization (TextTokenizer + SentencePiece)
+    ├─ CJK character handling
+    └─ BPE encoding
+    ↓
+Semantic Encoding (w2v-BERT model)
+    ├─ Input: Text tokens + Reference audio
+    ├─ Process: Semantic codec (RepCodec)
+    └─ Output: Semantic codes
+    ↓
+Speaker Conditioning
+    ├─ Extract features from reference audio
+    ├─ CAMPPlus speaker embedding
+    ├─ Emotion embedding (from reference or text)
+    └─ Mel spectrogram reference
+    ↓
+GPT-based Sequence Generation (UnifiedVoice)
+    ├─ Semantic tokens → Mel tokens
+    ├─ Conformer-based speaker conditioning
+    ├─ Perceiver-based attention pooling
+    └─ Emotion control via vectors or text
+    ↓
+Length Regulation (s2mel)
+    ├─ Acoustic code expansion
+    ├─ Flow matching for duration modeling
+    └─ CFM (Continuous Flow Matching) estimator
+    ↓
+BigVGAN Vocoder
+    ├─ Mel spectrogram → Waveform
+    ├─ Uses anti-aliased activation functions
+    ├─ Optional CUDA kernel optimization
+    └─ Optional DeepSpeed acceleration
+    ↓
+Output Audio Waveform (22050 Hz)
+```
+---
+## 4. KEY ALGORITHMS AND COMPONENTS NEEDING RUST CONVERSION
+### A. Text Processing Pipeline
+**TextNormalizer (front.py - ~500 lines)**
+- Chinese text normalization using WeTextProcessing/wetext
+- English text normalization
+- Pinyin tone extraction and preservation
+- Name entity detection and preservation
+- Character mapping and replacement
+- Pattern matching using regex
+**TextTokenizer (front.py - ~200 lines)**
+- SentencePiece BPE tokenization
+- CJK character tokenization
+- Special token handling (BOS, EOS, UNK)
+- Vocabulary management
+### B. Neural Network Components
+#### 1. **UnifiedVoice GPT Model** (model_v2.py - 747 lines)
+   - Multi-layer transformer (configurable depth)
+   - Speaker conditioning via Conformer encoder
+   - Perceiver resampler for attention pooling
+   - Emotion conditioning encoder
+   - Position embeddings (learned)
+   - Mel and text embeddings
+   - Final layer norm + linear output layer
+#### 2. **Conformer Encoder** (conformer_encoder.py - 520 lines)
+   - Conformer blocks with attention + convolution
+   - Multi-head self-attention with relative position bias
+   - Positionwise feed-forward networks
+   - Layer normalization
+   - Subsampling layers (Conv2d with various factors)
+   - Positional encoding (absolute and relative)
+#### 3. **Perceiver Resampler** (perceiver.py - 317 lines)
+   - Latent queries (learnable embeddings)
+   - Cross-attention with context
+   - Feed-forward networks
+   - Dimension projection
+#### 4. **BigVGAN Vocoder** (models.py - ~1000 lines)
+   - Multi-scale convolution blocks (AMPBlock1, AMPBlock2)
+   - Anti-aliased activation functions (Snake, SnakeBeta)
+   - Spectral normalization
+   - Transposed convolution upsampling
+   - Weight normalization
+   - Optional CUDA kernel for activation
+#### 5. **S2Mel (Semantic-to-Mel) Model** (s2mel/modules/)
+   - Flow matching / CFM (Continuous Flow Matching)
+   - Length regulator
+   - Diffusion transformer
+   - Acoustic codec quantization
+   - Style embeddings
+### C. Feature Extraction & Processing
+**Audio Processing (audio.py)**
+- Mel spectrogram computation using librosa
+- Hann windowing and STFT
+- Dynamic range compression/decompression
+- Spectral normalization
+**Semantic Models**
+- W2V-BERT (wav2vec 2.0 BERT) embeddings
+- RepCodec (semantic codec with vector quantization)
+- Amphion Codec encoders/decoders
+**Speaker Features**
+- CAMPPlus speaker embedding (192-dim)
+- Campplus model inference
+- Mel-based reference features
+### D. Model Loading & Configuration
+**Checkpoint Loading** (checkpoint.py - ~50 lines)
+- Model weight restoration from .safetensors/.pt files
+**HuggingFace Integration**
+- Model hub downloads
+- Configuration loading (OmegaConf)
+**Configuration System** (YAML-based)
+- Model architecture parameters
+- Training/inference settings
+- Dataset configuration
+- Vocoder settings
+---
+## 5. EXTERNAL MODELS USED
+### Pre-trained Models (Downloaded from HuggingFace)
+| Model | Source | Purpose | Size | Parameters |
+|-------|--------|---------|------|-----------|
+| IndexTTS-2 | IndexTeam/IndexTTS-2 | Main TTS model | ~2GB | Various checkpoints |
+| W2V-BERT-2.0 | facebook/w2v-bert-2.0 | Semantic feature extraction | ~1GB | 614M |
+| MaskGCT | amphion/MaskGCT | Semantic codec | - | - |
+| CAMPPlus | funasr/campplus | Speaker embedding | ~100MB | - |
+| BigVGAN v2 | nvidia/bigvgan_v2_22khz_80band_256x | Vocoder | ~100MB | - |
+| Qwen Model | (via modelscope) | Emotion text guidance | Variable | - |
+### Model Component Breakdown
+```
+Checkpoint Files Loaded:
+├── gpt_checkpoint.pth          # UnifiedVoice model weights
+├── s2mel_checkpoint.pth        # Semantic-to-Mel model
+├── bpe_model.model             # SentencePiece tokenizer
+├── emotion_matrix.pt           # Emotion embedding vectors (8-dim)
+├── speaker_matrix.pt           # Speaker embedding matrix
+├── w2v_stat.pt                 # Semantic model statistics (mean/std)
+├── qwen_emo_path/              # Qwen-based emotion detector
+└── vocoder config              # BigVGAN vocoder config
+```
+---
+## 6. INFERENCE MODES & CAPABILITIES
+### A. Single Text Generation
+```python
+tts.infer(
+    spk_audio_prompt="voice.wav",
+    text="Hello world",
+    output_path="output.wav",
+    emo_audio_prompt=None,      # Optional emotion reference
+    emo_alpha=1.0,              # Emotion weight
+    emo_vector=None,            # Direct emotion control [0-1 values]
+    use_emo_text=False,         # Generate emotion from text
+    emo_text=None,              # Text for emotion extraction
+    interval_silence=200        # Silence between segments (ms)
+)
+```
+### B. Batch/Fast Inference
+```python
+tts.infer_fast(...)  # Parallel segment generation
+```
+### C. Multi-language Support
+- **Chinese (Simplified & Traditional):** Full pinyin support
+- **English:** Phoneme-based
+- **Mixed:** Chinese + English in single utterance
+### D. Emotion Control Methods
+1. **Reference Audio:** Extract from emotion_audio_prompt
+2. **Emotion Vectors:** Direct 8-dimensional control
+3. **Text-based:** Use Qwen model to detect emotion from text
+4. **Speaker-based:** Use speaker's natural emotion
+### E. Punctuation-based Pausing
+- Periods, commas, question marks, exclamation marks trigger pauses
+- Pause duration controlled via configuration
+---
+## 7. MAJOR COMPONENTS BREAKDOWN
+### indextts/gpt/ (16,953 lines)
+**Purpose:** GPT-based sequence-to-sequence modeling
+**Files:**
+- `model_v2.py` (747L) - UnifiedVoice implementation, GPT2InferenceModel
+- `model.py` (713L) - Original model (v1)
+- `conformer_encoder.py` (520L) - Conformer speaker encoder
+- `perceiver.py` (317L) - Perceiver attention mechanism
+- `transformers_*.py` (~13,000L) - HuggingFace transformer implementations (customized)
+### indextts/BigVGAN/ (6+ files, ~1000+ lines)
+**Purpose:** Neural vocoder for mel-to-audio conversion
+**Key Files:**
+- `models.py` - BigVGAN architecture with AMPBlocks
+- `ECAPA_TDNN.py` - Speaker encoder
+- `activations.py` - Snake/SnakeBeta activation functions
+- `alias_free_activation/` - Anti-aliasing filters (CUDA + Torch versions)
+- `alias_free_torch/` - Pure PyTorch fallback
+- `nnet/` - Network modules (normalization, CNN, linear)
+### indextts/s2mel/ (~500+ lines)
+**Purpose:** Semantic tokens → Mel spectrogram conversion
+**Key Files:**
+- `modules/audio.py` - Mel spectrogram computation
+- `modules/commons.py` - Common utilities
+- `modules/layers.py` - Neural network layers
+- `modules/length_regulator.py` - Duration modeling
+- `modules/flow_matching.py` - Continuous flow matching
+- `modules/diffusion_transformer.py` - Diffusion-based generation
+- `modules/rmvpe.py` - Pitch extraction
+- `modules/bigvgan/` - BigVGAN vocoder
+- `dac/` - DAC (Descript Audio Codec)
+### indextts/utils/ (12+ files, ~500 lines)
+**Purpose:** Text processing, feature extraction, utilities
+**Key Files:**
+- `front.py` (700L) - TextNormalizer, TextTokenizer
+- `maskgct_utils.py` (250L) - Semantic codec builders
+- `arch_util.py` - Architecture utilities (AttentionBlock)
+- `checkpoint.py` - Model loading
+- `xtransformers.py` (1600L) - Transformer utilities
+- `feature_extractors.py` - Mel spectrogram features
+- `typical_sampling.py` - Sampling strategies
+- `maskgct/` - MaskGCT codec components (~100+ files)
+### indextts/utils/maskgct/ (~100+ Python files)
+**Purpose:** MaskGCT (Masked Generative Codec Transformer) implementation
+**Components:**
+- `models/codec/` - Various audio codecs (Amphion, FACodec, SpeechTokenizer, NS3, VEVo, KMeans)
+- `models/tts/maskgct/` - TTS-specific implementations
+- Multiple codec variants with quantization
+---
+## 8. CONFIGURATION & MODEL DOWNLOADING
+### Configuration System (OmegaConf YAML)
+Example config.yaml structure:
+```yaml
+gpt:
+  layers: 8
+  model_dim: 512
+  heads: 8
+  max_text_tokens: 120
+  max_mel_tokens: 250
+  stop_mel_token: 8193
+  conformer_config: {...}
+vocoder:
+  name: "nvidia/bigvgan_v2_22khz_80band_256x"
+s2mel:
+  checkpoint: "models/s2mel.pth"
+  preprocess_params:
+    sr: 22050
+    spect_params:
+      n_fft: 1024
+      hop_length: 256
+      n_mels: 80
+dataset:
+  bpe_model: "models/bpe.model"
+emotions:
+  num: [5, 6, 8, ...]  # Emotion vector counts per dimension
+w2v_stat: "models/w2v_stat.pt"
+```
+### Model Auto-download
+```python
+download_model_from_huggingface(
+    local_path="./checkpoints",
+    cache_path="./checkpoints/hf_cache"
+)
+```
+Preloads from HuggingFace:
+- IndexTeam/IndexTTS-2
+- amphion/MaskGCT
+- funasr/campplus
+- facebook/w2v-bert-2.0
+- nvidia/bigvgan_v2_22khz_80band_256x
+---
+## 9. INTERFACES
+### A. Command Line (cli.py - 64 lines)
+```bash
+python -m indextts.cli "Text to synthesize" \
+  -v voice_prompt.wav \
+  -o output.wav \
+  -c checkpoints/config.yaml \
+  --model_dir checkpoints \
+  --fp16 \
+  -d cuda:0
+```
+### B. Web UI (webui.py - 18KB)
+Gradio-based interface with:
+- Real-time inference
+- Multiple emotion control modes
+- Example cases loading
+- Language selection (Chinese/English)
+- Batch processing
+- Cache management
+### C. Python API (infer_v2.py)
+```python
+from indextts.infer_v2 import IndexTTS2
+tts = IndexTTS2(
+    cfg_path="checkpoints/config.yaml",
+    model_dir="checkpoints",
+    use_fp16=True,
+    device="cuda:0"
+)
+audio = tts.infer(
+    spk_audio_prompt="speaker.wav",
+    text="Hello",
+    output_path="output.wav"
+)
+```
+---
+## 10. CRITICAL ALGORITHMS TO IMPLEMENT
+### Priority 1: Core Inference Pipeline
+1. **Text Normalization** - Pattern matching, phoneme handling
+2. **Text Tokenization** - SentencePiece integration
+3. **Semantic Encoding** - W2V-BERT model inference
+4. **GPT Generation** - Token-by-token generation with sampling
+5. **Vocoder** - BigVGAN mel-to-audio conversion
+### Priority 2: Feature Extraction
+1. **Mel Spectrogram** - STFT, librosa filters
+2. **Speaker Embeddings** - CAMPPlus inference
+3. **Emotion Encoding** - Vector quantization
+4. **Audio Loading/Processing** - Resampling, normalization
+### Priority 3: Advanced Features
+1. **Conformer Encoding** - Complex attention mechanism
+2. **Perceiver Pooling** - Cross-attention mechanisms
+3. **Flow Matching** - Continuous diffusion
+4. **Length Regulation** - Duration prediction
+### Priority 4: Optional Optimizations
+1. **CUDA Kernels** - Anti-aliased activations
+2. **DeepSpeed Integration** - Model parallelism
+3. **KV Cache** - Inference optimization
+---
+## 11. DATA FLOW EXAMPLE
+```
+Input: text="你好", voice="speaker.wav", emotion="happy"
+1. TextNormalizer.normalize("你好")
+   → "你好" (no change needed)
+2. TextTokenizer.encode("你好")
+   → [token_id_1, token_id_2, ...]
+3. Audio Loading & Processing:
+   - Load speaker.wav → 22050 Hz
+   - Extract W2V-BERT features
+   - Get semantic codes via RepCodec
+   - Extract CAMPPlus embedding (192-dim)
+   - Compute mel spectrogram
+4. Emotion Processing:
+   - If emotion vector: scale by emotion_alpha
+   - If emotion audio: extract embeddings
+   - Create emotion conditioning
+5. GPT Generation:
+   - Input: [semantic_codes, text_tokens]
+   - Output: mel_tokens (variable length)
+6. Length Regulation (s2mel):
+   - Input: mel_tokens + speaker_style
+   - Output: acoustic_codes (fine-grained tokens)
+7. BigVGAN Vocoding:
+   - Input: acoustic_codes → mel_spectrogram
+   - Output: waveform at 22050 Hz
+8. Post-processing:
+   - Optional silence insertion
+   - Audio normalization
+   - WAV file writing
+```
+---
+## 12. TESTING
+### Regression Tests (regression_test.py)
+Tests various scenarios:
+- Chinese text with pinyin tones
+- English text
+- Mixed Chinese/English
+- Long-form text
+- Names and entities
+- Special punctuation
+### Padding Tests (padding_test.py)
+- Variable length input handling
+- Batch processing
+- Edge cases
+---
+## 13. FILE STATISTICS SUMMARY
+| Category | Count | Lines |
+|----------|-------|-------|
+| Python Files | 194 | ~25,000+ |
+| GPT Module | 9 | 16,953 |
+| BigVGAN | 6+ | ~1,000+ |
+| Utils | 12+ | ~500 |
+| MaskGCT | 100+ | ~10,000+ |
+| S2Mel | 10+ | ~2,000+ |
+| Root Level | 3 | 730 |
+---
+## 14. KEY TECHNICAL CHALLENGES FOR RUST CONVERSION
+1. **PyTorch Model Loading** → Need ONNX export or custom binary format
+2. **Text Normalization Libraries** → May need Rust bindings or reimplementation
+3. **Complex Attention Mechanisms** → Transformers, Perceiver, Conformer
+4. **Mel Spectrogram Computation** → STFT, librosa filter banks
+5. **Quantization & Codecs** → Multiple codec implementations
+6. **Large Model Inference** → Optimization, batching, caching
+7. **CUDA Kernels** → Custom activation functions (if needed)
+8. **Web Server Integration** → Replace Gradio with Rust web framework
+---
+## 15. DEPENDENCY CONVERSION ROADMAP
+| Python Library | Rust Alternative | Priority |
+|---|---|---|
+| torch/transformers | ort, tch-rs, candle | Critical |
+| librosa | rustfft, dasp_signal | Critical |
+| sentencepiece | sentencepiece, tokenizers | Critical |
+| numpy | ndarray, nalgebra | Critical |
+| jieba | jieba-rs | High |
+| torchaudio | dasp, wav, hound | High |
+| gradio | actix-web, rocket, axum | Medium |
+| OmegaConf | serde, config-rs | Medium |
+| safetensors | safetensors-rs | High |
+---
+## Summary
+IndexTTS is a sophisticated, state-of-the-art TTS system with:
+- **194 Python files** across multiple specialized modules
+- **Multi-stage processing pipeline** from text to audio
+- **Advanced neural architectures** (Conformer, Perceiver, GPT, BigVGAN)
+- **Multi-language support** with emotion control
+- **Production-ready** with web UI and CLI interfaces
+- **Heavy reliance on PyTorch** and HuggingFace ecosystems
+- **Large external models** requiring careful integration
+The Rust conversion will require careful translation of:
+1. Complex text processing pipelines
+2. Neural network inference engines
+3. Audio DSP operations
+4. Model loading and management
+5. Web interface integration

DIRECTORY_STRUCTURE.txt ADDED Viewed

	@@ -0,0 +1,224 @@

+IndexTTS-Rust/ (Complete Directory Structure)
+│
+├── indextts/                                    # Main Python package (194 files)
+│   │
+│   ├── __init__.py                              # Package initialization
+│   ├── cli.py                                   # Command-line interface (64 lines)
+│   ├── infer.py                                 # Original inference (v1) - 690 lines
+│   ├── infer_v2.py                              # Main inference v2 - 739 lines ⭐⭐⭐
+│   │
+│   ├── gpt/                                     # GPT-based TTS model (9 files, 16,953 lines)
+│   │   ├── __init__.py
+│   │   ├── model.py                             # Original UnifiedVoice (713L)
+│   │   ├── model_v2.py                          # UnifiedVoice v2 ⭐⭐⭐ (747L)
+│   │   ├── conformer_encoder.py                 # Conformer encoder ⭐⭐ (520L)
+│   │   ├── perceiver.py                         # Perceiver resampler (317L)
+│   │   ├── conformer_encoder.py                 # Conformer components
+│   │   ├── transformers_gpt2.py                 # GPT2 implementation (1,878L)
+│   │   ├── transformers_generation_utils.py     # Generation utilities (4,747L)
+│   │   ├── transformers_beam_search.py          # Beam search (1,013L)
+│   │   └── transformers_modeling_utils.py       # Model utilities (5,525L)
+│   │
+│   ├── BigVGAN/                                 # Neural Vocoder (6+ files, ~1000+ lines)
+│   │   ├── __init__.py
+│   │   ├── models.py                            # BigVGAN architecture ⭐⭐⭐
+│   │   ├── ECAPA_TDNN.py                        # Speaker encoder
+│   │   ├── activations.py                       # Snake, SnakeBeta activations
+│   │   ├── utils.py                             # Helper functions
+│   │   │
+│   │   ├── alias_free_activation/               # CUDA kernel variants
+│   │   │   ├── cuda/
+│   │   │   │   ├── activation1d.py              # CUDA kernel loader
+│   │   │   │   └── load.py
+│   │   │   └── torch/
+│   │   │       ├── act.py                       # PyTorch activation
+│   │   │       ├── filter.py                    # Anti-aliasing filter
+│   │   │       └── resample.py                  # Resampling
+│   │   │
+│   │   ├── alias_free_torch/                    # PyTorch-only fallback
+│   │   │   ├── act.py
+│   │   │   ├── filter.py
+│   │   │   └── resample.py
+│   │   │
+│   │   └── nnet/                                # Network modules
+│   │       ├── linear.py
+│   │       ├── normalization.py
+│   │       └── CNN.py
+│   │
+│   ├── s2mel/                                   # Semantic-to-Mel Models (~500+ lines)
+│   │   ├── modules/                             # Core modules (10+ files)
+│   │   │   ├── audio.py                         # Mel-spectrogram computation ⭐
+│   │   │   ├── commons.py                       # Common utilities (21KB)
+│   │   │   ├── layers.py                        # NN layers (13KB)
+│   │   │   ├── length_regulator.py              # Duration modeling
+│   │   │   ├── flow_matching.py                 # Continuous flow matching
+│   │   │   ├── diffusion_transformer.py         # Diffusion model
+│   │   │   ├── rmvpe.py                         # Pitch extraction (22KB)
+│   │   │   ├── quantize.py                      # Quantization
+│   │   │   ├── encodec.py                       # EnCodec codec
+│   │   │   ├── wavenet.py                       # WaveNet implementation
+│   │   │   │
+│   │   │   ├── bigvgan/                         # BigVGAN vocoder
+│   │   │   │   ├── modules.py
+│   │   │   │   ├── config.json
+│   │   │   │   ├── bigvgan.py
+│   │   │   │   ├── alias_free_activation/      # Variants
+│   │   │   │   └── models.py
+│   │   │   │
+│   │   │   ├── vocos/                           # Vocos codec
+│   │   │   ├── hifigan/                         # HiFiGAN vocoder
+│   │   │   ├── openvoice/                       # OpenVoice components (11 files)
+│   │   │   ├── campplus/                        # CAMPPlus speaker encoder
+│   │   │   │   └── DTDNN.py                     # DTDNN architecture
+│   │   │   └── gpt_fast/                        # Fast GPT inference
+│   │   │
+│   │   ├── dac/                                 # DAC codec
+│   │   │   ├── model/
+│   │   │   ├── nn/
+│   │   │   └── utils/
+│   │   │
+│   │   └─��� (other s2mel implementations)
+│   │
+│   ├── utils/                                   # Text & Feature Utils (12+ files, ~500L)
+│   │   ├── __init__.py
+│   │   ├── front.py                             # TextNormalizer, TextTokenizer ⭐⭐⭐ (700L)
+│   │   ├── maskgct_utils.py                     # Semantic codec builders (250L)
+│   │   ├── arch_util.py                         # AttentionBlock, utilities
+│   │   ├── checkpoint.py                        # Model loading
+│   │   ├── xtransformers.py                     # Transformer utils (1,600L)
+│   │   ├── feature_extractors.py                # MelSpectrogramFeatures
+│   │   ├── common.py                            # Common functions
+│   │   ├── text_utils.py                        # Text utilities
+│   │   ├── typical_sampling.py                  # TypicalLogitsWarper sampling
+│   │   ├── utils.py                             # General utils
+│   │   ├── webui_utils.py                       # Web UI helpers
+│   │   ├── tagger_cache/                        # Text normalization cache
+│   │   │
+│   │   └── maskgct/                             # MaskGCT codec (100+ files, 10KB+)
+│   │       └── models/
+│   │           ├── codec/                       # Multiple codec implementations
+│   │           │   ├── amphion_codec/           # Amphion codec
+│   │           │   │   ├── codec.py
+│   │           │   │   ├── vocos.py
+│   │           │   │   └── quantize/            # Quantization
+│   │           │   │       ├── vector_quantize.py
+│   │           │   │       ├── residual_vq.py
+│   │           │   │       ├── factorized_vector_quantize.py
+│   │           │   │       └── lookup_free_quantize.py
+│   │           │   │
+│   │           │   ├── facodec/                 # FACodec variant
+│   │           │   │   ├── facodec_inference.py
+│   │           │   │   ├── modules/
+│   │           │   │   │   ├── commons.py
+│   │           │   │   │   ├── attentions.py
+│   │           │   │   │   ├── layers.py
+│   │           │   │   │   ├── quantize.py
+│   │           │   │   │   ├── wavenet.py
+│   │           │   │   │   ├── style_encoder.py
+│   │           │   │   │   ├── gradient_reversal.py
+│   │           │   │   │   └── JDC/ (pitch detection)
+│   │           │   │   └── alias_free_torch/    # Anti-aliasing
+│   │           │   │
+│   │           │   ├── speechtokenizer/         # Speech Tokenizer codec
+│   │           │   │   ├── model.py
+│   │           │   │   └── modules/
+│   │           │   │       ├── seanet.py
+│   │           │   │       ├── lstm.py
+│   │           │   │       ├── norm.py
+│   │           │   │       ├── conv.py
+│   │           │   │       └── quantization/
+│   │           │   │
+│   │           │   ├── ns3_codec/                # NS3 codec variant
+│   │           │   ├── vevo/                     # VEVo codec
+│   │           │   ├── kmeans/                   # KMeans codec
+│   │           │   ├── melvqgan/                 # MelVQ-GAN codec
+│   │           │   │
+│   │           │   ├── codec_inference.py
+│   │           │   ├── codec_sampler.py
+│   │           │   ├── codec_trainer.py
+│   │           │   └── codec_dataset.py
+│   │           │
+│   │           └── tts/
+│   │               └── maskgct/
+│   │                   ├── maskgct_s2a.py        # Semantic-to-acoustic
+│   │                   └── ckpt/
+│   │
+│   └── vqvae/                                   # Vector Quantized VAE
+│       ├── xtts_dvae.py                         # Discrete VAE (currently disabled)
+│       └── (other VAE components)
+│
+├── examples/                                    # Sample Data & Test Cases
+│   ├── cases.jsonl                              # Example test cases
+│   ├── voice_*.wav                              # Sample voice prompts (12 files)
+│   ├── emo_*.wav                                # Emotion reference samples (2 files)
+│   └── sample_prompt.wav                        # Default prompt (implied)
+│
+├── tests/                                       # Test Suite
+│   ├── regression_test.py                       # Main regression tests ⭐
+│   └── padding_test.py                          # Padding/batch tests
+│
+├── tools/                                       # Utility Scripts & i18n
+│   ├── download_files.py                        # Model downloading from HF
+│   └── i18n/                                    # Internationalization
+│       ├── i18n.py                              # Translation system
+│       ├── scan_i18n.py                         # i18n scanner
+│       └── locale/
+│           ├── en_US.json                       # English translations
+│           └── zh_CN.json                       # Chinese translations
+│
+├── archive/                                     # Historical Docs
+│   └── README_INDEXTTS_1_5.md                   # IndexTTS 1.5 documentation
+│
+├── webui.py                                     # Gradio Web UI ⭐⭐⭐ (18KB)
+├── cli.py                                       # Command-line interface
+├── requirements.txt                             # Python dependencies
+├── MANIFEST.in                                  # Package manifest
+├── .gitignore                                   # Git ignore rules
+├── .gitattributes                               # Git attributes
+└── LICENSE                                      # Apache 2.0 License
+═══════════════════════════════════════════════════════════════════════════════
+KEY FILES BY IMPORTANCE:
+═══════════════════════════════════════════════════════════════════════════════
+⭐⭐⭐ CRITICAL (Core Logic - MUST Convert First)
+  1. indextts/infer_v2.py              - Main inference pipeline (739L)
+  2. indextts/gpt/model_v2.py          - UnifiedVoice GPT model (747L)
+  3. indextts/utils/front.py           - Text processing (700L)
+  4. indextts/BigVGAN/models.py        - Vocoder (1000+L)
+  5. indextts/s2mel/modules/audio.py   - Mel-spectrogram (83L, critical DSP)
+⭐⭐ HIGH PRIORITY (Major Components)
+  1. indextts/gpt/conformer_encoder.py - Conformer blocks (520L)
+  2. indextts/gpt/perceiver.py         - Perceiver attention (317L)
+  3. indextts/utils/maskgct_utils.py   - Codec builders (250L)
+  4. indextts/s2mel/modules/commons.py - Common utilities (21KB)
+⭐ MEDIUM PRIORITY (Utilities & Optimization)
+  1. indextts/utils/xtransformers.py   - Transformer utils (1,600L)
+  2. indextts/BigVGAN/activations.py   - Activation functions
+  3. indextts/s2mel/modules/rmvpe.py   - Pitch extraction (22KB)
+OPTIONAL (Web UI, Tools)
+  1. webui.py                          - Gradio interface
+  2. tools/download_files.py           - Model downloading
+═══════════════════════════════════════════════════════════════════════════════
+TOTAL STATISTICS:
+═══════════════════════════════════════════════════════════════════════════════
+Total Python Files:        194
+Total Lines of Code:       ~25,000+
+GPT Module:                16,953 lines
+MaskGCT Codecs:            ~10,000+ lines
+S2Mel Models:              ~2,000+ lines
+BigVGAN:                   ~1,000+ lines
+Utils:                     ~500 lines
+Tests:                     ~100 lines
+Models Supported:          6 major HuggingFace models
+Languages:                 Chinese (full), English (full), Mixed
+Emotion Dimensions:        8-dimensional emotion control
+Audio Sample Rate:         22,050 Hz (primary)
+Max Text Tokens:           120
+Max Mel Tokens:            250
+Mel Spectrogram Bins:      80

EXPLORATION_SUMMARY.md ADDED Viewed

	@@ -0,0 +1,283 @@

+# IndexTTS-Rust Codebase Exploration - Complete Summary
+## Overview
+I have conducted a **comprehensive exploration** of the IndexTTS-Rust codebase. This is a sophisticated zero-shot multi-lingual Text-to-Speech (TTS) system currently implemented in Python that is being converted to Rust.
+## Key Findings
+### Project Status
+- **Current State**: Pure Python implementation with PyTorch backend
+- **Target State**: Rust implementation (conversion in progress)
+- **Files**: 194 Python files across multiple specialized modules
+- **Code Volume**: ~25,000+ lines of Python code
+- **No Rust code exists yet** - this is a fresh rewrite opportunity
+### What IndexTTS Does
+IndexTTS is an **industrial-level text-to-speech system** that:
+1. Takes text input (Chinese, English, or mixed languages)
+2. Takes a reference speaker audio file (voice prompt)
+3. Generates high-quality speech in the speaker's voice with:
+   - Pinyin-based pronunciation control (for Chinese)
+   - Emotion control via 8-dimensional emotion vectors
+   - Text-based emotion guidance (via Qwen model)
+   - Punctuation-based pause control
+   - Style reference audio support
+### Performance Metrics
+- **Best in class**: WER 0.821 on Chinese test set, 1.606 on English
+- **Outperforms**: SeedTTS, CosyVoice2, F5-TTS, MaskGCT, others
+- **Multi-language**: Full Chinese + English support, mixed language support
+- **Speed**: Parallel inference available, batch processing support
+## Architecture Overview
+### Main Pipeline Flow
+```
+Text Input
+    ↓ (TextNormalizer)
+Normalized Text
+    ↓ (TextTokenizer + SentencePiece)
+Text Tokens
+    ↓ (W2V-BERT)
+Semantic Embeddings
+    ↓ (RepCodec)
+Semantic Codes + Speaker Features (CAMPPlus) + Emotion Vectors
+    ↓ (UnifiedVoice GPT Model)
+Mel-spectrogram Tokens
+    ↓ (S2Mel Length Regulator)
+Acoustic Codes
+    ↓ (BigVGAN Vocoder)
+Audio Waveform (22,050 Hz)
+```
+## Critical Components to Convert
+### Priority 1: MUST Convert First (Core Pipeline)
+1. **infer_v2.py** (739 lines) - Main inference orchestration
+2. **model_v2.py** (747 lines) - UnifiedVoice GPT model
+3. **front.py** (700 lines) - Text normalization and tokenization
+4. **BigVGAN/models.py** (1000+ lines) - Neural vocoder
+5. **s2mel/modules/audio.py** (83 lines) - Mel-spectrogram DSP
+### Priority 2: High Priority (Major Components)
+1. **conformer_encoder.py** (520 lines) - Speaker encoder
+2. **perceiver.py** (317 lines) - Attention pooling mechanism
+3. **maskgct_utils.py** (250 lines) - Semantic codec builders
+4. Various supporting modules for codec and transformer utilities
+### Priority 3: Medium Priority (Optimization & Utilities)
+1. Advanced transformer utilities
+2. Activation functions and filters
+3. Pitch extraction and flow matching
+4. Optional CUDA kernels for optimization
+## Technology Stack
+### Current (Python)
+- **Framework**: PyTorch (inference only)
+- **Text Processing**: SentencePiece, WeTextProcessing, regex
+- **Audio**: librosa, torchaudio, scipy
+- **Models**: HuggingFace Transformers
+- **Web UI**: Gradio
+### Pre-trained Models (6 Major)
+1. **IndexTTS-2** (~2GB) - Main TTS model
+2. **W2V-BERT-2.0** (~1GB) - Semantic features
+3. **MaskGCT** - Semantic codec
+4. **CAMPPlus** (~100MB) - Speaker embeddings
+5. **BigVGAN v2** (~100MB) - Vocoder
+6. **Qwen** (variable) - Emotion detection
+## File Organization
+### Core Modules
+- **indextts/gpt/** - GPT-based sequence generation (9 files, 16,953 lines)
+- **indextts/BigVGAN/** - Neural vocoder (6+ files, 1000+ lines)
+- **indextts/s2mel/** - Semantic-to-mel models (10+ files, 2000+ lines)
+- **indextts/utils/** - Text processing and utilities (12+ files, 500 lines)
+- **indextts/utils/maskgct/** - MaskGCT codecs (100+ files, 10000+ lines)
+### Interfaces
+- **webui.py** (18KB) - Gradio web interface
+- **cli.py** (64 lines) - Command-line interface
+- **infer.py/infer_v2.py** - Python API
+### Data & Config
+- **examples/** - Sample audio files and test cases
+- **tests/** - Regression and padding tests
+- **tools/** - Model downloading and i18n support
+## Detailed Documentation Generated
+Three comprehensive documents have been created and saved to the repository:
+1. **CODEBASE_ANALYSIS.md** (19 KB)
+   - Executive summary
+   - Complete project structure
+   - Current implementation details
+   - TTS pipeline explanation
+   - Algorithms and components breakdown
+   - Inference modes and capabilities
+   - Dependency conversion roadmap
+2. **DIRECTORY_STRUCTURE.txt** (14 KB)
+   - Complete file tree with annotations
+   - Files grouped by importance (⭐⭐⭐, ⭐⭐, ⭐)
+   - Line counts for each file
+   - Statistics summary
+3. **SOURCE_FILE_LISTING.txt** (23 KB)
+   - Detailed file-by-file breakdown
+   - Classes and methods for each major file
+   - Parameter specifications
+   - Algorithm descriptions
+   - Dependencies for each component
+## Key Technical Challenges for Rust Conversion
+### High Complexity
+1. **PyTorch Model Loading** - Need ONNX export or custom format
+2. **Complex Attention Mechanisms** - Transformers, Perceiver, Conformer
+3. **Text Normalization Libraries** - May need Rust bindings or reimplementation
+4. **Mel Spectrogram Computation** - STFT, mel filterbank calculations
+### Medium Complexity
+1. **Quantization & Codecs** - Multiple codec implementations to translate
+2. **Large Model Inference** - Optimization, batching, caching required
+3. **Audio DSP** - Resampling, filtering, spectral operations
+### Optimization (Optional)
+1. CUDA kernels for anti-aliased activations
+2. DeepSpeed integration for model parallelism
+3. KV cache for inference optimization
+## Recommended Rust Libraries
+| Component | Python Library | Rust Alternative |
+|---|---|---|
+| Model Inference | torch/transformers | **ort**, tch-rs, candle |
+| Audio Processing | librosa | rustfft, dasp_signal |
+| Text Tokenization | sentencepiece | sentencepiece (Rust binding) |
+| Numerical Computing | numpy | **ndarray**, nalgebra |
+| Chinese Text | jieba | **jieba-rs** |
+| Audio I/O | torchaudio | hound, wav |
+| Web Server | Gradio | **axum**, actix-web |
+| Config Files | OmegaConf YAML | **serde**, config-rs |
+| Model Format | safetensors | **safetensors-rs** |
+## Data Flow Example
+### Input
+- Text: "你好" (Chinese for "Hello")
+- Speaker Audio: "speaker.wav" (voice reference)
+- Emotion: "happy" (optional)
+### Processing Steps
+1. Text Normalization → "你好" (no change)
+2. Text Tokenization → [token_1, token_2, ...]
+3. Audio Loading & Mel-spectrogram computation
+4. W2V-BERT semantic embedding extraction
+5. Speaker feature extraction (CAMPPlus)
+6. Emotion vector generation
+7. GPT generation of mel-tokens
+8. Length regulation for acoustic codes
+9. BigVGAN vocoding
+10. Audio output at 22,050 Hz
+### Output
+- Waveform: "output.wav" (high-quality speech)
+## Test Coverage
+### Regression Tests Available
+- Chinese text with pinyin tones
+- English text
+- Mixed Chinese-English
+- Long-form text passages
+- Named entities (proper nouns)
+- Special punctuation handling
+## Performance Characteristics
+### Speed
+- Single inference: ~2-5 seconds per sentence (GPU)
+- Batch/fast inference: Parallel processing available
+- Caching: Speaker features and mel spectrograms are cached
+### Quality
+- 22,050 Hz sample rate (CD-quality audio)
+- 80-dimensional mel-spectrogram
+- 8-channel emotion control
+- Natural speech synthesis with speaker similarity
+### Model Parameters
+- GPT Model: 8 layers, 512 dims, 8 heads
+- Max text tokens: 120
+- Max mel tokens: 250
+- Mel spectrogram bins: 80
+- Emotion dimensions: 8
+## Next Steps for Rust Conversion
+### Phase 1: Foundation
+1. Set up Rust project structure
+2. Create model loading infrastructure (ONNX or binary format)
+3. Implement basic tensor operations using ndarray/candle
+### Phase 2: Core Pipeline
+1. Implement text normalization (regex + patterns)
+2. Implement SentencePiece tokenization
+3. Create mel-spectrogram DSP module
+4. Implement BigVGAN vocoder
+### Phase 3: Neural Components
+1. Implement transformer layers
+2. Implement Conformer encoder
+3. Implement Perceiver resampler
+4. Implement GPT generation
+### Phase 4: Integration
+1. Integrate all components
+2. Create CLI interface
+3. Create REST API or server interface
+4. Optimize and profile
+### Phase 5: Testing & Deployment
+1. Regression testing
+2. Performance benchmarking
+3. Documentation
+4. Deployment optimization
+## Summary Statistics
+- **Total Files Analyzed**: 194 Python files
+- **Total Lines of Code**: ~25,000+
+- **Architecture Depth**: 5 major pipeline stages
+- **External Models**: 6 HuggingFace models
+- **Languages Supported**: 2 (Chinese, English, with mixed support)
+- **Dimensions**: Text tokens, mel tokens, emotion vectors, speaker embeddings
+- **DSP Operations**: STFT, mel filterbanks, upsampling, convolution
+- **AI Techniques**: Transformers, Conformers, Perceiver pooling, diffusion-based generation
+## Conclusion
+IndexTTS is a **production-ready, state-of-the-art TTS system** with sophisticated architecture and multiple advanced features. The codebase is well-organized with clear separation of concerns, making it suitable for conversion to Rust. The main challenges will be:
+1. **Model Loading**: Handling PyTorch model weights in Rust
+2. **Text Processing**: Ensuring accuracy in pattern matching and normalization
+3. **Neural Architecture**: Correctly implementing complex attention mechanisms
+4. **Audio DSP**: Precise STFT and mel-spectrogram computation
+With careful planning and the right library selection, a full Rust conversion is feasible and would offer significant performance benefits and easier deployment.
+---
+## Documentation Files
+All analysis has been saved to the repository:
+- `CODEBASE_ANALYSIS.md` - Comprehensive technical analysis
+- `DIRECTORY_STRUCTURE.txt` - Complete file tree
+- `SOURCE_FILE_LISTING.txt` - Detailed component breakdown
+- `EXPLORATION_SUMMARY.md` - This file

SOURCE_FILE_LISTING.txt ADDED Viewed

	@@ -0,0 +1,513 @@

+╔════════════════════════════════════════════════════════════════════════════════╗
+║              DETAILED SOURCE FILE LISTING BY CATEGORY                          ║
+╚════════════════════════════════════════════════════════════════════════════════╝
+MAIN INFERENCE PIPELINE FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/infer_v2.py (739 LINES) ⭐⭐⭐ CRITICAL
+├─ Purpose: Main TTS inference class (IndexTTS2)
+├─ Key Classes:
+│  ├─ QwenEmotion (emotion text-to-vector conversion)
+│  ├─ IndexTTS2 (main inference class)
+│  └─ Helper functions for emotion/audio processing
+├─ Key Methods:
+│  ├─ __init__() - Initialize all models and codecs
+│  ├─ infer() - Single text generation with emotion control
+│  ├─ infer_fast() - Parallel segment generation
+│  ├─ get_emb() - Extract semantic embeddings
+│  ├─ remove_long_silence() - Silence token removal
+│  ├─ insert_interval_silence() - Silence insertion
+│  └─ Cache management for repeated generation
+├─ Models Loaded:
+│  ├─ UnifiedVoice (GPT model for mel token generation)
+│  ├─ W2V-BERT (semantic feature extraction)
+│  ├─ RepCodec (semantic codec)
+│  ├─ S2Mel model (semantic-to-mel conversion)
+│  ├─ CAMPPlus (speaker embedding)
+│  ├─ BigVGAN vocoder
+│  ├─ Qwen-based emotion model
+│  └─ Emotion/speaker matrices
+└─ External Dependencies: torch, transformers, librosa, safetensors
+/home/user/IndexTTS-Rust/webui.py (18KB) ⭐⭐⭐ WEB INTERFACE
+├─ Purpose: Gradio-based web UI for IndexTTS
+├─ Key Components:
+│  ├─ Model initialization (IndexTTS2 instance)
+│  ├─ Language selection (Chinese/English)
+│  ├─ Emotion control modes (4 modes)
+│  ├─ Example case loading from cases.jsonl
+│  ├─ Progress bar integration
+│  └─ Output management
+├─ Features:
+│  ├─ Real-time inference
+│  ├─ Multiple emotion control methods
+│  ├─ Batch processing
+│  ├─ Task caching
+│  ├─ i18n support
+│  └─ Pre-loaded example cases
+└─ Web Framework: Gradio 5.34.1
+/home/user/IndexTTS-Rust/indextts/cli.py (64 LINES)
+├─ Purpose: Command-line interface
+├─ Usage: python -m indextts.cli <text> -v <voice.wav> -o <output.wav> [options]
+├─ Arguments:
+│  ├─ text: Text to synthesize
+│  ├─ -v/--voice: Voice reference audio
+│  ├─ -o/--output_path: Output file path
+│  ├─ -c/--config: Config file path
+│  ├─ --model_dir: Model directory
+│  ├─ --fp16: Use FP16 precision
+│  ├─ -d/--device: Device (cpu/cuda/mps/xpu)
+│  └─ -f/--force: Force overwrite
+└─ Uses: IndexTTS (v1 model)
+TEXT PROCESSING & NORMALIZATION FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/utils/front.py (700 LINES) ⭐⭐⭐ CRITICAL
+├─ Purpose: Text normalization and tokenization
+├─ Key Classes:
+│  ├─ TextNormalizer (700+ lines)
+│  │  ├─ Pattern Definitions:
+│  │  │  ├─ PINYIN_TONE_PATTERN (regex for pinyin with tones 1-5)
+│  │  │  ├─ NAME_PATTERN (regex for Chinese names)
+│  │  │  └─ ENGLISH_CONTRACTION_PATTERN (regex for 's contractions)
+│  │  ├─ Methods:
+│  │  │  ├─ normalize() - Main normalization
+│  │  │  ├─ use_chinese() - Language detection
+│  │  │  ├─ save_pinyin_tones() - Extract pinyin with tones
+│  │  │  ├─ restore_pinyin_tones() - Restore pinyin
+│  │  │  ├─ save_names() - Extract names
+│  │  │  ├─ restore_names() - Restore names
+│  │  │  ├─ correct_pinyin() - Phoneme correction (jqx→v)
+│  │  │  └─ char_rep_map - Character replacement dictionary
+│  │  └─ Normalizers:
+│  │     ├─ zh_normalizer (Chinese) - Uses WeTextProcessing/wetext
+│  │     └─ en_normalizer (English) - Uses tn library
+│  │
+│  └─ TextTokenizer (200+ lines)
+│     ├─ Methods:
+│     │  ├─ encode() - Text to token IDs
+│     │  ├─ decode() - Token IDs to text
+│     │  ├─ convert_tokens_to_ids()
+│     │  ├─ convert_ids_to_tokens()
+│     │  └─ Vocab management
+│     ├─ Special Tokens:
+│     │  ├�� BOS: "<s>" (ID 0)
+│     │  ├─ EOS: "</s>" (ID 1)
+│     │  └─ UNK: "<unk>"
+│     └─ Tokenizer: SentencePiece (BPE-based)
+├─ Language Support:
+│  ├─ Chinese (simplified & traditional)
+│  ├─ English
+│  └─ Mixed Chinese-English
+└─ Critical Pattern Matching:
+   ├─ Pinyin tone detection
+   ├─ Name entity detection
+   ├─ Email matching
+   ├─ Character replacement
+   └─ Punctuation handling
+GPT MODEL ARCHITECTURE FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/gpt/model_v2.py (747 LINES) ⭐⭐⭐ CRITICAL
+├─ Purpose: UnifiedVoice GPT-based TTS model
+├─ Key Classes:
+│  ├─ UnifiedVoice (700+ lines)
+│  │  ├─ Architecture:
+│  │  │  ├─ Input Embeddings: Text (256 vocab), Mel (8194 vocab)
+│  │  │  ├─ Position Embeddings: Learned embeddings for mel/text
+│  │  │  ├─ GPT Transformer: Configurable layers/heads
+│  │  │  ├─ Conditioning Encoder: Conformer or Perceiver-based
+│  │  │  ├─ Emotion Conditioning: Separate conformer + perceiver
+│  │  │  └─ Output Heads: Text prediction, Mel prediction
+│  │  │
+│  │  ├─ Parameters:
+│  │  │  ├─ layers: 8 (transformer depth)
+│  │  │  ├─ model_dim: 512 (embedding dimension)
+│  │  │  ├─ heads: 8 (attention heads)
+│  │  │  ├─ max_text_tokens: 120
+│  │  │  ├─ max_mel_tokens: 250
+│  │  │  ├─ number_mel_codes: 8194
+│  │  │  ├─ condition_type: "conformer_perceiver" or "conformer_encoder"
+│  │  │  └─ Various activation functions
+│  │  │
+│  │  ├─ Key Methods:
+│  │  │  ├─ forward() - Forward pass
+│  │  │  ├─ post_init_gpt2_config() - Initialize for inference
+│  │  │  ├─ generate_mel() - Mel token generation
+│  │  │  ├─ forward_with_cond_scale() - With classifier-free guidance
+│  │  │  └─ Cache management
+│  │  │
+│  │  └─ Conditioning System:
+│  │     ├─ Speaker conditioning via mel spectrogram
+│  │     ├─ Conformer encoder for speaker features
+│  │     ├─ Perceiver for attention pooling
+│  │     ├─ Emotion conditioning (separate pathway)
+│  │     └─ Emotion vector support (8-dimensional)
+│  │
+│  ├─ ResBlock (40+ lines)
+│  │  ├─ Conv1d layers with GroupNorm
+│  │  └─ ReLU activation with residual connection
+│  │
+│  ├─ GPT2InferenceModel (200+ lines)
+│  │  ├─ Inference wrapper for GPT2
+│  │  ├─ KV cache support
+│  │  ├─ Model parallelism support
+│  │  └─ Token-by-token generation
+│  │
+│  ├─ ConditioningEncoder (30 lines)
+│  │  ├─ Conv1d initialization
+│  │  ├─ Attention blocks
+│  │  └─ Optional mean pooling
+│  │
+│  ├─ MelEncoder (30 lines)
+│  │  ├─ Conv1d layers
+│  │  ├─ ResBlocks
+│  │  └─ 4x reduction
+│  │
+│  ├─ LearnedPositionEmbeddings (15 lines)
+│  │  └─ Learnable positional embeddings
+│  │
+│  └─ build_hf_gpt_transformer() (20 lines)
+│     └─ Builds HuggingFace GPT2 with custom embeddings
+│
+├─ External Dependencies: torch, transformers, indextts.gpt modules
+└─ Critical Inference Parameters:
+   ├─ Temperature control for generation
+   ├─ Top-k/top-p sampling
+   ├─ Classifier-free guidance scale
+   └─ Generation length limits
+/home/user/IndexTTS-Rust/indextts/gpt/conformer_encoder.py (520 LINES) ⭐⭐
+├─ Purpose: Conformer-based speaker conditioning encoder
+├─ Key Classes:
+│  ├─ ConformerEncoder (main)
+│  │  ├─ Modules:
+│  │  │  ├─ Subsampling layer (Conv2d)
+│  │  │  ├─ Positional encoding
+│  │  │  ├─ Conformer blocks
+│  │  │  ├─ Layer normalization
+│  │  │  └─ Optional projection layer
+│  │  │
+│  │  ├─ Configuration Parameters:
+│  │  │  ├─ input_size: 1024 (mel spectrogram bins)
+│  │  │  ├─ output_size: depends on config
+│  │  │  ├─ linear_units: hidden dim for FFN
+│  │  │  ├─ attention_heads: 8
+│  │  │  ├─ num_blocks: 4
+│  │  │  └─ input_layer: "linear" or "conv2d"
+│  │  │
+│  │  └─ Architecture: Conv → Pos Enc → [Conformer Block] * N → LayerNorm
+│  │
+│  ├─ ConformerBlock (80+ lines)
+│  │  ├─ Residual connections
+│  │  ├─ FFN → Attention → Conv → FFN structure
+│  │  ├─ Feed-forward network (2-layer with dropout)
+│  │  ├─ Multi-head self-attention
+│  │  ├─ Convolution module (depthwise)
+│  │  └─ Layer normalization
+│  │
+│  ├─ ConvolutionModule (50 lines)
+│  │  ├─ Pointwise Conv 1x1
+│  │  ├─ Depthwise Conv with kernel_size (e.g., 15)
+│  │  ├─ Batch normalization or layer normalization
+│  │  ├─ Activation (ReLU/SiLU)
+│  │  └─ Projection
+│  │
+│  ├─ PositionwiseFeedForward (15 lines)
+│  │  ├─ Dense layer (idim → hidden)
+│  │  ├─ Activation (ReLU)
+│  │  ├─ Dropout
+│  │  └─ Dense layer (hidden → idim)
+│  │
+│  └─ MultiHeadedAttention (custom)
+│     ├─ Scaled dot-product attention
+│     ├─ Multiple heads
+│     └─ Optional relative position bias
+│
+├─ External Dependencies: torch, custom conformer modules
+└─ Use Case: Processing mel spectrogram to extract speaker features
+/home/user/IndexTTS-Rust/indextts/gpt/perceiver.py (317 LINES) ⭐⭐
+├─ Purpose: Perceiver resampler for attention pooling
+├─ Key Classes:
+│  ├─ PerceiverResampler (250+ lines)
+│  │  ├─ Architecture:
+│  │  │  ├─ Learnable latent queries
+│  │  │  ├─ Cross-attention layers
+│  │  │  ├─ Feed-forward networks
+│  │  │  └─ Layer normalization
+│  │  │
+│  │  ├─ Parameters:
+│  │  │  ├─ dim: 512 (embedding dimension)
+│  │  │  ├─ dim_context: 512 (context dimension)
+│  │  │  ├─ num_latents: 32 (number of latent queries)
+│  │  │  ├─ num_latent_channels: 64
+│  │  │  ├─ num_layers: 6
+│  │  │  ├─ ff_mult: 4 (FFN expansion)
+│  │  │  └─ heads: 8
+│  │  │
+│  │  ├─ Key Methods:
+│  │  │  ├─ forward() - Attend and pool
+│  │  │  └─ _cross_attend_block() - Single cross-attention layer
+│  │  │
+│  │  └─ Cross-Attention Mechanism:
+│  │     ├─ Queries: Learnable latents
+│  │     ├─ Keys/Values: Input context
+│  │     ├─ Output: Pooled features (num_latents × dim)
+│  │     └─ FFN projection for dimension mixing
+│  │
+│  └─ FeedForward (15 lines)
+│     ├─ Dense (dim → hidden)
+│     ├─ GELU activation
+│     └─ Dense (hidden → dim)
+│
+├─ External Dependencies: torch, einsum operations
+└─ Use Case: Pool conditioning encoder output to fixed-size representation
+VOCODER & AUDIO SYNTHESIS FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/BigVGAN/models.py (1000+ LINES) ⭐⭐⭐
+├─ Purpose: BigVGAN neural vocoder for mel-to-audio conversion
+├─ Key Classes:
+│  ├─ BigVGAN (400+ lines)
+│  │  ├─ Architecture:
+│  │  │  ├─ Initial Conv1d (80 mel bins → 192 channels)
+│  │  │  ├─ Upsampling layers (transposed conv)
+│  │  │  ├─ AMP blocks (anti-aliased multi-period)
+│  │  │  ├─ Final Conv1d (channels → 1 waveform)
+│  │  │  └─ Tanh activation for output
+│  │  │
+│  │  ├─ Upsampling: 4x → 8x → 8x → 4x (256x total)
+│  │  │  ├─ Maps from 22050 Hz mel frames to audio samples
+│  │  │  ├─ Kernel sizes: [16, 16, 4, 4]
+│  │  │  └─ Padding: [6, 6, 2, 2]
+│  │  │
+│  │  ├─ Parameters:
+│  │  │  ├─ num_mels: 80
+│  │  │  ├─ num_freq: 513
+│  │  │  ├─ num_mels: 80
+│  │  │  ├─ n_fft: 1024
+│  │  │  ├─ hop_size: 256
+│  │  │  ├─ win_size: 1024
+│  │  │  ├─ sampling_rate: 22050
+│  │  │  ├─ freq_min: 0
+│  │  │  ├─ freq_max: None
+│  │  │  └─ use_cuda_kernel: bool
+│  │  │
+│  │  ├─ Key Methods:
+│  │  │  ├─ forward() - Mel → audio waveform
+│  │  │  ├─ from_pretrained() - Load from HuggingFace
+│  │  │  ├─ remove_weight_norm() - Remove spectral normalization
+│  │  │  └─ eval() - Set to evaluation mode
+│  │  │
+│  │  └─ Special Features:
+│  │     ├─ Weight normalization for training stability
+│  │     ├─ Spectral normalization option
+│  │     ├─ CUDA kernel support for activation functions
+│  │     ├─ Snake/SnakeBeta activation (periodic)
+│  │     └─ Anti-aliasing filters for high-quality upsampling
+│  │
+│  ├─ AMPBlock1 (50 lines)
+│  │  ├─ Architecture: Conv1d × 2 with activations
+│  │  ├─ Multiple dilation patterns [1, 3, 5]
+│  │  ├─ Residual connections
+│  │  ├─ Activation1d wrapper for anti-aliasing
+│  │  └─ Weight normalization
+│  │
+│  ├─ AMPBlock2 (40 lines)
+│  │  ├─ Similar to AMPBlock1 but simpler
+│  │  ├─ Dilation patterns [1, 3]
+│  │  └─ Residual connections
+│  │
+│  ├─ Activation1d (custom, from alias_free_activation/)
+│  │  ├─ Applies activation function (Snake/SnakeBeta)
+│  │  ├─ Optional anti-aliasing filter
+│  │  └─ Optional CUDA kernel for efficiency
+│  │
+│  ├─ Snake Activation (from activations.py)
+│  │  ├─ Formula: x + (1/alpha) * sin²(alpha * x)
+│  │  ├─ Periodic nonlinearity
+│  │  └─ Learnable alpha parameter
+│  │
+│  └─ SnakeBeta Activation (from activations.py)
+│     ├─ More complex periodic activation
+│     └─ Improved harmonic modeling
+│
+├─ External Dependencies: torch, scipy, librosa
+└─ Model Size: ~100 MB (pretrained weights)
+/home/user/IndexTTS-Rust/indextts/s2mel/modules/audio.py (83 LINES)
+├─ Purpose: Mel-spectrogram computation (DSP)
+├─ Key Functions:
+│  ├─ load_wav() - Load WAV file with scipy
+│  ├─ mel_spectrogram() - Compute mel spectrogram
+│  │  ├─ Parameters:
+│  │  │  ├─ y: waveform tensor
+│  │  │  ├─ n_fft: 1024
+│  │  │  ├─ num_mels: 80
+│  │  │  ├─ sampling_rate: 22050
+│  │  │  ├─ hop_size: 256
+│  │  │  ├─ win_size: 1024
+│  │  │  ├─ fmin: 0
+│  │  │  └─ fmax: None or 8000
+│  │  │
+│  │  ├─ Process:
+│  │  │  1. Pad input with reflect padding
+│  │  │  2. Compute STFT (Short-Time Fourier Transform)
+│  │  │  3. Convert to magnitude spectrogram
+│  │  │  4. Apply mel filterbank (librosa)
+│  │  │  5. Apply dynamic range compression (log)
+│  │  │  └─ Output: [1, 80, T] tensor
+│  │  │
+│  │  └─ Caching:
+│  │     ├─ Caches mel filterbank matrices
+│  │     ├─ Caches Hann windows
+│  │     └─ Device-specific caching
+│  │
+│  ├─ dynamic_range_compression() - Log compression
+│  ├─ dynamic_range_decompression() - Inverse
+│  └─ spectral_normalize/denormalize()
+│
+├─ Critical DSP Parameters:
+│  ├─ STFT Window: Hann window
+│  ├─ FFT Size: 1024
+│  ├─ Hop Size: 256 (11.6 ms at 22050 Hz)
+│  ├─ Mel Bins: 80 (perceptual scale)
+│  ├─ Min Freq: 0 Hz
+│  └─ Max Freq: Variable (8000 Hz or Nyquist)
+│
+└─ External Dependencies: torch, librosa, scipy
+SEMANTIC CODEC & FEATURE EXTRACTION FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/utils/maskgct_utils.py (250 LINES)
+├─ Purpose: Build and manage semantic codecs
+├─ Key Functions:
+│  ├─ build_semantic_model()
+│  │  ├─ Loads: facebook/w2v-bert-2.0 model
+│  │  ├─ Extracts: wav2vec 2.0 BERT embeddings
+│  │  ├─ Returns: model, mean, std (for normalization)
+│  │  └─ Output: 1024-dimensional embeddings
+│  │
+│  ├─ build_semantic_codec()
+│  │  ├─ Creates: RepCodec (residual vector quantization)
+│  │  ├─ Quantizes: Semantic embeddings
+│  │  ├─ Returns: Codec model
+│  │  └─ Output: Discrete tokens
+│  │
+│  ├─ build_s2a_model()
+│  │  ├─ Builds: MaskGCT_S2A (semantic-to-acoustic)
+│  │  └─ Maps: Semantic codes → acoustic codes
+│  │
+│  ├─ build_acoustic_codec()
+│  │  ├─ Encoder: Encodes acoustic features
+│  │  ├─ Decoder: Decodes codes → audio
+│  │  └─ Multiple codec variants
+│  │
+│  └─ Inference_Pipeline (class)
+│     ├─ Combines all codecs
+│     ├─ Methods:
+│     │  ├─ get_emb() - Get semantic embeddings
+│     │  ├─ get_scode() - Quantize to semantic codes
+│     │  ├─ semantic2acoustic() - Convert codes
+│     │  └─ s2a_inference() - Full pipeline
+│     └─ Diffusion-based generation options
+│
+├─ External Dependencies: torch, transformers, huggingface_hub
+└─ Pre-trained Models:
+   ├─ W2V-BERT-2.0: 614M parameters
+   ├─ MaskGCT: From amphion/MaskGCT
+   └─ Various codec checkpoints
+CONFIGURATION & UTILITY FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/indextts/utils/checkpoint.py (50 LINES)
+├─ Purpose: Load model checkpoints
+├─ Key Functions:
+│  ├─ load_checkpoint() - Load weights into model
+│  └─ Device handling (CPU/GPU/XPU/MPS)
+└─ Supported Formats: .pth, .safetensors
+/home/user/IndexTTS-Rust/indextts/utils/arch_util.py
+├─ Purpose: Architecture utility modules
+├─ Key Classes:
+│  └─ AttentionBlock - Generic attention layer
+└─ Used in: Conditioning encoder, other modules
+/home/user/IndexTTS-Rust/indextts/utils/xtransformers.py (1,600 LINES)
+├─ Purpose: Extended transformer utilities
+├─ Key Components:
+│  ├─ Advanced attention mechanisms
+│  ├─ Relative position bias
+│  ├─ Cross-attention patterns
+│  └─ Various position encoding schemes
+└─ Used in: GPT model, encoders
+TESTING FILES
+═════════════════════════════════════════════════════════════════════════════════
+/home/user/IndexTTS-Rust/tests/regression_test.py
+├─ Test Cases:
+│  ├─ Chinese text with pinyin tones (晕 XUAN4)
+│  ├─ English text
+│  ├─ Mixed Chinese-English
+│  ├─ Long-form text with multiple sentences
+│  ├─ Named entities (Joseph Gordon-Levitt)
+│  ├─ Chinese names (约瑟夫·高登-莱维特)
+│  └─ Extended passages for robustness
+├─ Inference Modes:
+│  ├─ Single inference (infer)
+│  └─ Fast inference (infer_fast)
+└─ Output: WAV files in outputs/ directory
+/home/user/IndexTTS-Rust/tests/padding_test.py
+├─ Test Scenarios:
+│  ├─ Variable length inputs
+│  ├─ Batch processing
+│  ├─ Edge cases
+│  └─ Padding handling
+└─ Purpose: Ensure robust padding mechanics
+═════════════════════════════════════════════════════════════════════════════════
+KEY ALGORITHMS SUMMARY:
+1. TEXT PROCESSING:
+   - Regex-based pattern matching for pinyin/names
+   - Character-level CJK tokenization
+   - SentencePiece BPE encoding
+   - Language detection (Chinese vs English)
+2. FEATURE EXTRACTION:
+   - W2V-BERT semantic embeddings (1024-dim)
+   - RepCodec quantization
+   - Mel-spectrogram (STFT-based, 80-dim)
+   - CAMPPlus speaker embeddings (192-dim)
+3. SEQUENCE GENERATION:
+   - GPT-based autoregressive generation
+   - Conformer speaker conditioning
+   - Perceiver pooling for attention
+   - Classifier-free guidance (optional)
+   - Temperature/top-k/top-p sampling
+4. AUDIO SYNTHESIS:
+   - Transposed convolution upsampling (256x)
+   - Anti-aliased activation functions
+   - Residual connections
+   - Weight/spectral normalization
+5. EMOTION CONTROL:
+   - 8-dimensional emotion vectors
+   - Text-based emotion detection (via Qwen)
+   - Audio-based emotion extraction
+   - Emotion matrix interpolation
+═════════════════════════════════════════════════════════════════════════════════