IndexTTS-Rust/ (Complete Directory Structure)
│
├── indextts/                          # Main Python package (194 files)
│   │
│   ├── __init__.py                    # Package initialization
│   ├── cli.py                         # Command-line interface (64 lines)
│   ├── infer.py                       # Original inference (v1) - 690 lines
│   ├── infer_v2.py                    # Main inference v2 - 739 lines ⭐⭐⭐
│   │
│   ├── gpt/                           # GPT-based TTS model (9 files, 16,953 lines)
│   │   ├── __init__.py
│   │   ├── model.py                   # Original UnifiedVoice (713L)
│   │   ├── model_v2.py                # UnifiedVoice v2 ⭐⭐⭐ (747L)
│   │   ├── conformer_encoder.py       # Conformer encoder ⭐⭐ (520L)
│   │   ├── perceiver.py               # Perceiver resampler (317L)
│   │   ├── transformers_gpt2.py              # GPT2 implementation (1,878L)
│   │   ├── transformers_generation_utils.py  # Generation utilities (4,747L)
│   │   ├── transformers_beam_search.py       # Beam search (1,013L)
│   │   └── transformers_modeling_utils.py    # Model utilities (5,525L)
│   │
│   ├── BigVGAN/                       # Neural Vocoder (6+ files, ~1,000+ lines)
│   │   ├── __init__.py
│   │   ├── models.py                  # BigVGAN architecture ⭐⭐⭐
│   │   ├── ECAPA_TDNN.py              # Speaker encoder
│   │   ├── activations.py             # Snake, SnakeBeta activations
│   │   ├── utils.py                   # Helper functions
│   │   │
│   │   ├── alias_free_activation/     # CUDA kernel variants
│   │   │   ├── cuda/
│   │   │   │   ├── activation1d.py    # CUDA kernel loader
│   │   │   │   └── load.py
│   │   │   └── torch/
│   │   │       ├── act.py             # PyTorch activation
│   │   │       ├── filter.py          # Anti-aliasing filter
│   │   │       └── resample.py        # Resampling
│   │   │
│   │   ├── alias_free_torch/          # PyTorch-only fallback
│   │   │   ├── act.py
│   │   │   ├── filter.py
│   │   │   └── resample.py
│   │   │
│   │   └── nnet/                      # Network modules
│   │       ├── linear.py
│   │       ├── normalization.py
│   │       └── CNN.py
│   │
│   ├── s2mel/                         # Semantic-to-Mel Models (~2,000+ lines)
│   │   ├── modules/                   # Core modules (10+ files)
│   │   │   ├── audio.py               # Mel-spectrogram computation ⭐
│   │   │   ├── commons.py             # Common utilities (21KB)
│   │   │   ├── layers.py              # NN layers (13KB)
│   │   │   ├── length_regulator.py    # Duration modeling
│   │   │   ├── flow_matching.py       # Continuous flow matching
│   │   │   ├── diffusion_transformer.py  # Diffusion model
│   │   │   ├── rmvpe.py               # Pitch extraction (22KB)
│   │   │   ├── quantize.py            # Quantization
│   │   │   ├── encodec.py             # EnCodec codec
│   │   │   ├── wavenet.py             # WaveNet implementation
│   │   │   │
│   │   │   ├── bigvgan/               # BigVGAN vocoder
│   │   │   │   ├── modules.py
│   │   │   │   ├── config.json
│   │   │   │   ├── bigvgan.py
│   │   │   │   ├── alias_free_activation/  # Variants
│   │   │   │   └── models.py
│   │   │   │
│   │   │   ├── vocos/                 # Vocos codec
│   │   │   ├── hifigan/               # HiFiGAN vocoder
│   │   │   ├── openvoice/             # OpenVoice components (11 files)
│   │   │   ├── campplus/              # CAMPPlus speaker encoder
│   │   │   │   └── DTDNN.py           # DTDNN architecture
│   │   │   └── gpt_fast/              # Fast GPT inference
│   │   │
│   │   ├── dac/                       # DAC codec
│   │   │   ├── model/
│   │   │   ├── nn/
│   │   │   └── utils/
│   │   │
│   │   └── (other s2mel implementations)
│   │
│   ├── utils/                         # Text & Feature Utils (12+ files, ~500L)
│   │   ├── __init__.py
│   │   ├── front.py                   # TextNormalizer, TextTokenizer ⭐⭐⭐ (700L)
│   │   ├── maskgct_utils.py           # Semantic codec builders (250L)
│   │   ├── arch_util.py               # AttentionBlock, utilities
│   │   ├── checkpoint.py              # Model loading
│   │   ├── xtransformers.py           # Transformer utils (1,600L)
│   │   ├── feature_extractors.py      # MelSpectrogramFeatures
│   │   ├── common.py                  # Common functions
│   │   ├── text_utils.py              # Text utilities
│   │   ├── typical_sampling.py        # TypicalLogitsWarper sampling
│   │   ├── utils.py                   # General utils
│   │   ├── webui_utils.py             # Web UI helpers
│   │   ├── tagger_cache/              # Text normalization cache
│   │   │
│   │   └── maskgct/                   # MaskGCT codec (100+ files, ~10,000+ lines)
│   │       └── models/
│   │           ├── codec/             # Multiple codec implementations
│   │           │   ├── amphion_codec/            # Amphion codec
│   │           │   │   ├── codec.py
│   │           │   │   ├── vocos.py
│   │           │   │   └── quantize/             # Quantization
│   │           │   │       ├── vector_quantize.py
│   │           │   │       ├── residual_vq.py
│   │           │   │       ├── factorized_vector_quantize.py
│   │           │   │       └── lookup_free_quantize.py
│   │           │   │
│   │           │   ├── facodec/                  # FACodec variant
│   │           │   │   ├── facodec_inference.py
│   │           │   │   ├── modules/
│   │           │   │   │   ├── commons.py
│   │           │   │   │   ├── attentions.py
│   │           │   │   │   ├── layers.py
│   │           │   │   │   ├── quantize.py
│   │           │   │   │   ├── wavenet.py
│   │           │   │   │   ├── style_encoder.py
│   │           │   │   │   ├── gradient_reversal.py
│   │           │   │   │   └── JDC/              # Pitch detection
│   │           │   │   └── alias_free_torch/     # Anti-aliasing
│   │           │   │
│   │           │   ├── speechtokenizer/          # Speech Tokenizer codec
│   │           │   │   ├── model.py
│   │           │   │   └── modules/
│   │           │   │       ├── seanet.py
│   │           │   │       ├── lstm.py
│   │           │   │       ├── norm.py
│   │           │   │       ├── conv.py
│   │           │   │       └── quantization/
│   │           │   │
│   │           │   ├── ns3_codec/                # NS3 codec variant
│   │           │   ├── vevo/                     # VEVo codec
│   │           │   ├── kmeans/                   # KMeans codec
│   │           │   ├── melvqgan/                 # MelVQ-GAN codec
│   │           │   │
│   │           │   ├── codec_inference.py
│   │           │   ├── codec_sampler.py
│   │           │   ├── codec_trainer.py
│   │           │   └── codec_dataset.py
│   │           │
│   │           └── tts/
│   │               └── maskgct/
│   │                   ├── maskgct_s2a.py        # Semantic-to-acoustic
│   │                   └── ckpt/
│   │
│   └── vqvae/                         # Vector Quantized VAE
│       ├── xtts_dvae.py               # Discrete VAE (currently disabled)
│       └── (other VAE components)
│
├── examples/                          # Sample Data & Test Cases
│   ├── cases.jsonl                    # Example test cases
│   ├── voice_*.wav                    # Sample voice prompts (12 files)
│   ├── emo_*.wav                      # Emotion reference samples (2 files)
│   └── sample_prompt.wav              # Default prompt (implied)
│
├── tests/                             # Test Suite
│   ├── regression_test.py             # Main regression tests ⭐
│   └── padding_test.py                # Padding/batch tests
│
├── tools/                             # Utility Scripts & i18n
│   ├── download_files.py              # Model downloading from HF
│   └── i18n/                          # Internationalization
│       ├── i18n.py                    # Translation system
│       ├── scan_i18n.py               # i18n scanner
│       └── locale/
│           ├── en_US.json             # English translations
│           └── zh_CN.json             # Chinese translations
│
├── archive/                           # Historical Docs
│   └── README_INDEXTTS_1_5.md         # IndexTTS 1.5 documentation
│
├── webui.py                           # Gradio Web UI ⭐⭐⭐ (18KB)
├── cli.py                             # Command-line interface
├── requirements.txt                   # Python dependencies
├── MANIFEST.in                        # Package manifest
├── .gitignore                         # Git ignore rules
├── .gitattributes                     # Git attributes
└── LICENSE                            # Apache 2.0 License
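
The file and line counts quoted throughout the tree above (and summarized under
TOTAL STATISTICS below) can be regenerated mechanically. A minimal Rust sketch,
assuming the walkdir crate is available as a dependency and the script is run
from the repository root; it counts only .py files and their raw line totals:

    // Recount Python files/lines under indextts/ (illustrative sketch, not part of the repo).
    use std::fs;
    use walkdir::WalkDir;

    fn main() {
        let (mut files, mut lines) = (0usize, 0usize);
        for entry in WalkDir::new("indextts").into_iter().filter_map(Result::ok) {
            // Only count Python source files.
            if entry.path().extension().map_or(false, |ext| ext == "py") {
                files += 1;
                if let Ok(src) = fs::read_to_string(entry.path()) {
                    lines += src.lines().count();
                }
            }
        }
        println!("{files} Python files, {lines} lines");
    }
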
═══════════════════════════════════════════════════════════════════════════════
KEY FILES BY IMPORTANCE:
═══════════════════════════════════════════════════════════════════════════════
⭐⭐⭐ CRITICAL (Core Logic - MUST Convert First)
  1. indextts/infer_v2.py            - Main inference pipeline (739L)
  2. indextts/gpt/model_v2.py        - UnifiedVoice GPT model (747L)
  3. indextts/utils/front.py         - Text processing (700L)
  4. indextts/BigVGAN/models.py      - Vocoder (1,000+L)
  5. indextts/s2mel/modules/audio.py - Mel-spectrogram (83L, critical DSP)

⭐⭐ HIGH PRIORITY (Major Components)
  1. indextts/gpt/conformer_encoder.py - Conformer blocks (520L)
  2. indextts/gpt/perceiver.py         - Perceiver attention (317L)
  3. indextts/utils/maskgct_utils.py   - Codec builders (250L)
  4. indextts/s2mel/modules/commons.py - Common utilities (21KB)

⭐ MEDIUM PRIORITY (Utilities & Optimization)
  1. indextts/utils/xtransformers.py - Transformer utils (1,600L)
  2. indextts/BigVGAN/activations.py - Activation functions
  3. indextts/s2mel/modules/rmvpe.py - Pitch extraction (22KB)

OPTIONAL (Web UI, Tools)
  1. webui.py                 - Gradio interface
  2. tools/download_files.py  - Model downloading
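
Taken together, the critical files above form a four-stage pipeline: text
front-end -> GPT token generation -> semantic-to-mel decoding -> vocoding.
A rough Rust sketch of how those stage boundaries could be expressed during
the port (all trait, type, and parameter names here are illustrative, not
existing code):

    // Hypothetical stage traits mirroring the critical Python modules above.
    /// Discrete semantic/mel tokens produced by the GPT stage (cf. "Max Mel Tokens: 250").
    pub struct MelTokens(pub Vec<u32>);
    /// Mel-spectrogram frames with 80 bins (cf. "Mel Spectrogram Bins: 80").
    pub struct MelSpectrogram(pub Vec<[f32; 80]>);

    /// Stage 1 - indextts/utils/front.py: text normalization + tokenization.
    pub trait TextFrontend {
        fn tokenize(&self, text: &str) -> Vec<u32>;
    }
    /// Stage 2 - indextts/gpt/model_v2.py: UnifiedVoice autoregressive model.
    pub trait UnifiedVoice {
        fn generate(&self, text_tokens: &[u32], speaker_prompt: &[f32]) -> MelTokens;
    }
    /// Stage 3 - indextts/s2mel/: semantic tokens to mel frames.
    pub trait SemanticToMel {
        fn decode(&self, tokens: &MelTokens) -> MelSpectrogram;
    }
    /// Stage 4 - indextts/BigVGAN/models.py: mel frames to a 22,050 Hz waveform.
    pub trait Vocoder {
        fn synthesize(&self, mel: &MelSpectrogram) -> Vec<f32>;
    }

    /// End-to-end composition, loosely mirroring indextts/infer_v2.py.
    pub fn infer(
        fe: &dyn TextFrontend,
        gpt: &dyn UnifiedVoice,
        s2m: &dyn SemanticToMel,
        voc: &dyn Vocoder,
        text: &str,
        speaker_prompt: &[f32],
    ) -> Vec<f32> {
        let tokens = gpt.generate(&fe.tokenize(text), speaker_prompt);
        voc.synthesize(&s2m.decode(&tokens))
    }
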
═══════════════════════════════════════════════════════════════════════════════
TOTAL STATISTICS:
═══════════════════════════════════════════════════════════════════════════════
Total Python Files: 194
Total Lines of Code: ~25,000+
  GPT Module:      16,953 lines
  MaskGCT Codecs:  ~10,000+ lines
  S2Mel Models:    ~2,000+ lines
  BigVGAN:         ~1,000+ lines
  Utils:           ~500 lines
  Tests:           ~100 lines
Models Supported: 6 major HuggingFace models
Languages: Chinese (full), English (full), mixed Chinese-English text
Emotion Dimensions: 8-dimensional emotion control
Audio Sample Rate: 22,050 Hz (primary)
Max Text Tokens: 120
Max Mel Tokens: 250
Mel Spectrogram Bins: 80
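
For the Rust port, the audio/model limits listed above map naturally onto
compile-time constants. A sketch (struct and constant names are illustrative,
not taken from the existing codebase):

    // Limits copied from TOTAL STATISTICS above; names are hypothetical.
    pub struct ModelLimits {
        pub sample_rate_hz: u32,     // primary audio sample rate
        pub max_text_tokens: usize,  // per generation segment
        pub max_mel_tokens: usize,   // per generation segment
        pub mel_bins: usize,         // mel-spectrogram bins
        pub emotion_dims: usize,     // emotion control vector size
    }

    pub const LIMITS: ModelLimits = ModelLimits {
        sample_rate_hz: 22_050,
        max_text_tokens: 120,
        max_mel_tokens: 250,
        mel_bins: 80,
        emotion_dims: 8,
    };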