IndexTTS-Rust/ (Complete Directory Structure)
│
├── indextts/                              # Main Python package (194 files)
│   │
│   ├── __init__.py                        # Package initialization
│   ├── cli.py                             # Command-line interface (64L)
│   ├── infer.py                           # Original inference (v1) (690L)
│   ├── infer_v2.py                        # Main inference v2 (739L) ★★★
│   │
│   ├── gpt/                               # GPT-based TTS model (9 files, 16,953 lines)
│   │   ├── __init__.py
│   │   ├── model.py                       # Original UnifiedVoice (713L)
│   │   ├── model_v2.py                    # UnifiedVoice v2 ★★★ (747L)
│   │   ├── conformer_encoder.py           # Conformer encoder ★★ (520L)
│   │   ├── perceiver.py                   # Perceiver resampler (317L)
│   │   ├── transformers_gpt2.py           # GPT2 implementation (1,878L)
│   │   ├── transformers_generation_utils.py   # Generation utilities (4,747L)
│   │   ├── transformers_beam_search.py    # Beam search (1,013L)
│   │   └── transformers_modeling_utils.py # Model utilities (5,525L)
│   │
│   ├── BigVGAN/                           # Neural Vocoder (6+ files, ~1,000+ lines)
│   │   ├── __init__.py
│   │   ├── models.py                      # BigVGAN architecture ★★★
│   │   ├── ECAPA_TDNN.py                  # Speaker encoder
│   │   ├── activations.py                 # Snake, SnakeBeta activations
│   │   ├── utils.py                       # Helper functions
│   │   │
│   │   ├── alias_free_activation/         # CUDA kernel variants
│   │   │   ├── cuda/
│   │   │   │   ├── activation1d.py        # CUDA kernel loader
│   │   │   │   └── load.py
│   │   │   └── torch/
│   │   │       ├── act.py                 # PyTorch activation
│   │   │       ├── filter.py              # Anti-aliasing filter
│   │   │       └── resample.py            # Resampling
│   │   │
│   │   ├── alias_free_torch/              # PyTorch-only fallback
│   │   │   ├── act.py
│   │   │   ├── filter.py
│   │   │   └── resample.py
│   │   │
│   │   └── nnet/                          # Network modules
│   │       ├── linear.py
│   │       ├── normalization.py
│   │       └── CNN.py
│   │
│   ├── s2mel/                             # Semantic-to-Mel Models (~2,000+ lines)
│   │   ├── modules/                       # Core modules (10+ files)
│   │   │   ├── audio.py                   # Mel-spectrogram computation ★
│   │   │   ├── commons.py                 # Common utilities (21KB)
│   │   │   ├── layers.py                  # NN layers (13KB)
│   │   │   ├── length_regulator.py        # Duration modeling
│   │   │   ├── flow_matching.py           # Continuous flow matching
│   │   │   ├── diffusion_transformer.py   # Diffusion model
│   │   │   ├── rmvpe.py                   # Pitch extraction (22KB)
│   │   │   ├── quantize.py                # Quantization
│   │   │   ├── encodec.py                 # EnCodec codec
│   │   │   ├── wavenet.py                 # WaveNet implementation
│   │   │   │
│   │   │   ├── bigvgan/                   # BigVGAN vocoder
│   │   │   │   ├── modules.py
│   │   │   │   ├── config.json
│   │   │   │   ├── bigvgan.py
│   │   │   │   ├── alias_free_activation/ # Variants
│   │   │   │   └── models.py
│   │   │   │
│   │   │   ├── vocos/                     # Vocos codec
│   │   │   ├── hifigan/                   # HiFiGAN vocoder
│   │   │   ├── openvoice/                 # OpenVoice components (11 files)
│   │   │   ├── campplus/                  # CAMPPlus speaker encoder
│   │   │   │   └── DTDNN.py               # DTDNN architecture
│   │   │   └── gpt_fast/                  # Fast GPT inference
│   │   │
│   │   ├── dac/                           # DAC codec
│   │   │   ├── model/
│   │   │   ├── nn/
│   │   │   └── utils/
│   │   │
│   │   └── (other s2mel implementations)
│   │
│   ├── utils/                             # Text & Feature Utils (12+ files, ~500L)
│   │   ├── __init__.py
│   │   ├── front.py                       # TextNormalizer, TextTokenizer ★★★ (700L)
│   │   ├── maskgct_utils.py               # Semantic codec builders (250L)
│   │   ├── arch_util.py                   # AttentionBlock, utilities
│   │   ├── checkpoint.py                  # Model loading
│   │   ├── xtransformers.py               # Transformer utils (1,600L)
│   │   ├── feature_extractors.py          # MelSpectrogramFeatures
│   │   ├── common.py                      # Common functions
│   │   ├── text_utils.py                  # Text utilities
│   │   ├── typical_sampling.py            # TypicalLogitsWarper sampling
│   │   ├── utils.py                       # General utils
│   │   ├── webui_utils.py                 # Web UI helpers
│   │   ├── tagger_cache/                  # Text normalization cache
│   │   │
│   │   └── maskgct/                       # MaskGCT codec (100+ files, 10KB+)
│   │       ├── models/
│   │       ├── codec/                     # Multiple codec implementations
│   │       │   ├── amphion_codec/         # Amphion codec
│   │       │   │   ├── codec.py
│   │       │   │   ├── vocos.py
│   │       │   │   └── quantize/          # Quantization
│   │       │   │       ├── vector_quantize.py
│   │       │   │       ├── residual_vq.py
│   │       │   │       ├── factorized_vector_quantize.py
│   │       │   │       └── lookup_free_quantize.py
│   │       │   │
│   │       │   ├── facodec/               # FACodec variant
│   │       │   │   ├── facodec_inference.py
│   │       │   │   ├── modules/
│   │       │   │   │   ├── commons.py
│   │       │   │   │   ├── attentions.py
│   │       │   │   │   ├── layers.py
│   │       │   │   │   ├── quantize.py
│   │       │   │   │   ├── wavenet.py
│   │       │   │   │   ├── style_encoder.py
│   │       │   │   │   ├── gradient_reversal.py
│   │       │   │   │   └── JDC/           # Pitch detection
│   │       │   │   └── alias_free_torch/  # Anti-aliasing
│   │       │   │
│   │       │   ├── speechtokenizer/       # SpeechTokenizer codec
│   │       │   │   ├── model.py
│   │       │   │   └── modules/
│   │       │   │       ├── seanet.py
│   │       │   │       ├── lstm.py
│   │       │   │       ├── norm.py
│   │       │   │       ├── conv.py
│   │       │   │       └── quantization/
│   │       │   │
│   │       │   ├── ns3_codec/             # NS3 codec variant
│   │       │   ├── vevo/                  # Vevo codec
│   │       │   ├── kmeans/                # KMeans codec
│   │       │   ├── melvqgan/              # MelVQ-GAN codec
│   │       │   │
│   │       │   ├── codec_inference.py
│   │       │   ├── codec_sampler.py
│   │       │   ├── codec_trainer.py
│   │       │   └── codec_dataset.py
│   │       │
│   │       ├── tts/
│   │       │   └── maskgct/
│   │       │       └── maskgct_s2a.py     # Semantic-to-acoustic
│   │       └── ckpt/
│   │
│   └── vqvae/                             # Vector Quantized VAE
│       ├── xtts_dvae.py                   # Discrete VAE (currently disabled)
│       └── (other VAE components)
│
├── examples/                              # Sample Data & Test Cases
│   ├── cases.jsonl                        # Example test cases
│   ├── voice_*.wav                        # Sample voice prompts (12 files)
│   ├── emo_*.wav                          # Emotion reference samples (2 files)
│   └── sample_prompt.wav                  # Default prompt (implied)
│
├── tests/                                 # Test Suite
│   ├── regression_test.py                 # Main regression tests ★
│   └── padding_test.py                    # Padding/batch tests
│
├── tools/                                 # Utility Scripts & i18n
│   ├── download_files.py                  # Model downloading from HF
│   └── i18n/                              # Internationalization
│       ├── i18n.py                        # Translation system
│       ├── scan_i18n.py                   # i18n scanner
│       └── locale/
│           ├── en_US.json                 # English translations
│           └── zh_CN.json                 # Chinese translations
│
├── archive/                               # Historical Docs
│   └── README_INDEXTTS_1_5.md             # IndexTTS 1.5 documentation
│
├── webui.py                               # Gradio Web UI ★★★ (18KB)
├── cli.py                                 # Command-line interface
├── requirements.txt                       # Python dependencies
├── MANIFEST.in                            # Package manifest
├── .gitignore                             # Git ignore rules
├── .gitattributes                         # Git attributes
└── LICENSE                                # Apache 2.0 License
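
For the Rust port, a natural starting point is a crate whose module layout mirrors the Python package above. The skeleton below is only a sketch: module and type names are illustrative placeholders chosen to match the Python directories, not a committed API.

// Hypothetical src/lib.rs skeleton mirroring the Python package layout.
// Module names follow the Python directories; the structs are placeholders.
pub mod text {
    // indextts/utils/front.py: text normalization + tokenization
    pub struct TextNormalizer;
    pub struct TextTokenizer;
}

pub mod gpt {
    // indextts/gpt/model_v2.py: UnifiedVoice v2 autoregressive model
    pub struct UnifiedVoice;
}

pub mod s2mel {
    // indextts/s2mel/: semantic-token-to-mel models (length regulator, flow matching)
    pub struct SemanticToMel;
}

pub mod bigvgan {
    // indextts/BigVGAN/models.py: mel-to-waveform neural vocoder
    pub struct BigVgan;
}

pub mod infer {
    // indextts/infer_v2.py: top-level pipeline that wires the stages together
    pub struct IndexTts2;
}
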
────────────────────────────────────────────────────────────────────────────────
KEY FILES BY IMPORTANCE:
────────────────────────────────────────────────────────────────────────────────
★★★ CRITICAL (Core Logic - MUST Convert First; data flow sketched below)
  1. indextts/infer_v2.py            - Main inference pipeline (739L)
  2. indextts/gpt/model_v2.py        - UnifiedVoice GPT model (747L)
  3. indextts/utils/front.py         - Text processing (700L)
  4. indextts/BigVGAN/models.py      - Vocoder (1,000+L)
  5. indextts/s2mel/modules/audio.py - Mel-spectrogram (83L, critical DSP)

★★ HIGH PRIORITY (Major Components)
  1. indextts/gpt/conformer_encoder.py - Conformer blocks (520L)
  2. indextts/gpt/perceiver.py         - Perceiver attention (317L)
  3. indextts/utils/maskgct_utils.py   - Codec builders (250L)
  4. indextts/s2mel/modules/commons.py - Common utilities (21KB)

★ MEDIUM PRIORITY (Utilities & Optimization)
  1. indextts/utils/xtransformers.py - Transformer utils (1,600L)
  2. indextts/BigVGAN/activations.py - Activation functions
  3. indextts/s2mel/modules/rmvpe.py - Pitch extraction (22KB)

OPTIONAL (Web UI, Tools)
  1. webui.py                - Gradio interface
  2. tools/download_files.py - Model downloading
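
The CRITICAL files above form the core synthesis path: front.py normalizes and tokenizes text, model_v2.py autoregressively produces semantic/mel tokens, the s2mel modules turn those tokens into an 80-bin mel-spectrogram, and BigVGAN renders the 22,050 Hz waveform. Below is a minimal Rust sketch of that data flow; every type and method name is invented for illustration and does not reflect the actual port's API.

// Hypothetical end-to-end flow implied by the CRITICAL files
// (infer_v2.py -> gpt/model_v2.py -> s2mel -> BigVGAN).
struct Mel(Vec<[f32; 80]>);   // 80-bin mel frames (doc: 80 mel-spectrogram bins)
struct Audio(Vec<f32>);       // PCM samples at 22,050 Hz (doc: primary sample rate)

trait Stage<I, O> {
    fn run(&self, input: I) -> O;
}

struct Pipeline<T, G, S, V> {
    tokenizer: T, // utils/front.py: TextNormalizer + TextTokenizer
    gpt: G,       // gpt/model_v2.py: UnifiedVoice v2 (text tokens -> semantic/mel tokens)
    s2mel: S,     // s2mel/: semantic tokens -> mel-spectrogram
    vocoder: V,   // BigVGAN/models.py: mel -> waveform
}

impl<T, G, S, V> Pipeline<T, G, S, V>
where
    T: Stage<String, Vec<i64>>,
    G: Stage<Vec<i64>, Vec<i64>>,
    S: Stage<Vec<i64>, Mel>,
    V: Stage<Mel, Audio>,
{
    fn synthesize(&self, text: &str) -> Audio {
        let text_tokens = self.tokenizer.run(text.to_string());
        let semantic_tokens = self.gpt.run(text_tokens);
        let mel = self.s2mel.run(semantic_tokens);
        self.vocoder.run(mel)
    }
}

Keeping each stage behind a small trait like this lets the GPT, s2mel, and vocoder ports be tested independently against reference outputs from the Python implementation.
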
────────────────────────────────────────────────────────────────────────────────
TOTAL STATISTICS:
────────────────────────────────────────────────────────────────────────────────
Total Python Files:     194
Total Lines of Code:    ~25,000+
  GPT Module:           16,953 lines
  MaskGCT Codecs:       ~10,000+ lines
  S2Mel Models:         ~2,000+ lines
  BigVGAN:              ~1,000+ lines
  Utils:                ~500 lines
  Tests:                ~100 lines
Models Supported:       6 major HuggingFace models
Languages:              Chinese (full), English (full), mixed Chinese/English
Emotion Dimensions:     8-dimensional emotion control
Audio Sample Rate:      22,050 Hz (primary)
Max Text Tokens:        120
Max Mel Tokens:         250
Mel Spectrogram Bins:   80
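
Collected in one place, the numeric constants above translate to a small settings struct as it might appear in the Rust port. Field names are illustrative, and the values are taken from this summary rather than from a verified model config file.

/// Constants from the statistics above; names are hypothetical.
pub struct ModelConfig {
    pub sample_rate_hz: u32,    // 22,050 Hz primary output rate
    pub n_mel_bins: usize,      // 80 mel-spectrogram bins
    pub max_text_tokens: usize, // 120
    pub max_mel_tokens: usize,  // 250
    pub emotion_dims: usize,    // 8-dimensional emotion control vector
}

impl Default for ModelConfig {
    fn default() -> Self {
        Self {
            sample_rate_hz: 22_050,
            n_mel_bins: 80,
            max_text_tokens: 120,
            max_mel_tokens: 250,
            emotion_dims: 8,
        }
    }
}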