openbmb/VoxCPM1.5

@@ -112,8 +112,8 @@ wav = model.generate(
     prompt_text=None,          # optional: reference text
     cfg_value=2.0,             # LM guidance on LocDiT; higher values follow the prompt more closely, but the result may be worse overall
     inference_timesteps=10,   # LocDiT inference timesteps, higher for better result, lower for fast speed
-    normalize=True,           # enable external TN tool, but will disable native raw text support
-    denoise=True,             # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz
     retry_badcase=True,        # enable retrying mode for some bad cases (unstoppable)
     retry_badcase_max_times=3,  # maximum retrying times
     retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective); can be adjusted for slow-paced speech
@@ -148,14 +148,14 @@ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, desi
   --prompt-audio path/to/voice.wav \
   --prompt-text "reference transcript" \
   --output out.wav \
-  --denoise
 # (Optional) Voice cloning (reference audio + transcript file)
 voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
   --prompt-audio path/to/voice.wav \
   --prompt-file "/path/to/text-file" \
   --output out.wav \
-  --denoise
 # 3) Batch processing (one text per line)
 voxcpm --input examples/input.txt --output-dir outs
@@ -163,7 +163,7 @@ voxcpm --input examples/input.txt --output-dir outs
 voxcpm --input examples/input.txt --output-dir outs \
   --prompt-audio path/to/voice.wav \
   --prompt-text "reference transcript" \
-  --denoise
 # 4) Inference parameters (quality/speed)
 voxcpm --text "..." --output out.wav \
@@ -216,29 +216,38 @@ First, choose how you’d like to input your text:
 - ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
 2. Phoneme Input (Native Mode)
 - ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text. Try it out!
 ---
 ### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)
 This is the secret sauce that gives your audio its unique sound.
-1. Cooking with a Prompt Speech (Following a Famous Recipe)
-  - A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
-  - For a Clean, Studio-Quality Voice:
-    - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone.
-2. Cooking au Naturel (Letting the Model Improvise)
-  - If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
-  - Pro Tip: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
 ---
 ### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
 You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
-- CFG Value (How Closely to Follow the Recipe)
-  - Default: A great starting point.
-  - Voice sounds strained or weird? Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
-  - Need maximum clarity and adherence to the text? Raise it slightly to keep the model on a tighter leash.
-- Inference Timesteps (Simmering Time: Quality vs. Speed)
-  - Need a quick snack? Use a lower number. Perfect for fast drafts and experiments.
-  - Cooking a gourmet meal? Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
 ---
 Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
@@ -259,4 +268,3 @@ Happy creating! 🎉 Start with the default settings and tweak from there to sui
 ## 📄 License
 The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.

     prompt_text=None,          # optional: reference text
     cfg_value=2.0,             # LM guidance on LocDiT; higher values follow the prompt more closely, but the result may be worse overall
     inference_timesteps=10,   # LocDiT inference timesteps, higher for better result, lower for fast speed
+    normalize=False,           # enable external TN tool, but will disable native raw text support
+    denoise=False,             # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz
     retry_badcase=True,        # enable retrying mode for some bad cases (unstoppable)
     retry_badcase_max_times=3,  # maximum retrying times
     retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective); can be adjusted for slow-paced speech
   --prompt-audio path/to/voice.wav \
   --prompt-text "reference transcript" \
   --output out.wav \
+  # --denoise
 # (Optional) Voice cloning (reference audio + transcript file)
 voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
   --prompt-audio path/to/voice.wav \
   --prompt-file "/path/to/text-file" \
   --output out.wav \
+  # --denoise
 # 3) Batch processing (one text per line)
 voxcpm --input examples/input.txt --output-dir outs
 voxcpm --input examples/input.txt --output-dir outs \
   --prompt-audio path/to/voice.wav \
   --prompt-text "reference transcript" \
+  # --denoise
 # 4) Inference parameters (quality/speed)
 voxcpm --text "..." --output out.wav \
 - ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
 2. Phoneme Input (Native Mode)
 - ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text. Try it out!
+- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details.
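The brace notation described above can be illustrated with a minimal sketch. It assumes English words are wrapped as one brace group of space-separated CMUDict-style phonemes and Chinese characters as one brace group per tone-numbered pinyin syllable, matching the two examples in the text. The lookup tables and function names are hypothetical stand-ins, not part of VoxCPM; a real pipeline would presumably query CMUDict and a pinyin converter instead.

```python
# Tiny hand-written lookup tables standing in for CMUDict and a pinyin
# converter. These entries are illustrative assumptions, not project data.
CMUDICT_SAMPLE = {
    "hello": ["HH", "AH0", "L", "OW1"],
}
PINYIN_SAMPLE = {
    "你": "ni3",
    "好": "hao3",
}


def english_to_phoneme_text(word: str) -> str:
    """Wrap one word's phonemes in a single pair of braces."""
    phones = CMUDICT_SAMPLE[word.lower()]
    return "{" + " ".join(phones) + "}"


def chinese_to_phoneme_text(text: str) -> str:
    """Wrap each character's tone-numbered pinyin in its own braces."""
    return "".join("{" + PINYIN_SAMPLE[ch] + "}" for ch in text)


print(english_to_phoneme_text("Hello"))  # {HH AH0 L OW1}
print(chinese_to_phoneme_text("你好"))    # {ni3}{hao3}
```

The resulting strings are what would be typed as input text with "Text Normalization" turned OFF.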
 ---
 ### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)
 This is the secret sauce that gives your audio its unique sound.
+#### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
+- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
+- **For a Clean, Studio-Quality Voice:**
+  - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone.
+#### 2. Cooking au Naturel (Letting the Model Improvise)
+- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
+- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
 ---
 ### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
 You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
+#### CFG Value (How Closely to Follow the Recipe)
+- **Default**: A great starting point.
+- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
+- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
+- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
+- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.
+#### Inference Timesteps (Simmering Time: Quality vs. Speed)
+- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
+- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
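As a rough sketch, the tuning advice above could be turned into a small helper that picks starting values. The defaults (2.0 and 10) come from the `generate()` example in this README; the length thresholds, the 0.5 adjustment, and the draft step count are purely illustrative assumptions, and `suggest_settings` is a hypothetical name rather than a VoxCPM function.

```python
from dataclasses import dataclass

DEFAULT_CFG_VALUE = 2.0           # default shown in the generate() example
DEFAULT_INFERENCE_TIMESTEPS = 10  # default shown in the generate() example


@dataclass(frozen=True)
class TuningChoice:
    cfg_value: float
    inference_timesteps: int


def suggest_settings(text: str, draft: bool = False) -> TuningChoice:
    """Pick starting values from text length and a draft/final switch.

    All thresholds and step sizes below are assumptions for illustration.
    """
    cfg = DEFAULT_CFG_VALUE
    if len(text) <= 40:      # assumed cutoff for a "short sentence"
        cfg += 0.5           # raise CFG for clarity and adherence
    elif len(text) >= 400:   # assumed cutoff for a "long text"
        cfg -= 0.5           # lower CFG for stability over long passages
    # Fewer timesteps for quick drafts, the default otherwise.
    steps = 5 if draft else DEFAULT_INFERENCE_TIMESTEPS
    return TuningChoice(cfg_value=cfg, inference_timesteps=steps)


choice = suggest_settings("Hi there!")
print(choice.cfg_value, choice.inference_timesteps)  # 2.5 10
```

The two fields map onto the `cfg_value` and `inference_timesteps` keyword arguments of `model.generate(...)`. Treat the output as a starting point and adjust by ear.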
 ---
 Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
 ## 📄 License
 The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.