Commit 8f9f62d (verified), committed by zhouyx1998
1 Parent(s): 87d2254

Update README.md

Files changed (1)
  1. README.md +28 -20
README.md CHANGED
@@ -112,8 +112,8 @@ wav = model.generate(
  prompt_text=None, # optional: reference text
  cfg_value=2.0, # LM guidance on LocDiT, higher for better adherence to the prompt, but the result may be worse
  inference_timesteps=10, # LocDiT inference timesteps, higher for better results, lower for faster speed
- normalize=True, # enable external text normalization (TN) tool, but will disable native raw text support
- denoise=True, # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz
  retry_badcase=True, # enable retrying mode for some bad cases (unstoppable)
  retry_badcase_max_times=3, # maximum retrying times
  retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow-paced speech
@@ -148,14 +148,14 @@ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, desi
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
  --output out.wav \
- --denoise

  # (Optional) Voice cloning (reference audio + transcript file)
  voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
  --prompt-audio path/to/voice.wav \
  --prompt-file "/path/to/text-file" \
  --output out.wav \
- --denoise

  # 3) Batch processing (one text per line)
  voxcpm --input examples/input.txt --output-dir outs
@@ -163,7 +163,7 @@ voxcpm --input examples/input.txt --output-dir outs
  voxcpm --input examples/input.txt --output-dir outs \
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
- --denoise

  # 4) Inference parameters (quality/speed)
  voxcpm --text "..." --output out.wav \
@@ -216,29 +216,38 @@ First, choose how you’d like to input your text:.
  - ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
  2. Phoneme Input (Native Mode)
  - ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text. Try it out!

  ---
  ### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

  This is the secret sauce that gives your audio its unique sound.
- 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- - A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
- - For a Clean, Studio-Quality Voice:
- - Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone.
- 2. Cooking au Naturel (Letting the Model Improvise)
- - If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
- - Pro Tip: Challenge VoxCPM with any text (poetry, song lyrics, dramatic monologues); it may deliver some interesting results!

  ---
  ### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
  You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
- - CFG Value (How Closely to Follow the Recipe)
- - Default: A great starting point.
- - Voice sounds strained or weird? Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- - Need maximum clarity and adherence to the text? Raise it slightly to keep the model on a tighter leash.
- - Inference Timesteps (Simmering Time: Quality vs. Speed)
- - Need a quick snack? Use a lower number. Perfect for fast drafts and experiments.
- - Cooking a gourmet meal? Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.

  ---
  Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
@@ -259,4 +268,3 @@ Happy creating! 🎉 Start with the default settings and tweak from there to sui
  ## 📄 License
  The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.

-
  prompt_text=None, # optional: reference text
  cfg_value=2.0, # LM guidance on LocDiT, higher for better adherence to the prompt, but the result may be worse
  inference_timesteps=10, # LocDiT inference timesteps, higher for better results, lower for faster speed
+ normalize=False, # enable external text normalization (TN) tool, but will disable native raw text support
+ denoise=False, # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz
  retry_badcase=True, # enable retrying mode for some bad cases (unstoppable)
  retry_badcase_max_times=3, # maximum retrying times
  retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow-paced speech
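
The keyword arguments in this hunk can be collected in one place before they are passed to `model.generate`. The following is a minimal sketch, not part of VoxCPM: `GenerateOptions` is a hypothetical helper, and it mirrors only the parameters visible in this diff, using the new `normalize=False` and `denoise=False` defaults.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerateOptions:
    """Hypothetical container for the keyword arguments shown in the diff."""

    prompt_text: Optional[str] = None
    cfg_value: float = 2.0
    inference_timesteps: int = 10
    normalize: bool = False  # new default in this commit
    denoise: bool = False    # new default in this commit
    retry_badcase: bool = True
    retry_badcase_max_times: int = 3
    retry_badcase_ratio_threshold: float = 6.0

    def as_kwargs(self) -> dict:
        """Return the options as a plain dict suitable for ** unpacking."""
        return asdict(self)


# Opt back in to denoising for a noisy reference recording.
kwargs = GenerateOptions(denoise=True).as_kwargs()
print(kwargs)

# Assumption: `model` is an already loaded VoxCPM instance.
# wav = model.generate(..., **kwargs)
```

Keeping the defaults in one object makes it easier to see which settings a given run overrides.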
 
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
  --output out.wav \
+ # --denoise

  # (Optional) Voice cloning (reference audio + transcript file)
  voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
  --prompt-audio path/to/voice.wav \
  --prompt-file "/path/to/text-file" \
  --output out.wav \
+ # --denoise

  # 3) Batch processing (one text per line)
  voxcpm --input examples/input.txt --output-dir outs
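
With `--denoise` now commented out in the voice-cloning examples, the flag becomes an explicit opt-in. A script that drives the CLI could make that choice a parameter. The sketch below is an illustration under assumptions: `build_voxcpm_command` is a hypothetical helper, and it uses only flags that appear in this diff.

```python
import shlex
from typing import List, Optional


def build_voxcpm_command(
    text: str,
    output: str,
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    denoise: bool = False,
) -> List[str]:
    """Assemble a `voxcpm` argument list; `--denoise` is added only on request."""
    cmd = ["voxcpm", "--text", text, "--output", output]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio]
    if prompt_text is not None:
        cmd += ["--prompt-text", prompt_text]
    if denoise:
        cmd.append("--denoise")
    return cmd


cmd = build_voxcpm_command(
    "Hello from VoxCPM.",
    "out.wav",
    prompt_audio="path/to/voice.wav",
    prompt_text="reference transcript",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To run it (requires the voxcpm CLI to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.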
 
  voxcpm --input examples/input.txt --output-dir outs \
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
+ # --denoise

  # 4) Inference parameters (quality/speed)
  voxcpm --text "..." --output out.wav \
 
  - ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
  2. Phoneme Input (Native Mode)
  - ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text. Try it out!
+ - **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details.
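
The brace notation above can be produced mechanically once the phonemes are known. The sketch below only formats hand-written phoneme sequences; it does not look anything up in CMUDict or a pinyin converter. Both helper names are hypothetical, and joining several English words with a space between brace groups is an assumption, since the README shows a single word.

```python
from typing import Iterable, List


def english_phoneme_text(words: Iterable[List[str]]) -> str:
    """Wrap each word's CMUDict-style (ARPAbet) phonemes in one pair of braces."""
    # Assumption: brace groups for separate words are joined by a space.
    return " ".join("{" + " ".join(phonemes) + "}" for phonemes in words)


def chinese_phoneme_text(syllables: Iterable[str]) -> str:
    """Wrap each tone-numbered pinyin syllable in its own pair of braces."""
    return "".join("{" + syllable + "}" for syllable in syllables)


print(english_phoneme_text([["HH", "AH0", "L", "OW1"]]))  # {HH AH0 L OW1}
print(chinese_phoneme_text(["ni3", "hao3"]))              # {ni3}{hao3}
```

The resulting strings are meant for the phoneme input mode, with "Text Normalization" turned off.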
+

  ---
  ### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

  This is the secret sauce that gives your audio its unique sound.
+
+ #### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
+ - A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
+ - **For a Clean, Studio-Quality Voice:**
+ - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone.
+
+ #### 2. Cooking au Naturel (Letting the Model Improvise)
+ - If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
+ - **Pro Tip**: Challenge VoxCPM with any text (poetry, song lyrics, dramatic monologues); it may deliver some interesting results!

  ---
  ### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
+
  You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
+
+ #### CFG Value (How Closely to Follow the Recipe)
+ - **Default**: A great starting point.
+ - **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
+ - **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
+ - **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
+ - **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.
+
+ #### Inference Timesteps (Simmering Time: Quality vs. Speed)
+ - **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
+ - **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
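
The two tuning rules above (raise CFG for short sentences and lower it for long texts; fewer timesteps for drafts and more for final renders) can be written as a small starting-point heuristic. This is a sketch under stated assumptions: `suggest_settings` is a hypothetical helper, and the length thresholds and step sizes are illustrative placeholders, not values from VoxCPM. Only the defaults `cfg_value=2.0` and `inference_timesteps=10` come from the README.

```python
from typing import Tuple

# Illustrative placeholders, not tuned or documented values.
SHORT_TEXT_CHARS = 40
LONG_TEXT_CHARS = 400
CFG_STEP = 0.5


def suggest_settings(
    text: str,
    quality: str = "default",  # "draft", "default", or "high"
    base_cfg: float = 2.0,
    base_timesteps: int = 10,
) -> Tuple[float, int]:
    """Return a (cfg_value, inference_timesteps) starting point for `text`."""
    cfg = base_cfg
    if len(text) <= SHORT_TEXT_CHARS:
        cfg += CFG_STEP  # short sentence: favor clarity and adherence
    elif len(text) >= LONG_TEXT_CHARS:
        cfg -= CFG_STEP  # long text: favor stability and naturalness

    if quality == "draft":
        timesteps = max(1, base_timesteps // 2)  # faster, rougher
    elif quality == "high":
        timesteps = base_timesteps * 2  # slower, more refined
    else:
        timesteps = base_timesteps
    return cfg, timesteps


print(suggest_settings("Hello, world!", quality="draft"))
```

Character count is used instead of word count so the heuristic behaves the same for Chinese and English input. Treat the result as a first guess and adjust by ear.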

  ---
  Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
 
  ## 📄 License
  The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.