frontierai committed on
Commit 72b26a4 · verified · 1 Parent(s): f599637

Update README.md

  1. README.md +3 -3
README.md CHANGED
@@ -14,6 +14,8 @@ library_name: transformers

  VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build realtime TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~300 ms** (hardware dependent).

+ [▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc)
+
  The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).

  Key features:
@@ -28,8 +30,6 @@ Key features:

  This realtime variant supports only a single speaker. For multi-speaker conversational speech generation, please use other [VibeVoice models](https://huggingface.co/collections/microsoft/vibevoice). The model is currently intended for English speech only; other languages may produce unpredictable results.

- ➡️ **Demo Video:** [Watch demo](https://github.com/user-attachments/assets/c4fb9be1-e721-41c7-9260-5890b49c1a19)
-
  ➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

  ➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)
@@ -84,7 +84,7 @@ The model achieves satisfactory performance on short-sentence benchmarks, despit

  ## Installation and Usage

- Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readme-ov-file#installation)
+ Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#installation)


  ## Responsible Usage