Update README.md
README.md
CHANGED
@@ -14,6 +14,8 @@ library_name: transformers
 
 VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build realtime TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~300 ms** (hardware dependent).
 
+[▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc)
+
 The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).
 
 Key features:
@@ -28,8 +30,6 @@ Key features:
 
 This realtime variant supports only a single speaker. For multi-speaker conversational speech generation, please use other [VibeVoice models](https://huggingface.co/collections/microsoft/vibevoice). The model is currently intended for English speech only; other languages may produce unpredictable results.
 
-➡️ **Demo Video:** [Watch demo](https://github.com/user-attachments/assets/c4fb9be1-e721-41c7-9260-5890b49c1a19)
-
 ➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)
 
 ➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)
@@ -84,7 +84,7 @@ The model achieves satisfactory performance on short-sentence benchmarks, despit
 
 ## Installation and Usage
 
-Please refer to [GitHub README](https://github.com/microsoft/VibeVoice
+Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#installation)
 
 
 ## Responsible Usage