microsoft/VibeVoice-Realtime-0.5B

@@ -14,6 +14,8 @@ library_name: transformers
 VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build realtime TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~300 ms** (hardware dependent).
+[▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc)
 The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).
 Key features:
@@ -28,8 +30,6 @@ Key features:
 This realtime variant supports only a single speaker. For multi-speaker conversational speech generation, please use other [VibeVoice models](https://huggingface.co/collections/microsoft/vibevoice). The model is currently intended for English speech only; other languages may produce unpredictable results.
-➡️ **Demo Video:** [Watch demo](https://github.com/user-attachments/assets/c4fb9be1-e721-41c7-9260-5890b49c1a19)
 ➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)
 ➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)
@@ -84,7 +84,7 @@ The model achieves satisfactory performance on short-sentence benchmarks, despit
 ## Installation and Usage
-Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readme-ov-file#installation)
+Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#installation)
 ## Responsible Usage