---
title: Conference Generator VibeVoice
emoji: ⭐
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: "5.44.1"
app_file: app.py
pinned: false
---
# Conference Generator — powered by VibeVoice
Generate realistic multi-speaker conference calls, meetings, and podcasts from a single text prompt. The app uses an LLM to write a natural-sounding script, then synthesizes long-form multi-speaker audio with Microsoft's [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) model.
**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/ACloudCenter/Conference-Generator-VibeVoice)
### Listen to the Demo
https://github.com/user-attachments/assets/cfe5397f-7aad-4662-b0b8-e62d7546a9fb
_A 3-speaker example — Wizard (Chicago), Orc (Janus), and Mom (Cherry) — generated from a single sentence prompt. Audio visualizer created separately._
---
## Features
- **Prompt-to-audio in one step** — describe the scenario ("a 4-person product meeting about pricing") and get a full generated conversation
- **1–4 speakers** with 6 distinct voice presets (male/female tagged)
- **Up to ~90 minutes of continuous speech** via VibeVoice's long-form generation
- **Editable turn-by-turn script** — tweak speaker assignments or dialogue before rendering
- **Title generation** — the LLM names each script automatically
- **Two model sizes** — VibeVoice-1.5B (fast) and VibeVoice-7B (higher quality)
- **Gender-aware voice casting** — female characters get female voices automatically (Mom → Cherry, Wizard → Chicago, etc.) with one-click override
- **Voice preview** — sample any of the 6 voices before committing to a long generation
---
## Walkthrough
### 1. Describe your scenario
Type any scenario — a meeting, podcast, argument, TED talk — and the LLM writes the full script.
### 2. Review the script and pick voices
Speaker tags auto-assign by gender. Every voice dropdown stays in sync with the tags above. Preview any voice before generating.
### 3. Generate the audio
Kick off the GPU job on Modal. A funny parody narration keeps you entertained during the wait.
### 4. Listen and download
Full-length multi-speaker audio, ready to play or download as a WAV.
---
## About VibeVoice
VibeVoice is Microsoft's open-source long-form, multi-speaker TTS model. It uses a frozen LLM backbone with acoustic + semantic tokenizers and a diffusion head to produce up to 90 minutes of natural conversational audio with up to 4 distinct speakers.
Speaker voice prompts and a plain text script are fed into the VibeVoice backbone, which streams audio chunks through per-turn diffusion heads.
### Benchmark performance
VibeVoice leads on preference, realism, and richness among long-form multi-speaker TTS models.
---
## Architecture
This project separates the lightweight Gradio frontend (hosted on HF Spaces) from the GPU-heavy model backend (hosted on [Modal](https://modal.com)).
```
┌──────────────────────┐ ┌─────────────────────────┐
│ HF Space (Gradio) │ │ Modal (GPU backend) │
│ ───────────────── │ │ ─────────────────── │
│ • Prompt UI │ ───► │ • VibeVoice-1.5B / 7B │
│ • Script editor │ │ • Voice prompt loader │
│ • Qwen2.5-Coder 32B │ │ • Long-form synthesis │
│ (script writing) │ ◄─── │ • Returns WAV bytes │
└──────────────────────┘ └─────────────────────────┘
```
- **Frontend** (`app.py`): Gradio UI, script generation via HF Inference API (Qwen2.5-Coder-32B), script parsing, playback.
- **Backend** (`backend_modal/`, not included in this repo): deployed separately on Modal as a class-based GPU service exposing `generate_podcast`.
---
## Voices
| Voice | Gender |
| --------- | :----: |
| Cherry | F |
| Chicago | M |
| Janus | M |
| Mantis | F |
| Sponge | M |
| Starchild | F |
Voice samples live in `public/voices/` and are loaded as short reference clips by the VibeVoice backend.
---
## Running locally
```bash
git clone https://github.com/Josh-E-S/vibevoice-conference-generator.git
cd vibevoice-conference-generator
pip install -r requirements.txt
# Set your HF token (used for the script-writing LLM)
export HF_TOKEN=your_hf_token_here
# Deploy the Modal backend separately (not in this repo)
# modal deploy backend_modal/modal_runner.py
python app.py
```
Required env:
- `HF_TOKEN` — Hugging Face token with Inference API access
---
## Repo layout
```
.
├── app.py # Gradio frontend + script generation
├── requirements.txt # gradio, modal, huggingface_hub
├── public/
│ ├── images/ # Banner, architecture diagram, screenshots
│ ├── voices/ # Voice reference clips (Cherry, Chicago, ...)
│ └── sample-generations/ # Example generations
├── text_examples/ # Example scripts (1p, 2p, 3p, 4p scenarios)
├── tests/ # Parser tests + example prompts
└── README.md
```
---
## Credits
- **[VibeVoice](https://github.com/microsoft/VibeVoice)** — Microsoft Research's long-form multi-speaker TTS model
- **[Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)** — script generation
- **[Modal](https://modal.com)** — GPU compute for inference
- **[Gradio](https://gradio.app)** + **[Hugging Face Spaces](https://huggingface.co/spaces)** — frontend hosting
---