---
title: Conference Generator VibeVoice
emoji: ⭐
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: "5.44.1"
app_file: app.py
pinned: false
---

VibeVoice Conference Generator

# Conference Generator — powered by VibeVoice

Generate realistic multi-speaker conference calls, meetings, and podcasts from a single text prompt. The app uses an LLM to write a natural-sounding script, then synthesizes long-form multi-speaker audio with Microsoft's [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) model.

**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/ACloudCenter/Conference-Generator-VibeVoice)

### Listen to the Demo

https://github.com/user-attachments/assets/cfe5397f-7aad-4662-b0b8-e62d7546a9fb

_A 3-speaker example — Wizard (Chicago), Orc (Janus), and Mom (Cherry) — generated from a single-sentence prompt. Audio visualizer created separately._

---

## Features

- **Prompt-to-audio in one step** — describe the scenario ("a 4-person product meeting about pricing") and get a full generated conversation
- **1–4 speakers** with 6 distinct voice presets (male/female tagged)
- **Up to ~90 minutes of continuous speech** via VibeVoice's long-form generation
- **Editable turn-by-turn script** — tweak speaker assignments or dialogue before rendering
- **Title generation** — the LLM names each script automatically
- **Two model sizes** — VibeVoice-1.5B (fast) and VibeVoice-7B (higher quality)
- **Gender-aware voice casting** — female characters get female voices automatically (Mom → Cherry, Wizard → Chicago, etc.), with one-click override
- **Voice preview** — sample any of the 6 voices before committing to a long generation

---

## Walkthrough

### 1. Describe your scenario

Type any scenario — a meeting, podcast, argument, TED talk — and the LLM writes the full script.

Step 1: Prompt input
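Under the hood, the script-writing step is a single chat-completion request to the LLM. A minimal sketch of how the request could be assembled; the `SCRIPT_FORMAT` instruction and `build_script_messages` helper are hypothetical, since the exact prompt in `app.py` is not reproduced here:

```python
import os

# Hypothetical system instruction; the real prompt in app.py may differ.
SCRIPT_FORMAT = (
    "Write a conversation as speaker turns, one per line, "
    "in the form 'Speaker N: dialogue'. Use at most {n} speakers."
)

def build_script_messages(scenario: str, n_speakers: int) -> list[dict]:
    """Build the chat request that asks the LLM to write the script."""
    return [
        {"role": "system", "content": SCRIPT_FORMAT.format(n=n_speakers)},
        {"role": "user", "content": scenario},
    ]

messages = build_script_messages("a 4-person product meeting about pricing", 4)

# The actual call goes through the HF Inference API, along these lines:
# from huggingface_hub import InferenceClient
# client = InferenceClient("Qwen/Qwen2.5-Coder-32B-Instruct",
#                          token=os.environ["HF_TOKEN"])
# script = client.chat_completion(messages).choices[0].message.content
```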

### 2. Review the script and pick voices

Speaker tags are auto-assigned by gender, and every voice dropdown stays in sync with the tags above. Preview any voice before generating.

Step 2: Script editor with voice sync
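Keeping the dropdowns in sync comes down to two small pieces: parsing the LLM's script into per-speaker turns, and casting gender-matched voices from the preset pool (see the Voices table below). A minimal sketch; the actual parser in `app.py` may differ, and the function names here are illustrative:

```python
import re

# Voice presets and gender tags, as listed in the Voices table.
VOICES = {"Cherry": "F", "Chicago": "M", "Janus": "M",
          "Mantis": "F", "Sponge": "M", "Starchild": "F"}

TURN_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")

def parse_script(text: str) -> list[tuple[int, str]]:
    """Split a script into (speaker_number, dialogue) turns, skipping blanks."""
    turns = []
    for line in text.splitlines():
        m = TURN_RE.match(line.strip())
        if m:
            turns.append((int(m.group(1)), m.group(2)))
    return turns

def cast_voices(genders: list[str]) -> list[str]:
    """Give each speaker (tagged 'M' or 'F') a distinct gender-matched voice."""
    pool = {g: [v for v, vg in VOICES.items() if vg == g] for g in ("M", "F")}
    return [pool[g].pop(0) for g in genders]

script = "Speaker 1: Welcome, everyone.\nSpeaker 2: Thanks for having me."
print(parse_script(script))    # [(1, 'Welcome, everyone.'), (2, 'Thanks for having me.')]
print(cast_voices(["F", "M"]))  # ['Cherry', 'Chicago']
```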

### 3. Generate the audio

Kick off the GPU job on Modal. A funny parody narration keeps you entertained during the wait.

Step 3: Generating
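Gradio handles long waits like this well when the event handler is written as a generator: each `yield` pushes a fresh status update to the UI while the remote Modal call runs. A sketch of how the waiting-room narration could be driven; the parody lines below are invented for illustration, not the app's actual text:

```python
import itertools

# Invented placeholder narration; the app's real lines are not reproduced here.
PARODY_LINES = [
    "Warming up the vocal cords of six imaginary people...",
    "Negotiating speaking time between Speaker 1 and Speaker 2...",
    "Rendering awkward pauses at studio quality...",
]

def status_stream(n_updates: int):
    """Generator-style handler: each yield becomes a UI status update
    while the long-running synthesis call proceeds."""
    for i, line in zip(range(n_updates), itertools.cycle(PARODY_LINES)):
        yield f"[{i + 1}/{n_updates}] {line}"

for update in status_stream(4):
    print(update)
```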

### 4. Listen and download

Full-length multi-speaker audio, ready to play or download as a WAV.

Step 4: Complete
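The backend hands back ready-made WAV bytes, so the frontend only has to save them. If you ever need to wrap raw PCM samples in a WAV container yourself, Python's stdlib `wave` module is enough. A sketch assuming 16-bit, 24 kHz, mono; the backend's actual sample format may differ:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container.
    24 kHz mono is an assumption, not the backend's documented format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)  # one second of silence
print(wav_bytes[:4])  # b'RIFF'
```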

---

## About VibeVoice

VibeVoice is Microsoft's open-source long-form, multi-speaker TTS model. It uses a frozen LLM backbone with acoustic + semantic tokenizers and a diffusion head to produce up to 90 minutes of natural conversational audio with up to 4 distinct speakers.

VibeVoice architecture

Speaker voice prompts and a plain text script are fed into the VibeVoice backbone, which streams audio chunks through per-turn diffusion heads.

### Benchmark performance

VibeVoice benchmark comparison

VibeVoice leads on preference, realism, and richness among long-form multi-speaker TTS models.

---

## Architecture

This project separates the lightweight Gradio frontend (hosted on HF Spaces) from the GPU-heavy model backend (hosted on [Modal](https://modal.com)).

```
┌──────────────────────┐        ┌─────────────────────────┐
│  HF Space (Gradio)   │        │   Modal (GPU backend)   │
│  ─────────────────   │        │   ───────────────────   │
│  • Prompt UI         │  ───►  │  • VibeVoice-1.5B / 7B  │
│  • Script editor     │        │  • Voice prompt loader  │
│  • Qwen2.5-Coder 32B │        │  • Long-form synthesis  │
│    (script writing)  │  ◄───  │  • Returns WAV bytes    │
└──────────────────────┘        └─────────────────────────┘
```

- **Frontend** (`app.py`): Gradio UI, script generation via HF Inference API (Qwen2.5-Coder-32B), script parsing, playback.
- **Backend** (`backend_modal/`, not included in this repo): deployed separately on Modal as a class-based GPU service exposing `generate_podcast`.

---

## Voices

| Voice     | Gender |
| --------- | :----: |
| Cherry    |   F    |
| Chicago   |   M    |
| Janus     |   M    |
| Mantis    |   F    |
| Sponge    |   M    |
| Starchild |   F    |

Voice samples live in `public/voices/` and are loaded as short reference clips by the VibeVoice backend.

---

## Running locally

```bash
git clone https://github.com/Josh-E-S/vibevoice-conference-generator.git
cd vibevoice-conference-generator
pip install -r requirements.txt

# Set your HF token (used for the script-writing LLM)
export HF_TOKEN=your_hf_token_here

# Deploy the Modal backend separately (not in this repo)
# modal deploy backend_modal/modal_runner.py

python app.py
```

Required env:

- `HF_TOKEN` — Hugging Face token with Inference API access

---

## Repo layout

```
.
├── app.py                   # Gradio frontend + script generation
├── requirements.txt         # gradio, modal, huggingface_hub
├── public/
│   ├── images/              # Banner, architecture diagram, screenshots
│   ├── voices/              # Voice reference clips (Cherry, Chicago, ...)
│   └── sample-generations/  # Example generations
├── text_examples/           # Example scripts (1p, 2p, 3p, 4p scenarios)
├── tests/                   # Parser tests + example prompts
└── README.md
```

---

## Credits

- **[VibeVoice](https://github.com/microsoft/VibeVoice)** — Microsoft Research's long-form multi-speaker TTS model
- **[Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)** — script generation
- **[Modal](https://modal.com)** — GPU compute for inference
- **[Gradio](https://gradio.app)** + **[Hugging Face Spaces](https://huggingface.co/spaces)** — frontend hosting

---
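Because the backend is a separate Modal deployment, the frontend can reach it by name at call time. A hedged sketch using Modal's `Cls.from_name` lookup; the app name (`vibevoice-backend`), class name (`VibeVoiceRunner`), and `generate_podcast` keyword arguments below are assumptions, since the backend code is not in this repo:

```python
import modal

def generate_audio(script: str, voices: list[str], model_size: str = "1.5B") -> bytes:
    """Call the class-based GPU service deployed on Modal; returns WAV bytes.
    App/class names and the method signature are hypothetical."""
    Runner = modal.Cls.from_name("vibevoice-backend", "VibeVoiceRunner")
    return Runner().generate_podcast.remote(
        script=script, voices=voices, model_size=model_size
    )
```

Note that this lookup requires the backend to already be deployed (`modal deploy`) under the same Modal account; the call blocks until the GPU job finishes and the WAV bytes come back.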