GGUF + pure-C++ runtime in CrispASR — VibeVoice-Realtime-0.5B (TTS)

#35

by cstr - opened 2 days ago

We've added VibeVoice-Realtime-0.5B to CrispASR as the vibevoice-tts backend. It runs as a C++ binary on GGUF weights, with no Python involved.

Pipeline: text → Base LM (4L) → TTS LM (20L) → DPM-Solver++ (20 steps) → σ-VAE decoder (3200×) → 24 kHz waveform.
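For a feel of the numbers in that pipeline, here is a small sketch of the frame arithmetic. It assumes that "3200×" is the decoder's upsampling factor, i.e. that each latent frame becomes 3200 output samples; the constant and function names are made up for illustration and are not taken from the CrispASR source.

```cpp
#include <cstdint>

// Assumption: "3200x" means 3200 output samples per latent frame.
constexpr std::int64_t kSampleRateHz    = 24000;  // 24 kHz output, from the pipeline above
constexpr std::int64_t kSamplesPerFrame = 3200;   // assumed upsampling factor

// Latent frames generated per second of audio.
constexpr double frames_per_second() {
    return static_cast<double>(kSampleRateHz) / static_cast<double>(kSamplesPerFrame);
}

// Number of waveform samples produced by n latent frames.
constexpr std::int64_t samples_for_frames(std::int64_t n_frames) {
    return n_frames * kSamplesPerFrame;
}

// Audio duration in seconds for n latent frames.
constexpr double seconds_for_frames(std::int64_t n_frames) {
    return static_cast<double>(samples_for_frames(n_frames)) /
           static_cast<double>(kSampleRateHz);
}
```

Under that assumption the model emits 7.5 latent frames per second, so each diffusion-sampled frame covers roughly 133 ms of audio.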

Voice prompts ship as pre-computed KV caches from the upstream .pt files (~2.7 MB GGUF each), so the realtime variant doesn't need a reference WAV at synthesis time — just point at a voice GGUF.

We caught two CFG/AR-loop bugs, both documented under "VibeVoice-Realtime-0.5B TTS quality regression — issue #39" in our LEARNINGS.md:

AdaLN SiLU placement (#16): the wrong activation order produced "noisy / crackling" output even though our ASR round-trip was perfect, which shows that ASR validation alone can miss perceptual defects.
r-ratio sign (#14): this sets the DPM-Solver++ step direction; one sign flip made the diffusion run backwards.

The backend also uses a dual KV cache for CFG with per-frame negative updates (cfg_scale=3.0) and an EOS classifier, sigmoid(FC1→SiLU→FC2), for automatic length detection.
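To make the CFG mix and the EOS head concrete, here is a minimal sketch. It uses the standard classifier-free guidance formula (uncond + scale · (cond − uncond)) and a plain two-layer MLP; the function names, the row-major weight layout, and the single-logit output are assumptions for illustration, not the actual CrispASR code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standard classifier-free guidance: push the conditional prediction away
// from the unconditional (negative) one. Assumes both vectors have equal size.
std::vector<float> cfg_combine(const std::vector<float>& cond,
                               const std::vector<float>& uncond,
                               float cfg_scale) {
    std::vector<float> out(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        out[i] = uncond[i] + cfg_scale * (cond[i] - uncond[i]);
    }
    return out;
}

inline float silu(float x)    { return x / (1.0f + std::exp(-x)); }
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Dense layer y = W x + b, with W stored row-major as [out_dim x in_dim].
std::vector<float> linear(const std::vector<float>& w,
                          const std::vector<float>& b,
                          const std::vector<float>& x) {
    const std::size_t in_dim = x.size();
    std::vector<float> y(b);  // start from the bias
    for (std::size_t o = 0; o < y.size(); ++o) {
        for (std::size_t i = 0; i < in_dim; ++i) {
            y[o] += w[o * in_dim + i] * x[i];
        }
    }
    return y;
}

// EOS probability as sigmoid(FC2(SiLU(FC1(hidden)))).
// Assumption: FC2 produces a single logit (b2 has size 1).
float eos_probability(const std::vector<float>& hidden,
                      const std::vector<float>& w1, const std::vector<float>& b1,
                      const std::vector<float>& w2, const std::vector<float>& b2) {
    std::vector<float> h = linear(w1, b1, hidden);
    for (float& v : h) v = silu(v);
    const std::vector<float> logit = linear(w2, b2, h);
    return sigmoid(logit[0]);
}
```

In a generation loop one would call cfg_combine with cfg_scale = 3.0 on each frame's positive and negative predictions and stop once eos_probability crosses some threshold; the threshold value is not stated in this post.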

Pre-quantised GGUFs (MIT): cstr/vibevoice-realtime-0.5b-GGUF

./build/bin/crispasr --backend vibevoice-tts \
    -m vibevoice-realtime-0.5b-q4_k.gguf \
    --tts "Hello world" --tts-output out.wav

Larger sibling with WAV-based voice cloning (no prompt baking): cstr/vibevoice-1.5b-GGUF. Companion ASR: cstr/vibevoice-asr-GGUF.

We TTS-validate every sample with an ASR round-trip — peak/RMS gates alone are not sufficient.
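As an illustration of what an ASR round-trip gate can look like, here is a sketch that scores the ASR transcript of a synthesized sample against the input text using word error rate. The normalization (ASCII lowercase, punctuation treated as whitespace) and the function names are assumptions for this example; the actual gate in CrispASR may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lowercase ASCII letters/digits, turn everything else into spaces, then split.
std::vector<std::string> normalize_words(const std::string& text) {
    std::string cleaned;
    for (unsigned char c : text) {
        cleaned += std::isalnum(c) ? static_cast<char>(std::tolower(c)) : ' ';
    }
    std::istringstream iss(cleaned);
    std::vector<std::string> words;
    for (std::string w; iss >> w;) words.push_back(w);
    return words;
}

// Word error rate: word-level Levenshtein distance / number of reference words.
double word_error_rate(const std::string& reference, const std::string& hypothesis) {
    const std::vector<std::string> ref = normalize_words(reference);
    const std::vector<std::string> hyp = normalize_words(hypothesis);
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;

    std::vector<std::size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            cur[j] = std::min({substitution, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}
```

A test harness would synthesize the text, transcribe out.wav with the companion ASR model, and fail the sample when word_error_rate exceeds a chosen limit. As the AdaLN bug above shows, a passing score says the speech is intelligible, not that it sounds clean.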
