GGUF + pure-C++ runtime in CrispASR — VibeVoice-Realtime-0.5B (TTS)
We've added VibeVoice-Realtime-0.5B to CrispASR as the vibevoice-tts backend. It runs from a C++ binary on GGUF weights, with no Python.
Pipeline: text → Base LM (4L) → TTS LM (20L) → DPM-Solver++ (20 steps) → σ-VAE decoder (3200×) → 24 kHz waveform.
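For a rough sense of the frame rate this pipeline implies, here is a small arithmetic sketch. It reads "3200×" as the number of output samples the decoder produces per latent frame, and it assumes the 20 solver steps run once per latent frame; both readings are assumptions, and the names are illustrative rather than CrispASR identifiers.

```cpp
#include <cassert>
#include <cmath>

// Figures taken from the pipeline description.
constexpr int kSampleRateHz = 24000;       // output waveform rate
constexpr int kSamplesPerFrame = 3200;     // assumed meaning of "3200x"
constexpr int kSolverStepsPerFrame = 20;   // assumption: steps run per latent frame

// Latent frames generated per second of audio.
constexpr double frames_per_second() {
    return static_cast<double>(kSampleRateHz) / kSamplesPerFrame;
}

// Audio duration covered by one latent frame, in milliseconds.
constexpr double frame_duration_ms() {
    return 1000.0 * kSamplesPerFrame / kSampleRateHz;
}

// Solver steps needed per second of audio (before any CFG doubling).
constexpr double solver_steps_per_second() {
    return frames_per_second() * kSolverStepsPerFrame;
}

// Latent frames needed to cover `seconds` of audio, rounded up.
inline long frames_for_seconds(double seconds) {
    assert(seconds >= 0.0);
    return static_cast<long>(std::ceil(seconds * frames_per_second()));
}
```

Under these assumptions, one second of audio corresponds to 7.5 latent frames, each covering about 133 ms.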
Voice prompts ship as pre-computed KV caches from the upstream .pt files (~2.7 MB GGUF each), so the realtime variant doesn't need a reference WAV at synthesis time — just point at a voice GGUF.
Two CFG/AR-loop bugs we caught and documented under "VibeVoice-Realtime-0.5B TTS quality regression — issue #39" in our LEARNINGS.md:
- AdaLN SiLU placement (#16): the wrong activation order produced "noisy / crackling" output even though our ASR round-trip was perfect, which shows that an ASR check can pass while perceptual quality is still broken.
- r-ratio sign (#14): this sets the DPM-Solver++ step direction, and a single sign flip made the diffusion run backwards.
The backend also uses a dual KV cache for CFG with per-frame negative updates (cfg_scale=3.0), and an EOS classifier, sigmoid(FC1→SiLU→FC2), for automatic length detection.
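To make the CFG and EOS pieces concrete, here is a minimal sketch, not the CrispASR implementation. It assumes the standard classifier-free guidance combination (unconditional plus scale times the difference) and a two-layer EOS head that outputs a single logit; `cfg_mix`, `EosHead`, and `eos_probability` are hypothetical names, and the weight layout is chosen for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard CFG mix: start from the unconditional (negative) prediction and
// move toward the conditional one. scale = 1 returns `cond` unchanged.
std::vector<float> cfg_mix(const std::vector<float>& cond,
                           const std::vector<float>& uncond,
                           float scale) {
    assert(cond.size() == uncond.size());
    std::vector<float> out(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i)
        out[i] = uncond[i] + scale * (cond[i] - uncond[i]);
    return out;
}

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Hypothetical EOS head: FC1 (in_dim -> hidden_dim), SiLU, FC2 (hidden_dim -> 1).
struct EosHead {
    std::size_t in_dim;
    std::size_t hidden_dim;
    std::vector<float> w1;  // row-major, hidden_dim x in_dim
    std::vector<float> b1;  // hidden_dim
    std::vector<float> w2;  // hidden_dim
    float b2;
};

// Returns sigmoid(FC2(SiLU(FC1(h)))), the probability that speech has ended.
float eos_probability(const EosHead& head, const std::vector<float>& h) {
    assert(h.size() == head.in_dim);
    float logit = head.b2;
    for (std::size_t j = 0; j < head.hidden_dim; ++j) {
        float a = head.b1[j];
        for (std::size_t i = 0; i < head.in_dim; ++i)
            a += head.w1[j * head.in_dim + i] * h[i];
        logit += head.w2[j] * silu(a);
    }
    return sigmoid(logit);
}
```

In a generation loop, the conditional and negative predictions would each come from their own KV cache, and generation would stop once the EOS probability crosses a chosen threshold; the threshold value is not stated here.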
Pre-quantised GGUFs (MIT): cstr/vibevoice-realtime-0.5b-GGUF
    ./build/bin/crispasr --backend vibevoice-tts \
        -m vibevoice-realtime-0.5b-q4_k.gguf \
        --tts "Hello world" --tts-output out.wav
Larger sibling with WAV-based voice cloning (no prompt baking): cstr/vibevoice-1.5b-GGUF. Companion ASR: cstr/vibevoice-asr-GGUF.
We validate every TTS sample with an ASR round-trip; peak/RMS gates alone are not sufficient.
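One way to score such a round-trip is word error rate between the prompt text and the ASR transcript of the generated audio. The sketch below is an illustration and not our actual validation code: the normalisation rules, the function names, and the `max_wer` threshold are all assumptions, and producing the transcript (running an ASR backend on the WAV) is left out.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lowercase, drop apostrophes, turn other punctuation into spaces, then split.
std::vector<std::string> normalize_words(const std::string& text) {
    std::string cleaned;
    for (unsigned char c : text) {
        if (std::isalnum(c)) cleaned += static_cast<char>(std::tolower(c));
        else if (c != '\'') cleaned += ' ';
    }
    std::istringstream stream(cleaned);
    std::vector<std::string> words;
    for (std::string w; stream >> w;) words.push_back(w);
    return words;
}

// Word-level Levenshtein distance divided by the reference length.
double word_error_rate(const std::string& reference, const std::string& hypothesis) {
    const std::vector<std::string> ref = normalize_words(reference);
    const std::vector<std::string> hyp = normalize_words(hypothesis);
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;

    std::vector<std::size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            cur[j] = std::min({substitution, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}

// Hypothetical gate: accept the sample when WER stays at or below max_wer.
bool passes_round_trip(const std::string& prompt, const std::string& transcript,
                       double max_wer) {
    return word_error_rate(prompt, transcript) <= max_wer;
}
```

A gate like this would sit alongside the peak/RMS checks rather than replace them, since each catches a different class of failure.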