VibeVoice-1.5B GGUF

GGUF conversion of microsoft/VibeVoice-1.5B for use with CrispASR.

This is the base model (not the streaming variant). It supports voice cloning from audio samples and multi-speaker synthesis.

Model variants

File	Quant	Size	Notes
`vibevoice-1.5b-tts-f16.gguf`	F16	5.1 GB	Unquantized 16-bit floats
`vibevoice-1.5b-tts-q8_0.gguf`	Q8_0	2.8 GB	Near-lossless
`vibevoice-1.5b-tts-q4_k.gguf`	Q4_K	1.6 GB	Smallest; passes the ASR round-trip check (see Quality)
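
To compare the variants at a glance, the listed sizes can be expressed relative to the F16 file. The sketch below is illustrative only: it uses the rounded sizes from the table, and `largest_fitting` with its size budget is a hypothetical helper, not part of CrispASR. File size is only a lower bound on runtime memory, so leave headroom when choosing.

```python
# File sizes in GB, copied from the table above (rounded values).
VARIANTS = {
    "vibevoice-1.5b-tts-f16.gguf": 5.1,
    "vibevoice-1.5b-tts-q8_0.gguf": 2.8,
    "vibevoice-1.5b-tts-q4_k.gguf": 1.6,
}

BASELINE = "vibevoice-1.5b-tts-f16.gguf"


def relative_size(name, baseline=BASELINE):
    """Size of a variant as a fraction of the baseline file."""
    return VARIANTS[name] / VARIANTS[baseline]


def largest_fitting(budget_gb):
    """Return the largest file whose size fits in budget_gb, or None."""
    candidates = [(size, name) for name, size in VARIANTS.items() if size <= budget_gb]
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    for name in VARIANTS:
        print(f"{name}: {relative_size(name):.0%} of F16")
```

For example, `largest_fitting(3.0)` selects the Q8_0 file, while a budget below 1.6 GB returns `None`.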

Usage

Voice cloning requires a reference audio file (WAV, 24 kHz, mono), passed through the `VIBEVOICE_VOICE_AUDIO` environment variable:

# Voice cloning TTS
VIBEVOICE_VOICE_AUDIO=reference_voice.wav \
crispasr --tts "Hello, how are you today?" \
    -m vibevoice-1.5b-tts-q4_k.gguf \
    --tts-output output.wav
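
Because the reference must be a 24 kHz mono WAV, it can help to check the file header before synthesis. The following is a minimal sketch using only Python's standard `wave`, `os`, and `subprocess` modules. The helper names are hypothetical, the command simply mirrors the invocation above, and it assumes `crispasr` is on `PATH`.

```python
import os
import subprocess
import wave

EXPECTED_RATE = 24000   # Hz, per the requirement stated above
EXPECTED_CHANNELS = 1   # mono


def check_reference(path):
    """Return a list of problems found in the WAV header (empty if it looks fine)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE}"
            )
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems


def build_command(text, model, output):
    """Build the same argument list as the shell example above."""
    return ["crispasr", "--tts", text, "-m", model, "--tts-output", output]


def synthesize(text, reference, model, output):
    """Validate the reference, then run crispasr with the env variable set."""
    problems = check_reference(reference)
    if problems:
        raise ValueError("; ".join(problems))
    env = dict(os.environ, VIBEVOICE_VOICE_AUDIO=reference)
    subprocess.run(build_command(text, model, output), env=env, check=True)
```

Keeping `build_command` separate from `synthesize` makes the argument list easy to inspect or log without actually launching the binary.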

Architecture

Single-LM architecture (differs from the streaming Realtime-0.5B):

LM: Qwen2.5-1.5B (d=1536, 28 layers, 12 heads, 2 KV heads)
Prediction head: 4 AdaLN + SwiGLU layers (d=1536)
Acoustic encoder: 7-stage ConvNeXt (3200x downsample from 24 kHz)
Semantic encoder: same architecture, 128-dim latent
Decoder: 7-stage transposed ConvNeXt (3200x upsample)
DPM-Solver++: 20-step, cosine schedule, v-prediction
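
A few derived quantities follow directly from the figures listed above. The sketch below only does the arithmetic. The names are illustrative, and the per-head dimension assumes the model width is split evenly across attention heads, which is the usual convention but is not stated in this card.

```python
import math

# Values taken from the architecture list above.
SAMPLE_RATE_HZ = 24_000
DOWNSAMPLE = 3200
D_MODEL = 1536
N_HEADS = 12
N_KV_HEADS = 2

# One latent frame covers 3200 audio samples.
frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE

# Assumption: width is divided evenly across heads.
head_dim = D_MODEL // N_HEADS

# Grouped-query attention: query heads sharing each KV head.
queries_per_kv = N_HEADS // N_KV_HEADS


def latent_frames(seconds):
    """Number of latent frames needed to cover `seconds` of 24 kHz audio."""
    return math.ceil(seconds * SAMPLE_RATE_HZ / DOWNSAMPLE)


if __name__ == "__main__":
    print(frame_rate_hz, head_dim, queries_per_kv, latent_frames(10))
```

With these numbers the latent sequence runs at 7.5 frames per second, so ten seconds of audio corresponds to 75 latent frames.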

Quality

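Round-trip check: the synthesized audio is transcribed with Parakeet ASR and the transcript is compared with the input text.

Input text	Parakeet ASR transcript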
"Hello, how are you today?"	"Hello, how are you today?"
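
The comparison step of such a round-trip check can be sketched in a few lines. This is an illustration under stated assumptions: the normalization rules (case folding, punctuation stripping, whitespace collapsing) are chosen here for convenience and are not necessarily those used to produce the table, and running the ASR model itself is left out.

```python
import re
import string

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_STRIP_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def round_trip_matches(input_text, transcript, strict=False):
    """Compare the TTS input with the ASR transcript.

    strict=True requires an exact string match; otherwise both sides
    are normalized first (an assumption made for this sketch).
    """
    if strict:
        return input_text == transcript
    return normalize(input_text) == normalize(transcript)
```

The row in the table above would pass even with `strict=True`, since the transcript matches the input character for character.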

Differences from Realtime-0.5B

Feature	Realtime-0.5B	1.5B Base
Architecture	4L base + 20L TTS LM	Single 28L LM
Voice input	Pre-computed .pt prompts	Audio WAV files
Voice cloning	No (fixed presets)	Yes (from reference audio)
Multi-speaker	No	Yes (up to 4 speakers)
Streaming	Yes	No

License

MIT (same as original model).
