# sarvam-30b-GGUF
GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.
Note: This model requires a custom build of llama.cpp with `sarvam_moe` architecture support. The easiest way to run it is using Uno.cpp (see below). Alternatively, see PR #20275 or build from the `add-sarvam-moe` branch.
## Easiest Way to Run: Uno.cpp
Uno.cpp (Un-official llama.cpp) is a ready-to-use Windows application that lets you run this model locally with zero setup: no terminal, no Python, no build tools required.
### Quick Start
1. Download the installer from Uno.cpp Releases and install it
2. Download a model file from this repo (see quantizations below)
3. Launch Uno.cpp from your desktop, select the `.gguf` file, and chat!
### What you get
- GUI launcher with model file picker and configurable settings (GPU layers, context size)
- Built-in chat UI that opens in your browser
- CUDA-accelerated inference for NVIDIA GPUs
- Remembers your settings between sessions
- OpenAI-compatible API at `http://127.0.0.1:8080/v1/chat/completions`
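The local API endpoint can be exercised with a short script. Below is a minimal sketch using only the Python standard library; it assumes a server is already running on port 8080, and the `model` name and `max_tokens` value are illustrative (llama-server serves whichever single model you loaded, regardless of the name sent):

```python
# Minimal sketch: call the local OpenAI-compatible chat endpoint.
# Assumes Uno.cpp / llama-server is already listening on 127.0.0.1:8080.
import json
import urllib.request

def build_payload(prompt, max_tokens=256):
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": "sarvam-30b",  # informational only for a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://127.0.0.1:8080/v1"):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Hello, how are you?")` once the server is up; any OpenAI-compatible client library pointed at the same base URL works too.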
## Available Quantizations
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |
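The file sizes follow roughly from parameter count × bits per weight. A quick sanity-check sketch (the ~30B figure comes from the model details; the estimates land a few GB below the listed sizes because embeddings and per-tensor metadata are not captured by the average BPW):

```python
# Rough GGUF file-size estimate: parameters * bits-per-weight / 8 bits-per-byte.
PARAMS = 30e9  # ~30B total parameters

def approx_size_gb(bpw, params=PARAMS):
    """Approximate file size in decimal GB for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

for quant, bpw in [("BF16", 16.00), ("Q8_0", 8.50), ("Q6_K", 6.57), ("Q4_K_M", 4.87)]:
    print(f"{quant}: ~{approx_size_gb(bpw):.0f} GB")
```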
## Model Details
- Architecture: `SarvamMoEForCausalLM` (extension of `BailingMoeForCausalLM`)
- Parameters: ~30B total
- Layers: 19 (1 dense FFN + 18 MoE)
- Experts: 128 routed (top-6 routing) + 1 shared expert
- Gating: Sigmoid with zero-mean normalized expert bias, `routed_scaling_factor=2.5`
- Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
- Activation: SwiGLU
- Normalization: RMSNorm (eps=1e-6)
- Vocab size: 262,144
- Context length: 4,096 (base)
- RoPE theta: 8,000,000
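One practical consequence of the GQA configuration above is a small KV cache. A back-of-the-envelope sketch, assuming an FP16 K/V cache (llama.cpp's default cache type):

```python
# KV-cache size from the attention geometry listed above:
# per token we store K and V for every layer,
# i.e. 2 * n_layers * n_kv_heads * head_dim values.
N_LAYERS = 19
N_KV_HEADS = 4
HEAD_DIM = 64
BYTES_PER_VALUE = 2  # FP16

def kv_cache_bytes(n_tokens):
    """Total KV-cache size in bytes for a given number of cached tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return n_tokens * per_token

print(f"Full 4,096-token context: {kv_cache_bytes(4096) / 1024**2:.0f} MiB")
```

At the full 4,096-token base context this works out to about 76 MiB, so VRAM planning is dominated by the weights rather than the cache.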
## Usage

### Using Uno.cpp (Recommended for non-technical users)
Download the installer from Uno.cpp Releases, install, launch, and pick your model file. That's it.
### Using llama.cpp CLI (For advanced users)

Requires building from the `add-sarvam-moe` branch:
```bash
# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 4096
```
## VRAM Requirements
| Quant | VRAM for Full Offload | Partial Offload Notes |
|---|---|---|
| Q4_K_M | ~19 GB | All layers fit on a 24GB card |
| Q6_K | ~26 GB | All layers fit on a 32GB card |
| Q8_0 | ~34 GB | ~70% of layers fit on a 32GB card |
| BF16 | ~64 GB | ~50% of layers fit on a 32GB card |
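When the whole model doesn't fit, `-ngl` controls how many layers go to the GPU. A rough planner sketch, assuming the file splits evenly across the 19 layers and a fixed VRAM overhead (the overhead constant is a placeholder; real KV cache, CUDA buffers, and embedding tensors mean the practical layer count is somewhat lower than this estimate):

```python
# Rough -ngl planner: even split of the model file across layers,
# minus a flat VRAM overhead allowance. Heuristic only.
import math

N_LAYERS = 19

def layers_on_gpu(model_gb, vram_gb, overhead_gb=2.0):
    """Estimate how many of the 19 layers fit in the given VRAM budget."""
    per_layer_gb = model_gb / N_LAYERS
    budget_gb = max(vram_gb - overhead_gb, 0)
    return min(N_LAYERS, math.floor(budget_gb / per_layer_gb))

for quant, size_gb in [("Q6_K", 26), ("Q8_0", 34), ("BF16", 64)]:
    n = layers_on_gpu(size_gb, vram_gb=32)
    print(f"{quant}: -ngl {n} ({100 * n // N_LAYERS}% of layers)")
```

Start from the estimate and reduce `-ngl` if you hit out-of-memory errors at load or during generation.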
No NVIDIA GPU? You can still run these models on CPU only: set GPU Layers to `0` in Uno.cpp or pass `-ngl 0` to the CLI. It will be significantly slower, but it works.
## Tested On
- NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
- All quantizations produce coherent output
## Credits
- Base model: sarvamai/sarvam-30b