# sarvam-30b-GGUF
GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.
Note: This model requires a custom build of llama.cpp with `sarvam_moe` architecture support. The easiest way to run it is using Uno.cpp (see below). Alternatively, see PR #20275 or build from the `add-sarvam-moe` branch.
## Easiest Way to Run: Uno.cpp
Uno.cpp (Un-official llama.cpp) is a ready-to-use Windows application that lets you run this model locally with zero setup: no terminal, no Python, no build tools required.
### Quick Start
1. Download the installer from Uno.cpp Releases and install it
2. Download a model file from this repo (see quantizations below)
3. Launch Uno.cpp from your desktop, select the `.gguf` file, and chat!
### What you get
- GUI launcher with model file picker and configurable settings (GPU layers, context size)
- Built-in chat UI that opens in your browser
- CUDA-accelerated inference for NVIDIA GPUs
- Remembers your settings between sessions
- OpenAI-compatible API at `http://127.0.0.1:8080/v1/chat/completions`
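The local API endpoint can be exercised with a short script. Below is a minimal sketch using only the Python standard library; it assumes a server is already running on port 8080, and the `model` name and `max_tokens` value are illustrative (llama-server serves whichever single model you loaded, regardless of the name sent):

```python
# Minimal sketch: call the local OpenAI-compatible chat endpoint.
# Assumes Uno.cpp / llama-server is already listening on 127.0.0.1:8080.
import json
import urllib.request

def build_payload(prompt, max_tokens=256):
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": "sarvam-30b",  # informational only for a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://127.0.0.1:8080/v1"):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Hello, how are you?")` once the server is up; any OpenAI-compatible client library pointed at the same base URL works too.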
## Available Quantizations
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |
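The file sizes follow roughly from parameter count × bits per weight. A quick sanity-check sketch (the ~30B figure comes from the model details; the estimates land a few GB below the listed sizes because embeddings and per-tensor metadata are not captured by the average BPW):

```python
# Rough GGUF file-size estimate: parameters * bits-per-weight / 8 bits-per-byte.
PARAMS = 30e9  # ~30B total parameters

def approx_size_gb(bpw, params=PARAMS):
    """Approximate file size in decimal GB for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

for quant, bpw in [("BF16", 16.00), ("Q8_0", 8.50), ("Q6_K", 6.57), ("Q4_K_M", 4.87)]:
    print(f"{quant}: ~{approx_size_gb(bpw):.0f} GB")
```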
## Model Details
- Architecture: `SarvamMoEForCausalLM` (extension of `BailingMoeForCausalLM`)
- Parameters: ~30B total
- Layers: 19 (1 dense FFN + 18 MoE)
- Experts: 128 routed (top-6 routing) + 1 shared expert
- Gating: Sigmoid with zero-mean normalized expert bias, `routed_scaling_factor=2.5`
- Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
- Activation: SwiGLU
- Normalization: RMSNorm (eps=1e-6)
- Vocab size: 262,144
- Context length: 4,096 (base)
- RoPE theta: 8,000,000
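One practical consequence of the GQA configuration above is a small KV cache. A back-of-the-envelope sketch, assuming an FP16 K/V cache (llama.cpp's default cache type):

```python
# KV-cache size from the attention geometry listed above:
# per token we store K and V for every layer,
# i.e. 2 * n_layers * n_kv_heads * head_dim values.
N_LAYERS = 19
N_KV_HEADS = 4
HEAD_DIM = 64
BYTES_PER_VALUE = 2  # FP16

def kv_cache_bytes(n_tokens):
    """Total KV-cache size in bytes for a given number of cached tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return n_tokens * per_token

print(f"Full 4,096-token context: {kv_cache_bytes(4096) / 1024**2:.0f} MiB")
```

At the full 4,096-token base context this works out to about 76 MiB, so VRAM planning is dominated by the weights rather than the cache.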
## Usage

### Using Uno.cpp (Recommended for non-technical users)
Download the installer from Uno.cpp Releases, install, launch, and pick your model file. That's it.
### Using llama.cpp CLI (For advanced users)

Requires building from the `add-sarvam-moe` branch:
```bash
# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 4096
```
## VRAM Requirements
| Quant | VRAM for Full Offload | Partial Offload Notes |
|---|---|---|
| Q4_K_M | ~19 GB | All layers fit on a 24GB card |
| Q6_K | ~26 GB | All layers fit on a 32GB card |
| Q8_0 | ~34 GB | ~70% of layers fit on a 32GB card |
| BF16 | ~64 GB | ~50% of layers fit on a 32GB card |
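When the whole model doesn't fit, `-ngl` controls how many layers go to the GPU. A rough planner sketch, assuming the file splits evenly across the 19 layers and a fixed VRAM overhead (the overhead constant is a placeholder; real KV cache, CUDA buffers, and embedding tensors mean the practical layer count is somewhat lower than this estimate):

```python
# Rough -ngl planner: even split of the model file across layers,
# minus a flat VRAM overhead allowance. Heuristic only.
import math

N_LAYERS = 19

def layers_on_gpu(model_gb, vram_gb, overhead_gb=2.0):
    """Estimate how many of the 19 layers fit in the given VRAM budget."""
    per_layer_gb = model_gb / N_LAYERS
    budget_gb = max(vram_gb - overhead_gb, 0)
    return min(N_LAYERS, math.floor(budget_gb / per_layer_gb))

for quant, size_gb in [("Q6_K", 26), ("Q8_0", 34), ("BF16", 64)]:
    n = layers_on_gpu(size_gb, vram_gb=32)
    print(f"{quant}: -ngl {n} ({100 * n // N_LAYERS}% of layers)")
```

Start from the estimate and reduce `-ngl` if you hit out-of-memory errors at load or during generation.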
No NVIDIA GPU? You can still run these models on CPU only: set GPU Layers to `0` in Uno.cpp or pass `-ngl 0` to the CLI. It will be significantly slower, but it works.
## Tested On
- NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
- All quantizations produce coherent output
## Credits
- Base model: sarvamai/sarvam-30b