sarvam-30b-GGUF

GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.

Note: This model requires a custom build of llama.cpp with sarvam_moe architecture support. The easiest way to run it is using Uno.cpp (see below). Alternatively, see PR #20275 or build from the add-sarvam-moe branch.

🚀 Easiest Way to Run: Uno.cpp

Uno.cpp (Un-official llama.cpp) is a ready-to-use Windows application that lets you run this model locally with zero setup: no terminal, no Python, no build tools required.

Quick Start

  1. Download the installer from Uno.cpp Releases and install it
  2. Download a model file from this repo (see quantizations below)
  3. Launch Uno.cpp from your desktop → select the .gguf file → chat!

What you get

  • GUI launcher with model file picker and configurable settings (GPU layers, context size)
  • Built-in chat UI that opens in your browser
  • CUDA-accelerated inference for NVIDIA GPUs
  • Remembers your settings between sessions
  • OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
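Since the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. Below is a minimal standard-library sketch; the `model` value is illustrative (llama.cpp-based servers generally serve whichever model was loaded, regardless of this field):

```python
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"  # Uno.cpp's default endpoint

payload = {
    "model": "sarvam-30b",  # illustrative; the server ignores or echoes this
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 128,
}

def chat(body: dict, url: str = URL, timeout: float = 120.0) -> str:
    """POST an OpenAI-style chat request and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With Uno.cpp running and a model loaded, `chat(payload)` returns the assistant's reply as a string.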

Available Quantizations

| File | Quant | Size | BPW | Description |
|------|-------|------|-----|-------------|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest-quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32 GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |
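As a sanity check, the listed file sizes follow directly from bits-per-weight times parameter count. The calculation below assumes roughly 32 × 10⁹ weights (an assumption; the "30B" in the name is rounded down) and ignores small per-tensor metadata overhead:

```python
# File size ≈ parameter count × bits-per-weight / 8
params = 32e9  # assumed weight count; "30B" in the model name is rounded
bpw = {"Q4_K_M": 4.87, "Q6_K": 6.57, "Q8_0": 8.50, "BF16": 16.00}

sizes_gb = {quant: params * bits / 8 / 1e9 for quant, bits in bpw.items()}
for quant, gb in sizes_gb.items():
    print(f"{quant}: ~{gb:.0f} GB")
```

The results line up with the table above, which is a quick way to verify a download completed without truncation.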

Model Details

  • Architecture: SarvamMoEForCausalLM (extension of BailingMoeForCausalLM)
  • Parameters: ~30B total
  • Layers: 19 (1 dense FFN + 18 MoE)
  • Experts: 128 routed (top-6 routing) + 1 shared expert
  • Gating: Sigmoid with zero-mean normalized expert bias, routed_scaling_factor=2.5
  • Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
  • Activation: SwiGLU
  • Normalization: RMSNorm (eps=1e-6)
  • Vocab size: 262,144
  • Context length: 4,096 (base)
  • RoPE theta: 8,000,000
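The sigmoid gating described above can be sketched in a few lines of plain Python. This is an illustration, not the reference implementation; in particular, using the zero-mean expert bias only for top-k selection (while mixing weights come from the raw scores, as in similar MoE designs) is an assumption:

```python
import math
import random

n_experts, top_k, routed_scaling_factor = 128, 6, 2.5

random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(n_experts)]
expert_bias = [0.0] * n_experts  # zero-mean load-balancing bias (zeros here for simplicity)

# Per-expert sigmoid scores; the bias influences which experts are picked,
# but the mixing weights are computed from the raw scores.
scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
chosen = sorted(range(n_experts), key=lambda i: scores[i] + expert_bias[i], reverse=True)[:top_k]

# Normalize over the chosen experts, then apply routed_scaling_factor.
total = sum(scores[i] for i in chosen)
weights = {i: routed_scaling_factor * scores[i] / total for i in chosen}
```

Note that after normalization the routed weights sum to exactly routed_scaling_factor (2.5), which is then balanced against the shared expert's contribution.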

Usage

Using Uno.cpp (Recommended for non-technical users)

Download the installer from Uno.cpp Releases, install, launch, and pick your model file. That's it.

Using llama.cpp CLI (For advanced users)

Requires building from the add-sarvam-moe branch:

# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 4096

VRAM Requirements

| Quant | VRAM (full offload) | Offload notes |
|-------|---------------------|---------------|
| Q4_K_M | ~19 GB | All layers on GPU |
| Q6_K | ~26 GB | All layers on GPU (32 GB cards) |
| Q8_0 | ~34 GB | ~70% of layers on GPU (32 GB cards) |
| BF16 | ~64 GB | ~50% of layers on GPU (32 GB cards) |
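Note that almost all of this VRAM goes to weights, not context: with only 4 KV heads (GQA) and 19 layers, the KV cache at full context is tiny. A rough estimate from the figures in Model Details, assuming an fp16 (2 bytes per element) cache:

```python
# KV cache ≈ 2 (K and V) × layers × kv_heads × head_dim × context × bytes/elem
n_layers, n_kv_heads, head_dim, n_ctx = 19, 4, 64, 4096
bytes_per_elem = 2  # fp16 cache assumed
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
print(f"KV cache at full context: ~{kv_bytes / 2**20:.0f} MiB")
```

That works out to well under 100 MiB, so context size has little effect on which quantization fits your card.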

No NVIDIA GPU? You can still run these models on CPU only: set GPU Layers to 0 in Uno.cpp, or pass -ngl 0 to the CLI. Inference will be significantly slower, but it works.

Tested On

  • NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
  • All quantizations produce coherent output

Credits

  • Original model by Sarvam AI
  • Quantized by Sumitc13
  • llama.cpp architecture support based on BailingMoe implementation
  • Uno.cpp: desktop application for running this model