Instructions to use voidash/gemma-helpdesk-v2-e2b-seed42 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with PEFT:

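from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model first, then attach the LoRA adapter on top of it.
# Per "Path B" in the model card, Gemma 4's Gemma4ClippableLinear wrappers
# must be unwrapped before the adapter is loaded.
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base_model, "voidash/gemma-helpdesk-v2-e2b-seed42")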

llama-cpp-python

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama-cpp-python:

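# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the Q4_K_M GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="voidash/gemma-helpdesk-v2-e2b-seed42",
    filename="gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf",
)

# The call returns an OpenAI-style dict; keep it so the reply can be read.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])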

Local Apps

llama.cpp

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Use Docker

docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
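
For scripted access to the `llama-server` instance started above, a small standard-library client may be enough. This is a minimal sketch, assuming the server listens on its default `http://localhost:8080` and exposes the OpenAI-compatible `/v1/chat/completions` route; `build_request` and `ask` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

# Model tag as used with `llama-server -hf` above.
MODEL_ID = "voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M"


def build_request(prompt, base_url="http://localhost:8080/v1", max_tokens=300):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's reply. Pass `base_url="http://localhost:8000/v1"` to target a server on another port.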

vLLM

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "voidash/gemma-helpdesk-v2-e2b-seed42"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "voidash/gemma-helpdesk-v2-e2b-seed42",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Ollama:
```
ollama run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
```

Unsloth Studio

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting

Pi

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Docker Model Runner:
```
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
```

Lemonade

How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

Run and chat with the model

lemonade run user.gemma-helpdesk-v2-e2b-seed42-Q4_K_M

List all available models

lemonade list

Gemma 4 E2B-IT - GovSpeak / PreVillage Helpdesk (SFT v2, seed 42)

LoRA adapter and GGUF export on google/gemma-4-E2B-it.

Status: public edge/demo checkpoint. This repo is useful for Raspberry Pi and llama.cpp experiments, but it is not the current production answer layer. GovSpeak / PreVillage should run Gemma behind a service resolver and retrieval pipeline, not as a naked factual chatbot.

Government facts should not be memorized into weights. Contacts, fees, URLs, office holders, and forms change. The model should compose from retrieved evidence, structured source packs, deterministic extraction, and officer/citizen feedback routes.

Related public assets:

ASR demo: voidash/nepali-fastconformer-demo
ASR model: voidash/nepali-fastconformer-hybrid-asr
TTS demo: ampixa/real-nepali-tts
TTS model: ampixa/real-nepali-v0.2-kala

This is v2 of the gemma-god series. It adds refusal and capability-preservation training slices on top of v1's mix.

See the sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.

TL;DR vs v1

| metric | v1 E4B | v2 E2B | Δ |
| --- | --- | --- | --- |
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt ✓ |
| Roman-NE degen | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |

The refusal slice is learned during training (per-slice loss falls from 5.69 to 0.43), but at the deployed mix proportion (11%) the model refuses correctly on only 13% of refusal items at inference (12/91). v3 plans a 25-30% refusal slice to close this gap.

Recipe


| setting | value |
| --- | --- |
| Base | google/gemma-4-E2B-it |
| Method | LoRA, rsLoRA scaling (Kalajdzievski 2023) |
| r/α | 64/128 |
| Trainable | ~80M / ~5B |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
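
The recipe's numbers can be sanity-checked with a little arithmetic. The sketch below assumes a single training device (so that 2 × 8 gives the stated effective batch of 16) and one record per training example with no sequence packing; neither is stated in this card. The rsLoRA multiplier α/√r follows Kalajdzievski (2023); the constant and function names are illustrative.

```python
import math

# Values copied from the recipe and the training-mix total in this card.
R = 64                  # LoRA rank
ALPHA = 128             # LoRA alpha
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
TRAIN_RECORDS = 11_896
BEST_STEP = 600


def lora_scaling(alpha, r, rslora):
    """Multiplier applied to the low-rank update B @ A."""
    # Standard LoRA divides by r; rsLoRA divides by sqrt(r).
    return alpha / math.sqrt(r) if rslora else alpha / r


# Assumption: one device, so no extra data-parallel factor.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Assumption: every record is one example (no packing).
steps_per_epoch = math.ceil(TRAIN_RECORDS / effective_batch)
epochs_at_best = BEST_STEP * effective_batch / TRAIN_RECORDS

if __name__ == "__main__":
    print(f"rsLoRA scaling:   {lora_scaling(ALPHA, R, rslora=True):.1f}")
    print(f"standard scaling: {lora_scaling(ALPHA, R, rslora=False):.1f}")
    print(f"effective batch:  {effective_batch}")
    print(f"steps per epoch:  {steps_per_epoch}")
    print(f"epochs at best step: {epochs_at_best:.2f}")
```

Under these assumptions the best checkpoint (step 600) falls before the end of the first epoch, and rsLoRA applies a multiplier eight times larger than standard LoRA scaling would at r = 64.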

Training mix (11,896 records)

| slice | records | role |
| --- | --- | --- |
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW: teaches "I cannot find authoritative source" |
| translation_distilled | 500 | NEW: FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW: A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW: short Roman-NE conversational pairs |

Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
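
To see how the mix is balanced, and how large the refusal slice would have to grow to reach the 25-30% planned for v3, the per-slice shares can be computed from the record counts above. This is a rough sketch by raw record count only: the 11% quoted for the refusal slice elsewhere in this card may be measured differently (for example by tokens or sampling weight, which is an assumption here), so the numbers need not match exactly. The helper names are illustrative.

```python
# Record counts copied from the training-mix table.
SLICES = {
    "reverse_instruction": 6553,
    "native_ne_alpaca": 1500,
    "english_replay": 1500,
    "refusal_distilled": 1100,
    "translation_distilled": 500,
    "mc_distilled": 443,
    "brief_qa_distilled": 300,
}


def slice_shares(slices):
    """Return each slice's share of the total record count."""
    total = sum(slices.values())
    return {name: count / total for name, count in slices.items()}


def records_needed(slices, name, target_share):
    """Records `name` needs to hit `target_share` if other slices stay fixed."""
    others = sum(count for n, count in slices.items() if n != name)
    # Solve x / (x + others) = target_share for x.
    return round(target_share * others / (1 - target_share))


if __name__ == "__main__":
    print(f"total records: {sum(SLICES.values())}")
    for name, share in slice_shares(SLICES).items():
        print(f"{name:24s} {share:6.1%}")
    for target in (0.25, 0.30):
        needed = records_needed(SLICES, "refusal_distilled", target)
        print(f"refusal at {target:.0%}: ~{needed} records")
```

Holding the other slices fixed is only one way to rebalance; downsampling the grounded slice would reach the same share with fewer new refusal examples.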

Eval results

| metric | result |
| --- | --- |
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30); English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs. 2/10 in v1; fixed) |

Full per-item results in eval/sft_v2_e2b_seed42/.
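
Several of these rates come from small samples, so it can help to attach an uncertainty range to each. The sketch below uses an approximate 95% Wilson score interval; this is an analysis added here for illustration, not something the evaluation scripts are known to compute. The counts for the LLM judge rows (4/5 and 8/50) are inferred from the percentages in the TL;DR table.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# (label, successes, n); the first four rows use counts from the eval table.
RESULTS = [
    ("refusal_correct", 12, 91),
    ("wrongly_refused", 2, 73),
    ("Belebele NE", 27, 50),
    ("GSM8K-en", 18, 30),
    ("LLM judge v1 (inferred)", 4, 5),
    ("LLM judge v2 (inferred)", 8, 50),
]

if __name__ == "__main__":
    for label, k, n in RESULTS:
        lo, hi = wilson_interval(k, n)
        print(f"{label:26s} {k}/{n} = {k / n:5.1%}  [{lo:5.1%}, {hi:5.1%}]")
```

The interval for n=5 is much wider than the one for n=50, which is the sense in which the v2 judge figure is the more reliable of the two.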

Known limitations

Refusal at 13% vs. 90% target. The refusal slice (1,100 items, 11% of the mix) was undersized relative to the "always answer" prior set by the 6,553-item grounded slice. Per-slice training loss on refusal reached 0.43, so the model knows the format; v3 needs a higher proportion.
Verbose output regression. chrF dropped from 22.09 in v1 to 13.42. The model is more talkative even with the brief_qa and MC slices in the mix. v3 plans to add "be concise" examples and lower the default max_new_tokens at inference.
Belebele -6pt vs. v1. The MC slice (443 items) did not fully offset the verbose-output regression. v3 may enlarge the MC slice and tune max_new_tokens.

How to use

Path A — GGUF on Raspberry Pi 5 / any device (edge demo)

The merged and quantized model files are in gguf/:

| file | size | use case |
| --- | --- | --- |
| `gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf` | 3.4 GB | Pi 5, mini PC, phone; fits within 4 GB of RAM headroom, ~1-3 tok/s on an 8 GB Pi 5 |
| `gguf/gemma-helpdesk-v2-e2b-bf16.gguf` | 9.3 GB | Apple Silicon, mid-range GPU; full-precision baseline |

Quick start:

# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
  gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .

llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  -sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
  -p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
  --jinja -n 300 -t 4

The --jinja flag enables Gemma 4's chat template, which is embedded in the GGUF. The thread count (-t 4) is tuned to the Pi 5's four cores; a Mac or PC can use more.
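
Because the model is meant to answer from retrieved evidence rather than from memory, the user turn would normally carry the source passages alongside the question. The sketch below shows one way to assemble such a chat request. The system prompt is the one from the quick start; the `Source` type, the `build_messages` helper, the "Sources:" layout, and the example URL are all hypothetical, since this card does not document the exact prompt format used in training.

```python
from dataclasses import dataclass

# System prompt copied from the quick-start command.
SYSTEM_PROMPT = (
    "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np "
    "sources, cite each claim with [URL], refuse if no source addresses the "
    "question."
)


@dataclass
class Source:
    """One retrieved passage (hypothetical structure)."""
    url: str
    text: str


def build_messages(question, sources):
    """Return OpenAI-style chat messages with sources inlined before the question."""
    blocks = [f"[{s.url}]\n{s.text.strip()}" for s in sources]
    # An explicit marker gives the model a cue to refuse when retrieval is empty.
    context = "\n\n".join(blocks) if blocks else "(no sources retrieved)"
    user = f"Sources:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


if __name__ == "__main__":
    # Placeholder URL and text, for illustration only.
    demo = [Source("https://example.gov.np/citizenship", "Placeholder passage text.")]
    for message in build_messages("मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?", demo):
        print(f"--- {message['role']} ---\n{message['content']}\n")
```

The resulting list can be passed as `messages` to any OpenAI-style chat API, such as `create_chat_completion` in llama-cpp-python or the `/v1/chat/completions` route of `llama-server`.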

Path B — HF transformers + PEFT adapter (for Python integration)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn

base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)

model = PeftModel.from_pretrained(model, adapter)
model.eval()

Citation

Base: Gemma 4 (Google)
Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)
