Instructions for using voidash/gemma-helpdesk-v2-e2b-seed42 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- PEFT
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base_model, "voidash/gemma-helpdesk-v2-e2b-seed42")

Note: Gemma 4 wraps its Linear layers in Gemma4ClippableLinear, which PEFT cannot inject LoRA into directly; see Path B under "How to use" below for the unwrapping step required before loading the adapter.
- llama-cpp-python
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="voidash/gemma-helpdesk-v2-e2b-seed42",
    filename="gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Use Docker
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "voidash/gemma-helpdesk-v2-e2b-seed42"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "voidash/gemma-helpdesk-v2-e2b-seed42",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'

Use Docker
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
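However you start it, the vLLM server speaks the OpenAI-compatible API, so it can also be called from Python. A minimal sketch using the openai client, assuming the default endpoint at http://localhost:8000/v1 (a local vLLM server ignores the API key):

```python
# pip install openai
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000 assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="voidash/gemma-helpdesk-v2-e2b-seed42",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```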
- Ollama
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Ollama:
ollama run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
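Ollama also serves a local HTTP API once the model has been pulled. A minimal sketch calling its /api/chat endpoint from Python with requests, assuming a default Ollama install on port 11434 and the same model tag as above:

```python
# pip install requests
import requests

# Ollama's default local endpoint (assumes a standard install on port 11434).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["message"]["content"])
```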
- Unsloth Studio
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
- Pi
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M" }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi
- Hermes Agent
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Docker Model Runner:
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
- Lemonade
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Run and chat with the model
lemonade run user.gemma-helpdesk-v2-e2b-seed42-Q4_K_M
List all available models
lemonade list
Gemma 4 E2B-IT — Nepal Gov Helpdesk (SFT v2, seed 42)
LoRA adapter on google/gemma-4-E2B-it. v2 of the gemma-god series — adds refusal + capability-preservation training slices on top of v1's mix.
See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.
TL;DR vs v1
| metric | v1 E4B | v2 E2B | Δ |
|---|---|---|---|
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt ✓ |
| Roman-NE degen | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |
The refusal slice itself trains well (per-slice training loss drops from 5.69 to 0.43), but at its deployed mix proportion (11%) the refusal behavior only fires 13% of the time at inference. v3 plans a 25-30% refusal slice to close this gap.
Recipe
| | |
|---|---|
| Base | google/gemma-4-E2B-it |
| Method | LoRA, rsLoRA scaling (Kalajdzievski 2023) |
| r/α | 64/128 |
| Trainable | ~80M / ~5B |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
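For orientation, the recipe above maps roughly onto a peft/transformers configuration like the sketch below. This is not the author's training script: the LoRA target modules, output path, and optimizer string are assumptions, since the card doesn't list them.

```python
# Reconstruction of the recipe table, not the original training code.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    use_rslora=True,              # rsLoRA scaling (Kalajdzievski 2023)
    target_modules="all-linear",  # assumption: the card doesn't list target modules
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="gemma-helpdesk-v2-e2b-seed42",  # placeholder
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch 16
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",         # 8-bit AdamW
    seed=42,
)
```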
Training mix (11,896 records)
| slice | records | role |
|---|---|---|
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW — teaches "I cannot find authoritative source" |
| translation_distilled | 500 | NEW — FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW — A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW — short Roman-NE conversational pairs |
Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
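To inspect a split locally, the JSONL files can be pulled straight from the dataset repo. A minimal sketch with huggingface_hub, assuming the file names listed above (per-record field names aren't documented here):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch one split from the dataset repo (file name taken from the list above).
path = hf_hub_download(
    repo_id="voidash/gemma-helpdesk-data",
    filename="sft_v2_val.jsonl",
    repo_type="dataset",
)

# JSONL: one training record per line.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records; first record keys:", list(records[0].keys()))
```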
Eval results
| metric | result |
|---|---|
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30) — English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs v1's 2/10 — FIXED) |
Full per-item results in eval/sft_v2_e2b_seed42/.
Known limitations
- Refusal at 13% vs 90% target. Refusal slice (1,100 items, 11% of mix) was undersized vs the 6,553-item grounded slice's "always answer" prior. Per-slice training loss on refusal hit 0.43 — model knows the format. Need higher proportion in v3.
- Verbose output regression. chrF dropped further from v1 (22→13). Model is more talkative even with brief_qa + MC slices in the mix. v3 plans "be concise" examples + lower default max_new_tokens at inference.
- Belebele -6pt vs v1. MC slice (443 items) didn't fully cancel the verbose-output regression. v3 may bump MC slice + tune max_new_tokens.
How to use
Path A — GGUF on Raspberry Pi 5 / any device (recommended for deployment)
The merged + quantized model is at gguf/:
| file | size | use case |
|---|---|---|
| `gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf` | 3.4 GB | Pi 5, mini PC, phone — fits 4 GB RAM headroom, ~1-3 tok/s on Pi 5 8GB |
| `gguf/gemma-helpdesk-v2-e2b-bf16.gguf` | 9.3 GB | Apple Silicon, mid-range GPU — full precision baseline |
Quick start:
# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .
llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
-sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
-p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
--jinja -n 300 -t 4
The --jinja flag enables Gemma 4's chat template (which is embedded in the GGUF). The thread count (-t 4) is tuned to the Pi 5's quad-core CPU; a Mac or desktop PC can use more threads.
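The same GGUF can be driven from Python via llama-cpp-python instead of llama-cli. A minimal sketch assuming the file downloaded above and the same helpdesk system prompt; create_chat_completion applies the chat template embedded in the GGUF, and the context size is an assumption to adjust for your RAM budget:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf",
    n_ctx=4096,    # assumption: shrink or grow to fit your RAM budget
    n_threads=4,   # matches the Pi 5's quad-core CPU, as with -t 4 above
)

out = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a Nepal-government helpdesk. Answer using ONLY provided "
                "gov.np sources, cite each claim with [URL], refuse if no source "
                "addresses the question."
            ),
        },
        {"role": "user", "content": "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?"},
    ],
    max_tokens=300,  # mirrors -n 300 in the llama-cli command
)
print(out["choices"][0]["message"]["content"])
```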
Path B — HF transformers + PEFT adapter (for Python integration)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn
base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
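The snippet above only loads the adapter. A short generation sketch follows, reusing the sample question from Path A; the prompt layout and the max_new_tokens cap are assumptions (a lower cap is in line with the verbosity note under Known limitations):

```python
# Generation sketch; the prompt format is an assumption, not the card's spec.
messages = [{"role": "user", "content": "मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?"}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=300)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```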
Citation
- Base: Gemma 4 (Google)
- Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
- Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)