Instructions for using voidash/gemma-helpdesk-v2-e2b-seed42 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- PEFT
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base_model, "voidash/gemma-helpdesk-v2-e2b-seed42")

Note: Gemma 4 wraps its Linear layers in Gemma4ClippableLinear, which PEFT cannot inject LoRA into directly; see Path B under "How to use" below for the unwrapping step required before loading the adapter.
- llama-cpp-python
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="voidash/gemma-helpdesk-v2-e2b-seed42",
    filename="gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Use Docker
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "voidash/gemma-helpdesk-v2-e2b-seed42"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "voidash/gemma-helpdesk-v2-e2b-seed42",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'

Use Docker
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
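However you start it, the vLLM server speaks the OpenAI-compatible API, so it can also be called from Python. A minimal sketch using the openai client, assuming the default endpoint at http://localhost:8000/v1 (a local vLLM server ignores the API key):

```python
# pip install openai
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000 assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="voidash/gemma-helpdesk-v2-e2b-seed42",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```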
- Ollama
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Ollama:
ollama run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
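Ollama also serves a local HTTP API once the model has been pulled. A minimal sketch calling its /api/chat endpoint from Python with requests, assuming a default Ollama install on port 11434 and the same model tag as above:

```python
# pip install requests
import requests

# Ollama's default local endpoint (assumes a standard install on port 11434).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["message"]["content"])
```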
- Unsloth Studio
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for voidash/gemma-helpdesk-v2-e2b-seed42 to start chatting
- Pi
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M" }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi
- Hermes Agent
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Docker Model Runner:
docker model run hf.co/voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
- Lemonade
How to use voidash/gemma-helpdesk-v2-e2b-seed42 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Run and chat with the model
lemonade run user.gemma-helpdesk-v2-e2b-seed42-Q4_K_M
List all available models
lemonade list
Gemma 4 E2B-IT — Nepal Gov Helpdesk (SFT v2, seed 42)
LoRA adapter on google/gemma-4-E2B-it. v2 of the gemma-god series — adds refusal + capability-preservation training slices on top of v1's mix.
See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.
TL;DR vs v1
| metric | v1 E4B | v2 E2B | Δ |
|---|---|---|---|
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt ✓ |
| Roman-NE degen | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |
The refusal slice itself trains well (per-slice training loss drops from 5.69 to 0.43), but at its deployed mix proportion (11%) the refusal behavior only fires 13% of the time at inference. v3 plans a 25-30% refusal slice to close this gap.
Recipe
| | |
|---|---|
| Base | google/gemma-4-E2B-it |
| Method | LoRA, rsLoRA scaling (Kalajdzievski 2023) |
| r/α | 64/128 |
| Trainable | ~80M / ~5B |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
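For orientation, the recipe above maps roughly onto a peft/transformers configuration like the sketch below. This is not the author's training script: the LoRA target modules, output path, and optimizer string are assumptions, since the card doesn't list them.

```python
# Reconstruction of the recipe table, not the original training code.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    use_rslora=True,              # rsLoRA scaling (Kalajdzievski 2023)
    target_modules="all-linear",  # assumption: the card doesn't list target modules
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="gemma-helpdesk-v2-e2b-seed42",  # placeholder
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch 16
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",         # 8-bit AdamW
    seed=42,
)
```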
Training mix (11,896 records)
| slice | records | role |
|---|---|---|
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW — teaches "I cannot find authoritative source" |
| translation_distilled | 500 | NEW — FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW — A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW — short Roman-NE conversational pairs |
Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
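To inspect a split locally, the JSONL files can be pulled straight from the dataset repo. A minimal sketch with huggingface_hub, assuming the file names listed above (per-record field names aren't documented here):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch one split from the dataset repo (file name taken from the list above).
path = hf_hub_download(
    repo_id="voidash/gemma-helpdesk-data",
    filename="sft_v2_val.jsonl",
    repo_type="dataset",
)

# JSONL: one training record per line.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records; first record keys:", list(records[0].keys()))
```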
Eval results
| metric | result |
|---|---|
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30) — English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs v1's 2/10 — FIXED) |
Full per-item results in eval/sft_v2_e2b_seed42/.
Known limitations
- Refusal at 13% vs 90% target. Refusal slice (1,100 items, 11% of mix) was undersized vs the 6,553-item grounded slice's "always answer" prior. Per-slice training loss on refusal hit 0.43 — model knows the format. Need higher proportion in v3.
- Verbose output regression. chrF dropped further from v1 (22→13). Model is more talkative even with brief_qa + MC slices in the mix. v3 plans "be concise" examples + lower default max_new_tokens at inference.
- Belebele -6pt vs v1. MC slice (443 items) didn't fully cancel the verbose-output regression. v3 may bump MC slice + tune max_new_tokens.
How to use
Path A — GGUF on Raspberry Pi 5 / any device (recommended for deployment)
The merged + quantized model is at gguf/:
| file | size | use case |
|---|---|---|
| `gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf` | 3.4 GB | Pi 5, mini PC, phone — fits 4 GB RAM headroom, ~1-3 tok/s on Pi 5 8GB |
| `gguf/gemma-helpdesk-v2-e2b-bf16.gguf` | 9.3 GB | Apple Silicon, mid-range GPU — full precision baseline |
Quick start:
# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .
llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
-sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
-p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
--jinja -n 300 -t 4
The --jinja flag enables Gemma 4's chat template (which is embedded in the GGUF). The thread count (-t 4) is tuned to the Pi 5's quad-core CPU; a Mac or desktop PC can use more threads.
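The same GGUF can be driven from Python via llama-cpp-python instead of llama-cli. A minimal sketch assuming the file downloaded above and the same helpdesk system prompt; create_chat_completion applies the chat template embedded in the GGUF, and the context size is an assumption to adjust for your RAM budget:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf",
    n_ctx=4096,    # assumption: shrink or grow to fit your RAM budget
    n_threads=4,   # matches the Pi 5's quad-core CPU, as with -t 4 above
)

out = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a Nepal-government helpdesk. Answer using ONLY provided "
                "gov.np sources, cite each claim with [URL], refuse if no source "
                "addresses the question."
            ),
        },
        {"role": "user", "content": "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?"},
    ],
    max_tokens=300,  # mirrors -n 300 in the llama-cli command
)
print(out["choices"][0]["message"]["content"])
```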
Path B — HF transformers + PEFT adapter (for Python integration)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn
base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
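The snippet above only loads the adapter. A short generation sketch follows, reusing the sample question from Path A; the prompt layout and the max_new_tokens cap are assumptions (a lower cap is in line with the verbosity note under Known limitations):

```python
# Generation sketch; the prompt format is an assumption, not the card's spec.
messages = [{"role": "user", "content": "मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?"}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=300)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```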
Citation
- Base: Gemma 4 (Google)
- Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
- Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)