Gemma 4 E2B-IT - GovSpeak / PreVillage Helpdesk (SFT v2, seed 42)
LoRA adapter and GGUF export on google/gemma-4-E2B-it.
Status: public edge/demo checkpoint. This repo is useful for Raspberry Pi and llama.cpp experiments, but it is not the current production answer layer. GovSpeak / PreVillage should run Gemma behind a service resolver and retrieval pipeline, not as a naked factual chatbot.
Government facts should not be memorized into weights. Contacts, fees, URLs, office holders, and forms change. The model should compose from retrieved evidence, structured source packs, deterministic extraction, and officer/citizen feedback routes.
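The "compose from retrieved evidence" rule can be sketched as a small gate in front of the model. This is an illustration only: `Source`, `answer_with_evidence`, the `retrieve` and `generate` callables, and the exact refusal wording are hypothetical names, not part of this repo or of the production resolver.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder refusal text (assumption); the production wording is not given in this card.
REFUSAL = "I cannot find an authoritative source for this question."


@dataclass
class Source:
    url: str   # where the evidence came from, used for [URL] citations
    text: str  # the retrieved passage


def answer_with_evidence(
    question: str,
    retrieve: Callable[[str], List[Source]],
    generate: Callable[[str], str],
) -> str:
    """Answer only from retrieved sources; refuse deterministically otherwise."""
    sources = retrieve(question)
    if not sources:
        # No evidence: skip the model entirely so it cannot answer from memory.
        return REFUSAL
    evidence = "\n\n".join(f"[{s.url}]\n{s.text}" for s in sources)
    prompt = f"Sources:\n{evidence}\n\nQuestion: {question}"
    return generate(prompt)
```

The point of the design is that the refusal path does not depend on the model's own refusal behaviour: when retrieval returns nothing, the model is never called.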
Related public assets:
- ASR demo: voidash/nepali-fastconformer-demo
- ASR model: voidash/nepali-fastconformer-hybrid-asr
- TTS demo: ampixa/real-nepali-tts
- TTS model: ampixa/real-nepali-v0.2-kala
This is v2 of the gemma-god series. It adds refusal and capability-preservation training slices on top of v1's mix.
See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.
TL;DR vs v1
| metric | v1 E4B | v2 E2B | Δ |
|---|---|---|---|
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt ✓ |
| Roman-NE degen | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |
The refusal slice does train: its per-slice training loss falls from 5.69 to 0.43. At the deployed mix proportion (11%), however, the model refuses on only 13% of refusal items at inference (12/91). v3 plans to raise the refusal slice to 25-30% of the mix to close this gap.
Recipe
| setting | value |
|---|---|
| Base | google/gemma-4-E2B-it |
| Method | LoRA, rsLoRA scaling (Kalajdzievski 2023) |
| r/α | 64/128 |
| Trainable | ~80M / ~5B |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
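To make the r/α and batch rows concrete, the arithmetic can be worked through as below. This is a sketch, not the training script: it uses the standard LoRA scale α/r and the rsLoRA scale α/√r, and it assumes a single device, which the "per-device 2 × grad-accum 8" row suggests but does not state.

```python
import math

r, alpha = 64, 128

# Classic LoRA multiplies the adapter update by alpha / r.
standard_scale = alpha / r
# rsLoRA (rank-stabilized LoRA) uses alpha / sqrt(r) instead.
rslora_scale = alpha / math.sqrt(r)

# Effective batch = per-device batch x gradient-accumulation steps
# (x number of devices, assumed to be 1 here).
per_device_batch, grad_accum, num_devices = 2, 8, 1
effective_batch = per_device_batch * grad_accum * num_devices

print(f"standard LoRA scale: {standard_scale}")   # 2.0
print(f"rsLoRA scale:        {rslora_scale}")     # 16.0
print(f"effective batch:     {effective_batch}")  # 16
```

At r=64 the rsLoRA scale is 8× the classic one, which is why the same α behaves very differently under the two schemes. PEFT exposes this choice via `LoraConfig(use_rslora=True)`; the card does not say whether the adapter was trained that way.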
Training mix (11,896 records)
| slice | records | role |
|---|---|---|
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW — teaches "I cannot find authoritative source" |
| translation_distilled | 500 | NEW — FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW — A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW — short Roman-NE conversational pairs |
Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
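The mix proportions can be recomputed from the record counts in the table. The sketch below does that and also estimates how many refusal records a given target share would need if every other slice stayed the same size, which is an assumption rather than the stated v3 plan. By raw record count the refusal slice comes out at about 9.2%; the 11% quoted in this card may be measured on a different basis (for example tokens), which is not stated.

```python
import math

# Record counts copied from the training-mix table.
SLICES = {
    "reverse_instruction": 6553,
    "native_ne_alpaca": 1500,
    "english_replay": 1500,
    "refusal_distilled": 1100,
    "translation_distilled": 500,
    "mc_distilled": 443,
    "brief_qa_distilled": 300,
}


def slice_shares(slices):
    """Return each slice's fraction of the total record count."""
    total = sum(slices.values())
    return {name: n / total for name, n in slices.items()}


def records_for_target_share(slices, name, target):
    """Records `name` needs to reach `target` share, other slices held fixed."""
    others = sum(n for k, n in slices.items() if k != name)
    # Solve x / (others + x) = target for x, rounding up.
    return math.ceil(target * others / (1.0 - target))


for name, share in slice_shares(SLICES).items():
    print(f"{name:24s} {share:6.1%}")

for target in (0.25, 0.30):
    needed = records_for_target_share(SLICES, "refusal_distilled", target)
    print(f"refusal share {target:.0%} -> {needed} records")
```

Under that fixed-size assumption, a 25-30% refusal share corresponds to roughly 3,600 to 4,600 refusal records.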
Eval results
| metric | result |
|---|---|
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30) — English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs v1's 2/10 — FIXED) |
Full per-item results in eval/sft_v2_e2b_seed42/.
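For readers reproducing the refusal numbers, the four outcomes can be tallied as below. The metric names follow the table; the record fields `should_refuse` and `did_refuse` are hypothetical, since the per-item file schema is not described here.

```python
from typing import Dict, Iterable, Mapping


def tally_refusals(items: Iterable[Mapping[str, bool]]) -> Dict[str, int]:
    """Count refusal outcomes over per-item eval records."""
    counts = {
        "refusal_correct": 0,   # should refuse, did refuse
        "hallucinated": 0,      # should refuse, answered anyway
        "wrongly_refused": 0,   # answerable, but refused
        "answered": 0,          # answerable, and answered
    }
    for item in items:
        if item["should_refuse"]:
            key = "refusal_correct" if item["did_refuse"] else "hallucinated"
        else:
            key = "wrongly_refused" if item["did_refuse"] else "answered"
        counts[key] += 1
    return counts


# Synthetic records shaped to match the reported counts (12/91 and 2/73).
records = (
    [{"should_refuse": True, "did_refuse": True}] * 12
    + [{"should_refuse": True, "did_refuse": False}] * 79
    + [{"should_refuse": False, "did_refuse": True}] * 2
    + [{"should_refuse": False, "did_refuse": False}] * 71
)
counts = tally_refusals(records)
print(counts)
print(f"refusal_correct rate: {counts['refusal_correct'] / 91:.1%}")
print(f"wrongly_refused rate: {counts['wrongly_refused'] / 73:.1%}")
```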
Known limitations
- Refusal at 13% vs 90% target. The refusal slice (1,100 items, 11% of mix) was undersized relative to the "always answer" prior set by the 6,553-item grounded slice. Per-slice training loss on refusal reached 0.43, so the model has learned the refusal format; v3 needs a higher proportion.
- Verbose output regression. chrF dropped from v1's 22.09 to 13.42. The model is more talkative even with the brief_qa and MC slices in the mix. v3 plans "be concise" examples plus a lower default max_new_tokens at inference.
- Belebele -6pt vs v1. The MC slice (443 items) didn't fully cancel the verbose-output regression. v3 may enlarge the MC slice and tune max_new_tokens.
How to use
Path A — GGUF on Raspberry Pi 5 / any device (edge demo)
The merged + quantized model is at gguf/:
| file | size | use case |
|---|---|---|
| gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf | 3.4 GB | Pi 5, mini PC, phone — fits 4 GB RAM headroom, ~1-3 tok/s on Pi 5 8GB |
| gguf/gemma-helpdesk-v2-e2b-bf16.gguf | 9.3 GB | Apple Silicon, mid-range GPU — full precision baseline |
Quick start:
```shell
# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
  gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .

llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  -sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
  -p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
  --jinja -n 300 -t 4
```
The --jinja flag enables Gemma 4's chat template, which is embedded in the GGUF. The thread count (-t 4) is tuned to the Pi 5's four cores; a Mac or PC can use more threads.
Path B — HF transformers + PEFT adapter (for Python integration)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn

base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)

model = PeftModel.from_pretrained(model, adapter)
model.eval()
```
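The snippet above stops once the adapter is loaded. A minimal sketch of building a grounded prompt for it follows. The system prompt is the one from the llama-cli example; the `build_messages` helper and the "Sources / Question" layout are assumptions, since the prompt format used in training is not documented here, and it is also an assumption that Gemma 4's chat template accepts a system role in transformers.

```python
from typing import Dict, List, Sequence, Tuple

SYSTEM_PROMPT = (
    "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np "
    "sources, cite each claim with [URL], refuse if no source addresses the question."
)


def build_messages(
    question: str, sources: Sequence[Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Build chat messages from a question and (url, text) evidence pairs."""
    evidence = "\n\n".join(f"[{url}]\n{text}" for url, text in sources)
    user = f"Sources:\n{evidence}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


# Usage with `model` and `tokenizer` loaded as above (not executed here):
#
# messages = build_messages(question, retrieved_sources)
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True,
#     return_tensors="pt", return_dict=True,
# ).to(model.device)
# with torch.no_grad():
#     out = model.generate(**inputs, max_new_tokens=300)
# prompt_len = inputs["input_ids"].shape[-1]
# print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```

`max_new_tokens=300` mirrors the `-n 300` used in the llama-cli example; given the verbosity noted under known limitations, a lower value may be worth trying.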
Citation
- Base: Gemma 4 (Google)
- Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
- Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)