How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull voidash/gemma-helpdesk-v2-e2b-seed42:Q4_K_M
Run and chat with the model
lemonade run user.gemma-helpdesk-v2-e2b-seed42-Q4_K_M
List all available models
lemonade list
Quick Links

Gemma 4 E2B-IT - GovSpeak / PreVillage Helpdesk (SFT v2, seed 42)

LoRA adapter and GGUF export on google/gemma-4-E2B-it.

Status: public edge/demo checkpoint. This repo is useful for Raspberry Pi and llama.cpp experiments, but it is not the current production answer layer. GovSpeak / PreVillage should run Gemma behind a service resolver and retrieval pipeline, not as a naked factual chatbot.

Government facts should not be memorized into weights. Contacts, fees, URLs, office holders, and forms change. The model should compose from retrieved evidence, structured source packs, deterministic extraction, and officer/citizen feedback routes.

Related public assets:

This is v2 of the gemma-god series. It adds refusal and capability-preservation training slices on top of v1's mix.

See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.

TL;DR vs v1

metric v1 E4B v2 E2B Δ
URL recall (grounded) 0.89 0.89 0 (held)
refusal_correct 0/91 12/91 +13.2pt
Roman-NE degen 2/10 0/10 FIXED ✓
GSM8K-en 50% 60% +10pt ✓
Belebele 60% 54% -6pt
chrF (grounded) 22.09 13.42 -8.67
LLM judge correct% 80% (n=5) 16% (n=50, more reliable) partial answers

The refusal slice mechanism teaches (per-slice training loss 5.69→0.43), but at the deployed mix proportion (11%), it only fires 13% of the time at inference. v3 plans 25-30% refusal slice to close this.

Recipe

Base google/gemma-4-E2B-it
Method LoRA, rsLoRA scaling (Kalajdzievski 2023)
r/α 64/128
Trainable ~80M / ~5B
Optimizer AdamW 8-bit
LR 1e-4, cosine + 100-step warmup
Effective batch 16 (per-device 2 × grad-accum 8)
Memory bf16 + gradient checkpointing
Best step 600 (val 0.848)
Wall time ~2h training

Training mix (11,896 records)

slice records role
reverse_instruction (grounded) 6,553 gov.np helpdesk task
native_ne_alpaca 1,500 Devanagari instruction style anchor
english_replay 1,500 anti-forgetting English
refusal_distilled 1,100 NEW — teaches "I cannot find authoritative source"
translation_distilled 500 NEW — FLORES-200 NE↔EN pairs
mc_distilled 443 NEW — A/B/C/D format for benchmark questions
brief_qa_distilled 300 NEW — short Roman-NE conversational pairs

Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).

Eval results

metric result
Full Gold (167 items) URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%)
Refusal items correct 12/91 (13.2%), hallucinated 79
LLM judge (n=50) groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5
Belebele NE (n=50) 54.0% (27/50)
GSM8K-en (n=30) 60.0% (18/30) — English replay preserved
Roman-NE qualitative (n=10) 0/10 degeneration (vs v1's 2/10 — FIXED)

Full per-item results in eval/sft_v2_e2b_seed42/.

Known limitations

  1. Refusal at 13% vs 90% target. Refusal slice (1,100 items, 11% of mix) was undersized vs the 6,553-item grounded slice's "always answer" prior. Per-slice training loss on refusal hit 0.43 — model knows the format. Need higher proportion in v3.
  2. Verbose output regression. chrF dropped further from v1 (22→13). Model is more talkative even with brief_qa + MC slices in the mix. v3 plans "be concise" examples + lower default max_new_tokens at inference.
  3. Belebele -6pt vs v1. MC slice (443 items) didn't fully cancel the verbose-output regression. v3 may bump MC slice + tune max_new_tokens.

How to use

Path A — GGUF on Raspberry Pi 5 / any device (edge demo)

The merged + quantized model is at gguf/:

file size use case
gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf 3.4 GB Pi 5, mini PC, phone — fits 4 GB RAM headroom, ~1-3 tok/s on Pi 5 8GB
gguf/gemma-helpdesk-v2-e2b-bf16.gguf 9.3 GB Apple Silicon, mid-range GPU — full precision baseline

Quick start:

# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
  gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .

llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  -sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
  -p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
  --jinja -n 300 -t 4

The --jinja flag enables Gemma 4's chat template (which is embedded in the GGUF). Threading (-t 4) tuned to Pi 5's quad core; mac/PC can use more.

Path B — HF transformers + PEFT adapter (for Python integration)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn

base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)

model = PeftModel.from_pretrained(model, adapter)
model.eval()

Citation

  • Base: Gemma 4 (Google)
  • Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
  • Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)
Downloads last month
427
GGUF
Model size
5B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for voidash/gemma-helpdesk-v2-e2b-seed42

Adapter
(80)
this model