# Gemma 4 E2B-IT — Nepal Gov Helpdesk (SFT v2, seed 42)
LoRA adapter on google/gemma-4-E2B-it. v2 of the gemma-god series — adds refusal + capability-preservation training slices on top of v1's mix.
See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.
## TL;DR vs v1
| metric | v1 E4B | v2 E2B | Δ |
|---|---|---|---|
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt ✓ |
| Roman-NE degen | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |
The refusal slice does teach the mechanism (per-slice training loss drops 5.69→0.43), but at the deployed mix proportion (11%) it fires only 13.2% of the time at inference. v3 plans a 25-30% refusal slice to close the gap.
## Recipe
| setting | value |
|---|---|
| Base | google/gemma-4-E2B-it |
| Method | LoRA, rsLoRA scaling (Kalajdzievski 2023) |
| r/α | 64/128 |
| Trainable | ~80M / ~5B |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
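The rsLoRA row refers to the rank-stabilized scaling from Kalajdzievski (2023): the low-rank update is scaled by α/√r instead of classic LoRA's α/r, so the update does not shrink as rank grows. In PEFT this corresponds to `use_rslora=True` in `LoraConfig`. A quick sketch of the difference at this card's r/α:

```python
import math

def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    """Scaling factor applied to the low-rank update B @ A @ x."""
    # Classic LoRA: alpha / r.  rsLoRA: alpha / sqrt(r), which keeps the
    # update magnitude stable at high ranks like r=64.
    return alpha / math.sqrt(r) if rslora else alpha / r

# This card's config: r=64, alpha=128
print(lora_scale(128, 64, rslora=False))  # classic -> 2.0
print(lora_scale(128, 64, rslora=True))   # rsLoRA  -> 16.0
```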
## Training mix (11,896 records)
| slice | records | role |
|---|---|---|
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW — teaches "I cannot find authoritative source" |
| translation_distilled | 500 | NEW — FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW — A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW — short Roman-NE conversational pairs |
Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
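As a sanity check, the slice counts above sum to the stated 11,896 records. A small sketch computing each slice's share by raw record count (numbers copied from the table):

```python
# Slice sizes copied from the training-mix table above
mix = {
    "reverse_instruction": 6553,
    "native_ne_alpaca": 1500,
    "english_replay": 1500,
    "refusal_distilled": 1100,
    "translation_distilled": 500,
    "mc_distilled": 443,
    "brief_qa_distilled": 300,
}
total = sum(mix.values())
assert total == 11896  # matches the section heading

for name, n in sorted(mix.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {n:5d}  {n / total:6.1%}")
```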
## Eval results
| metric | result |
|---|---|
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30) — English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs v1's 2/10 — FIXED) |
Full per-item results are in `eval/sft_v2_e2b_seed42/`.
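The URL-recall scorer itself isn't published in this card; a minimal sketch of how such a metric could be computed, assuming answers cite sources in the `[URL]` format the system prompt requests (the example URL below is hypothetical):

```python
import re

# Citations are assumed to look like [https://...], per the system prompt
CITATION_RE = re.compile(r"\[(https?://[^\]\s]+)\]")

def url_recall(answer: str, gold_urls: set[str]) -> float:
    """Fraction of gold source URLs cited as [URL] in the answer."""
    if not gold_urls:  # refusal items carry no gold URLs
        return 1.0
    cited = set(CITATION_RE.findall(answer))
    return len(cited & gold_urls) / len(gold_urls)

answer = (
    "Report the loss at your District Administration Office "
    "[https://www.moha.gov.np/en/page/example]."
)
print(url_recall(answer, {"https://www.moha.gov.np/en/page/example"}))  # 1.0
```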
## Known limitations
- Refusal at 13% vs 90% target. Refusal slice (1,100 items, 11% of mix) was undersized vs the 6,553-item grounded slice's "always answer" prior. Per-slice training loss on refusal hit 0.43 — model knows the format. Need higher proportion in v3.
- Verbose output regression. chrF dropped further from v1 (22→13). Model is more talkative even with brief_qa + MC slices in the mix. v3 plans "be concise" examples + lower default max_new_tokens at inference.
- Belebele -6pt vs v1. MC slice (443 items) didn't fully cancel the verbose-output regression. v3 may bump MC slice + tune max_new_tokens.
## How to use
### Path A — GGUF on Raspberry Pi 5 / any device (recommended for deployment)
The merged + quantized model is at gguf/:
| file | size | use case |
|---|---|---|
| `gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf` | 3.4 GB | Pi 5, mini PC, phone — fits 4 GB RAM headroom, ~1-3 tok/s on Pi 5 8GB |
| `gguf/gemma-helpdesk-v2-e2b-bf16.gguf` | 9.3 GB | Apple Silicon, mid-range GPU — full precision baseline |
Quick start:

```sh
# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
  gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .

# Question (Nepali): "What should I do if my citizenship certificate is lost?"
llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  -sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
  -p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
  --jinja -n 300 -t 4
```
The `--jinja` flag enables Gemma 4's chat template (embedded in the GGUF). `-t 4` matches the Pi 5's four cores; Mac/PC can use more threads.
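The system prompt tells the model to answer "using ONLY provided gov.np sources", so a retrieval step must inject those sources into the prompt. The exact grounding layout the model was trained on isn't shown in this card; a hypothetical sketch of assembling the user turn (URL and passage below are made up — check voidash/gemma-helpdesk-data for the real format):

```python
def build_user_turn(question: str, sources: list[tuple[str, str]]) -> str:
    """Join retrieved (url, passage) pairs with the user question.

    NOTE: this layout is an assumption, not the documented training format.
    """
    parts = [f"Source [{url}]:\n{passage}" for url, passage in sources]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

prompt = build_user_turn(
    "What should I do if my citizenship certificate is lost?",
    [("https://www.moha.gov.np/en/page/example",
      "Report the loss to your District Administration Office...")],
)
print(prompt)
```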
### Path B — HF transformers + PEFT adapter (for Python integration)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn

base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)

model = PeftModel.from_pretrained(model, adapter)
model.eval()
```
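The unwrap loop above is a generic swap-the-child pattern: snapshot `named_children()`, then `setattr` the inner module back onto the parent. A dependency-free sketch with stand-in classes (all class names here are stand-ins, not the real Gemma 4 or torch classes) showing the pattern in isolation:

```python
class Linear:                  # stand-in for torch.nn.Linear
    pass

class ClippableLinear:         # stand-in for the Gemma4ClippableLinear wrapper
    def __init__(self, linear):
        self.linear = linear

class Module:                  # stand-in for a container module
    def named_children(self):
        return list(vars(self).items())

class Block(Module):
    def __init__(self):
        self.q_proj = ClippableLinear(Linear())
        self.up = ClippableLinear(Linear())

def unwrap(module):
    # Snapshot children before mutating, then swap wrappers for their inner
    # Linear and recurse into container modules.
    for name, child in list(module.named_children()):
        if isinstance(child, ClippableLinear):
            setattr(module, name, child.linear)  # drop the wrapper
        elif isinstance(child, Module):
            unwrap(child)

m = Block()
unwrap(m)
print(type(m.q_proj).__name__, type(m.up).__name__)  # Linear Linear
```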
## Citation
- Base: Gemma 4 (Google)
- Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
- Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)