Gemma 4 E2B-IT — Nepal Gov Helpdesk (SFT v2, seed 42)

LoRA adapter on google/gemma-4-E2B-it. v2 of the gemma-god series — adds refusal + capability-preservation training slices on top of v1's mix.

See sibling repo for v1 on E4B: voidash/gemma-helpdesk-seed42.

TL;DR vs v1

| metric | v1 (E4B) | v2 (E2B) | Δ |
|---|---|---|---|
| URL recall (grounded) | 0.89 | 0.89 | 0 (held) |
| refusal_correct | 0/91 | 12/91 | +13.2pt |
| Roman-NE degeneration | 2/10 | 0/10 | FIXED ✓ |
| GSM8K-en | 50% | 60% | +10pt ✓ |
| Belebele | 60% | 54% | -6pt |
| chrF (grounded) | 22.09 | 13.42 | -8.67 |
| LLM judge correct% | 80% (n=5) | 16% (n=50, more reliable) | partial answers |

The refusal slice does teach the behavior (per-slice training loss drops 5.69→0.43), but at the deployed mix proportion (11%) it fires only 13% of the time at inference. v3 plans a 25-30% refusal slice to close this gap.

Recipe

| setting | value |
|---|---|
| Base | google/gemma-4-E2B-it |
| Method | LoRA with rsLoRA scaling (Kalajdzievski 2023) |
| r / α | 64 / 128 |
| Trainable | ~80M of ~5B params |
| Optimizer | AdamW 8-bit |
| LR | 1e-4, cosine schedule + 100-step warmup |
| Effective batch | 16 (per-device 2 × grad-accum 8) |
| Memory | bf16 + gradient checkpointing |
| Best step | 600 (val 0.848) |
| Wall time | ~2h training |
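
rsLoRA scales the adapter update by α/√r (here 128/√64 = 16) instead of the standard α/r (here 2), which keeps update magnitudes stable at higher ranks. A minimal PEFT config sketch matching the recipe above; target_modules is an assumption, the card does not list them:

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    use_rslora=True,  # rsLoRA scaling: alpha / sqrt(r) instead of alpha / r
    # Assumed attention projections; not stated in the card.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)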

Training mix (11,896 records)

| slice | records | role |
|---|---|---|
| reverse_instruction (grounded) | 6,553 | gov.np helpdesk task |
| native_ne_alpaca | 1,500 | Devanagari instruction-style anchor |
| english_replay | 1,500 | anti-forgetting English |
| refusal_distilled | 1,100 | NEW — teaches "I cannot find an authoritative source" |
| translation_distilled | 500 | NEW — FLORES-200 NE↔EN pairs |
| mc_distilled | 443 | NEW — A/B/C/D format for benchmark questions |
| brief_qa_distilled | 300 | NEW — short Roman-NE conversational pairs |

Training data: voidash/gemma-helpdesk-data (sft_v2_train.jsonl + sft_v2_val.jsonl + per-slice files).
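
A minimal loading sketch, assuming the JSONL splits sit at the root of the data repo named above:

from datasets import load_dataset

# Pull the v2 train/val splits directly from the Hub data repo.
data = load_dataset(
    "voidash/gemma-helpdesk-data",
    data_files={"train": "sft_v2_train.jsonl", "validation": "sft_v2_val.jsonl"},
)
print(data)  # inspect split sizes and fields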

Eval results

| metric | result |
|---|---|
| Full Gold (167 items) | URL recall 0.89, chrF 13.42, wrongly_refused 2/73 (2.7%) |
| Refusal items | correct 12/91 (13.2%), hallucinated 79 |
| LLM judge (n=50) | groundedness 3.12/5, citation_correctness 4.70/5, helpfulness 2.98/5 |
| Belebele NE (n=50) | 54.0% (27/50) |
| GSM8K-en (n=30) | 60.0% (18/30) — English replay preserved |
| Roman-NE qualitative (n=10) | 0/10 degeneration (vs v1's 2/10 — FIXED) |

Full per-item results in eval/sft_v2_e2b_seed42/.

Known limitations

  1. Refusal at 13% vs the 90% target. The refusal slice (1,100 items, 11% of the mix) was undersized against the 6,553-item grounded slice's "always answer" prior. Per-slice training loss on refusal hit 0.43, so the model knows the format; v3 needs a higher proportion.
  2. Verbose-output regression. chrF dropped further from v1 (22→13). The model is more talkative even with the brief_qa and MC slices in the mix. v3 plans "be concise" examples plus a lower default max_new_tokens at inference.
  3. Belebele -6pt vs v1. The MC slice (443 items) didn't fully offset the verbose-output regression. v3 may grow the MC slice and tune max_new_tokens.

How to use

Path A — GGUF on Raspberry Pi 5 / any device (recommended for deployment)

The merged + quantized model is at gguf/:

| file | size | use case |
|---|---|---|
| gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf | 3.4 GB | Pi 5, mini PC, phone — fits in 4 GB RAM with headroom; ~1-3 tok/s on a Pi 5 8 GB |
| gguf/gemma-helpdesk-v2-e2b-bf16.gguf | 9.3 GB | Apple Silicon, mid-range GPU — full-precision baseline |

Quick start:

# Install llama.cpp (mac: brew install llama.cpp; pi: build from source)
hf download voidash/gemma-helpdesk-v2-e2b-seed42 \
  gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf --local-dir .

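# Prompt translation: "What should I do if my citizenship certificate is lost?"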
llama-cli -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  -sys "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question." \
  -p "Question: मेरो नागरिकता प्रमाणपत्र हराएमा के गर्ने?" \
  --jinja -n 300 -t 4

The --jinja flag enables Gemma 4's chat template (which is embedded in the GGUF). Threading (-t 4) is tuned to the Pi 5's quad-core CPU; a Mac or PC can use more threads.
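
For a persistent endpoint instead of one-off CLI calls, llama.cpp's llama-server exposes an OpenAI-compatible API with the same model and template flags (a sketch; the port choice is arbitrary):

llama-server -m gguf/gemma-helpdesk-v2-e2b-Q4_K_M.gguf \
  --jinja -t 4 --port 8080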

Path B — HF transformers + PEFT adapter (for Python integration)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import torch.nn as nn

base_id = "google/gemma-4-E2B-it"
adapter = "voidash/gemma-helpdesk-v2-e2b-seed42"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Required: Gemma 4 wraps every Linear in `Gemma4ClippableLinear` (inference
# clipping). PEFT can't inject LoRA into the wrapper, so we unwrap before
# loading the adapter.
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if type(child).__name__ == "Gemma4ClippableLinear":
            inner = getattr(child, "linear", None)
            if isinstance(inner, nn.Linear):
                setattr(parent, name, inner)

model = PeftModel.from_pretrained(model, adapter)
model.eval()
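
A generation sketch from here, assuming the chat template accepts a system turn (the GGUF quick start above passes one via -sys):

messages = [
    {"role": "system", "content": "You are a Nepal-government helpdesk. Answer using ONLY provided gov.np sources, cite each claim with [URL], refuse if no source addresses the question."},
    {"role": "user", "content": "Question: What should I do if my citizenship certificate is lost?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=300)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))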

Citation

  • Base: Gemma 4 (Google)
  • Slices: Saugatkafley/alpaca-nepali-sft, allenai/tulu-3-sft-mixture, FLORES-200 (NLLB)
  • Methods: rsLoRA (Kalajdzievski 2023), reverse-instruction (Köksal 2024 / MURI)