nb-gpt-gemma3-bakery-merges-2603
Mergekit-produced model merges combining the English and Norwegian
Bokmål alignment-prompt LoRA adapters from
NbAiLab/nb-gpt-gemma3-bakery-adapters-2603
into single standalone checkpoints, one per (size, variant, method) cell.
Subfolders
Layout: <size>-<variant>/eng-nob__<method>
| Size | Variant | Methods available |
|---|---|---|
| 1b | std | linear, task_arithmetic, ties |
| 1b | open | linear, task_arithmetic, ties |
| 4b | std | linear, task_arithmetic, ties |
| 4b | open | linear, task_arithmetic, ties |
| 12b | std | ties |
| 12b | open | ties |
Total: 14 merged checkpoints. At 12B only TIES was run, because early evaluation at 4B showed the methods within 0.1 pp of each other.
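The layout above can also be enumerated programmatically, for example to loop over every checkpoint in an evaluation script. This is a minimal sketch: `METHODS_BY_SIZE`, `VARIANTS` and `subfolders` are illustrative names introduced here, not part of the repo.

```python
# Methods available per size, transcribed from the table above.
METHODS_BY_SIZE = {
    "1b": ["linear", "task_arithmetic", "ties"],
    "4b": ["linear", "task_arithmetic", "ties"],
    "12b": ["ties"],
}
VARIANTS = ["std", "open"]


def subfolders():
    """Return every <size>-<variant>/eng-nob__<method> path in the layout."""
    return [
        f"{size}-{variant}/eng-nob__{method}"
        for size, methods in METHODS_BY_SIZE.items()
        for variant in VARIANTS
        for method in methods
    ]


paths = subfolders()
print(len(paths))  # 14
print(paths[0])    # 1b-std/eng-nob__linear
```

Each returned string can be passed as the `subfolder` argument when loading.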
How to load
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NbAiLab/nb-gpt-gemma3-bakery-merges-2603",
    subfolder="4b-std/eng-nob__ties",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(
    "NbAiLab/nb-gpt-gemma3-bakery-merges-2603",
    subfolder="4b-std/eng-nob__ties",
)
Recipe
For each merge, the pipeline runs these steps:
- Bake the English LoRA into the posttrain base via PEFT `merge_and_unload`, save to disk (bf16).
- Bake the Norwegian LoRA into the same posttrain base, save separately.
- Merge the two baked checkpoints with mergekit using the per-method recipe.
- Strip the leading `model.` prefix that `Gemma3ForConditionalGeneration.save_pretrained` adds (only on multimodal-wrapped variants ≥4B); leave 1B alone (no wrap).
- Push the merged result to this repo at the matching subfolder.
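The prefix-stripping step is essentially a key rename over the checkpoint's state dict. The sketch below illustrates the idea on a toy dictionary; the helper name `strip_leading_prefix` and the example key names are hypothetical, and the actual script used for this repo is not shown in this card.

```python
def strip_leading_prefix(state_dict, prefix="model."):
    """Return a new dict with `prefix` removed from keys that start with it."""
    stripped = {}
    for key, value in state_dict.items():
        # Only the leading occurrence is removed; other keys pass through.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in stripped:
            raise ValueError(f"key collision after stripping: {new_key}")
        stripped[new_key] = value
    return stripped


# Toy stand-in for a checkpoint (hypothetical keys); real values are tensors.
toy = {
    "model.language_model.embed_tokens.weight": 0,
    "model.language_model.layers.0.mlp.up_proj.weight": 1,
    "lm_head.weight": 2,
}
for name in sorted(strip_leading_prefix(toy)):
    print(name)
```

Raising on a collision is a deliberate choice here: silently overwriting a tensor would produce a checkpoint that loads but is wrong.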
Method weights (mergekit defaults)
| Method | Recipe |
|---|---|
| linear | weighted average; eng=0.5, nob=0.5 |
| task_arithmetic | base + Σ λᵢ·(taskᵢ − base); λ_eng = λ_nob = 0.5 |
| ties | TIES-MERGING; density=0.5, λ_eng = λ_nob = 0.5 |
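To make the three recipes concrete, here is a toy version operating on short Python lists instead of model tensors. It is a simplified sketch, not mergekit's implementation: in particular, the `ties` function below sums the weighted deltas that agree with the elected sign, while real implementations differ in how (or whether) they normalize that sum. All function names and the example vectors are made up for illustration.

```python
def linear(a, b, w_a=0.5, w_b=0.5):
    """Weighted average of two parameter vectors."""
    return [w_a * x + w_b * y for x, y in zip(a, b)]


def task_arithmetic(base, a, b, lam_a=0.5, lam_b=0.5):
    """base + sum_i lambda_i * (task_i - base)."""
    return [p + lam_a * (x - p) + lam_b * (y - p) for p, x, y in zip(base, a, b)]


def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude, zero the rest."""
    keep = max(1, round(density * len(delta)))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    top = set(order[:keep])
    return [d if i in top else 0.0 for i, d in enumerate(delta)]


def ties(base, a, b, density=0.5, lam_a=0.5, lam_b=0.5):
    """Simplified TIES: trim, elect a sign per parameter, sum agreeing deltas."""
    da = [lam_a * d for d in trim([x - p for x, p in zip(a, base)], density)]
    db = [lam_b * d for d in trim([y - p for y, p in zip(b, base)], density)]
    merged = []
    for p, u, v in zip(base, da, db):
        sign = 1.0 if (u + v) >= 0 else -1.0         # elected sign
        agree = [d for d in (u, v) if d * sign > 0]  # drop conflicting deltas
        merged.append(p + sum(agree))
    return merged


base = [0.0, 0.0, 0.0, 0.0]
eng = [2.0, 0.5, -1.0, 0.0]
nob = [1.0, -2.0, 0.25, 0.5]

print(linear(eng, nob))                 # [1.5, -0.75, -0.375, 0.25]
print(task_arithmetic(base, eng, nob))  # [1.5, -0.75, -0.375, 0.25]
print(ties(base, eng, nob))             # [1.5, -1.0, -0.5, 0.0]
```

In the TIES output, the second and third entries differ from the averaged results because trimming removed the smaller of the two conflicting deltas before they were combined.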
Evaluation
All 14 merges + a 4B-std LoRA-only baseline were scored on a Norwegian-leaning
suite (full numbers in merge_eval_pivot.csv / merge_eval_report.md):
| Task | NB/EN | Few-shot | Metrics |
|---|---|---|---|
| norquad | NB | 0 | exact_match, f1 |
| nrk_quiz_qa | NB | 0 | acc, acc_norm |
| noridiom | NB | 0 | em, em_first, fscore |
| norbelebele | NB | 5 | acc, acc_norm |
| global_mmlu_full_nb_other | NB | 5 | acc |
| global_mmlu_full_en_stem | EN | 0 | acc |
| hh_rlhf_no (selected runs) | NB | 0 | acc, acc_norm (preference) |
Headline rankings
Per-task absolute scores (TIES merges)
| Metric | 1b-std | 1b-open | 4b-std | 4b-open | 12b-std | 12b-open | LoRA-only (4b-std) |
|---|---|---|---|---|---|---|---|
| MMLU-EN STEM | 32.1% | 33.9% | 48.1% | 48.9% | 59.5% | 59.5% | 45.9% |
| MMLU-NB | 35.1% | 36.4% | 54.9% | 55.6% | 68.0% | 70.0% | 55.4% |
| NorBelebele | 36.0% | 36.5% | 69.9% | 66.6% | 82.6% | 81.6% | 68.0% |
| NorQuad F1 | 27.8% | 27.9% | 38.4% | 36.2% | 42.3% | 44.2% | 33.6% |
| NRK QA | 33.0% | 33.1% | 42.7% | 43.9% | 55.5% | 56.3% | 41.3% |
| HH-pref | 51.2%* | — | 51.8% | 52.3% | 50.6% | 50.6% | 51.8% |
Only the 6 selected-subset rows (and the baseline) ran HH-pref; "—" means HH-pref was not run for that cell.
4B method ranking (Δ vs 4b-std LoRA-only baseline)
| # | Method | Mean Δ over (std, open) | std | open |
|---|---|---|---|---|
| 1 | ties | +1.56 pp | +1.80 | +1.33 |
| 2 | task_arithmetic | +1.51 pp | +1.82 | +1.20 |
| 3 | linear | +1.36 pp | +1.59 | +1.13 |
All three merge methods improve over the LoRA-only baseline. TIES leads task_arithmetic by ~0.05 pp, which is within noise, so method choice is largely interchangeable at 4B. Size effects (1B → 4B → 12B) dominate method choice by an order of magnitude.
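The mean column and the ranking can be recomputed from the per-variant deltas in the table above. A small sketch; it works from the rounded table values, so the last digit may differ slightly from means computed on the unrounded scores in the report.

```python
# Per-variant deltas (percentage points) copied from the table above.
deltas = {
    "ties": {"std": 1.80, "open": 1.33},
    "task_arithmetic": {"std": 1.82, "open": 1.20},
    "linear": {"std": 1.59, "open": 1.13},
}

# Mean over the (std, open) variants, then rank from best to worst.
means = {m: (d["std"] + d["open"]) / 2 for m, d in deltas.items()}
ranking = sorted(means, key=means.get, reverse=True)

for method in ranking:
    print(f"{method}: {means[method]:+.3f} pp")

lead = means["ties"] - means["task_arithmetic"]
print(f"TIES lead over task_arithmetic: {lead:.3f} pp")
```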
Adapters used
The bakery adapters are r=64 LoRAs with target_modules=[all-linear],
trained for 1 epoch on a 500-sample slice of NbAiLab/aurora-sft-2603 using
the in-house "bakery" alignment-prompt distillation recipe. The Norwegian system
prompt was corrected on 2026-04-27 (7 grammar/anglicism/mistranslation fixes);
all 9 nob adapters in the source repo were re-baked with the corrected prompt
before being used as merge inputs.
Caveats and known issues
- CPU mergekit fallback. Mergekit's first `.to(cuda)` fails inside the `nbailab/mergekit:olivia-gh200` apptainer image with "device busy" (cgroup init issue). All merges were produced with `--device cpu`; ~80 s/merge for 4B linear, ~6 min for 4B TIES, ~11 min for 12B TIES. Tracked as tech debt for the next container rebuild.
- 27B-open is missing. The `27b-open-posttrain` base is not yet published upstream; only `27b-std` exists. 27B merges are therefore not in this repo.
- Method weights are mergekit defaults. No inner sweep (e.g. linear at 30/50/70 or TIES density 0.3/0.5/0.7) was run; based on the 4B method-vs-method delta of <0.5 pp, the expected information gain from a weight sweep was deemed low compared to the GPU cost.
- `hh_rlhf_no` ran on a subset. Safety / preference accuracy was scored on 6 selected merges (12b-std/open ties, 4b-std/open ties, 1b-std ties) plus the baseline. Results show pref-accuracy hovering around 50-52% across the whole set, i.e. the merges did not measurably regress alignment vs the base + LoRA condition.
- Standalone base baselines pending. As of 2026-04-29, 6 standalone posttrain-base evals are queued on Olivia (SLURM arrays 597645 + 597646) to provide proper per-size baselines for clean delta computation across all sizes; once they land, an updated ablation will replace the current 4B-only one in the report.
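For reference, the weight sweep described in the caveats (not run for this release) could be expressed as a small set of mergekit-style configs. This is a sketch under assumptions: it follows the `models` / `parameters` / `merge_method` / `base_model` / `dtype` layout of mergekit's YAML configs, the three paths are placeholders, and the helper names are invented here. The exact recipes used for this repo are not reproduced in this card.

```python
import json

# Placeholder paths (assumptions, not real locations).
BASE = "path/to/posttrain-base"
ENG = "path/to/baked-eng"
NOB = "path/to/baked-nob"


def linear_config(eng_weight):
    """Linear merge config with eng/nob weights summing to 1."""
    return {
        "merge_method": "linear",
        "models": [
            {"model": ENG, "parameters": {"weight": eng_weight}},
            {"model": NOB, "parameters": {"weight": round(1.0 - eng_weight, 2)}},
        ],
        "dtype": "bfloat16",
    }


def ties_config(density, weight=0.5):
    """TIES merge config with the same density and weight for both models."""
    return {
        "merge_method": "ties",
        "base_model": BASE,
        "models": [
            {"model": m, "parameters": {"density": density, "weight": weight}}
            for m in (ENG, NOB)
        ],
        "dtype": "bfloat16",
    }


sweep = [linear_config(w) for w in (0.3, 0.5, 0.7)]
sweep += [ties_config(d) for d in (0.3, 0.5, 0.7)]

# For these simple structures a JSON dump is also valid YAML,
# so each entry could be written to a config file as-is.
print(json.dumps(sweep[0], indent=2))
```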
Source adapters and bases
- Adapters: NbAiLab/nb-gpt-gemma3-bakery-adapters-2603
- Posttrain bases: NbAiLab/nb-gpt-gemma3-{270m,1b,4b,12b,27b}-instruct-epoch-3-aurora-sft-2603[-open]-posttrain
- Mergekit container: nbailab/mergekit:olivia-gh200 (Docker Hub, arm64)