nb-gpt-gemma3-bakery-merges-2603

Mergekit-produced model merges combining the English and Norwegian Bokmål alignment-prompt LoRA adapters from NbAiLab/nb-gpt-gemma3-bakery-adapters-2603 into single standalone checkpoints, one per (size, variant, method) cell.

Subfolders

Layout: <size>-<variant>/eng-nob__<method>

Size	Variant	Methods available
1b	std	linear, task_arithmetic, ties
1b	open	linear, task_arithmetic, ties
4b	std	linear, task_arithmetic, ties
4b	open	linear, task_arithmetic, ties
12b	std	ties
12b	open	ties

Total: 14 merged checkpoints. At 12B only TIES was run, because early evaluation at 4B showed the methods scoring within about 0.1 pp of each other.
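As a quick check on the layout and the table above, the subfolder names can be enumerated programmatically. This is a minimal sketch: the names `METHODS_BY_SIZE`, `VARIANTS` and `list_subfolders` are illustrative and not part of the repo.

```python
from typing import Dict, List

REPO_ID = "NbAiLab/nb-gpt-gemma3-bakery-merges-2603"

# Methods available per size, transcribed from the table above.
ALL_METHODS = ["linear", "task_arithmetic", "ties"]
METHODS_BY_SIZE: Dict[str, List[str]] = {
    "1b": ALL_METHODS,
    "4b": ALL_METHODS,
    "12b": ["ties"],
}
VARIANTS = ["std", "open"]


def list_subfolders() -> List[str]:
    """Return every <size>-<variant>/eng-nob__<method> path in the repo layout."""
    return [
        f"{size}-{variant}/eng-nob__{method}"
        for size, methods in METHODS_BY_SIZE.items()
        for variant in VARIANTS
        for method in methods
    ]


subfolders = list_subfolders()
for path in subfolders:
    print(path)
```

Any of the printed paths can be passed as the `subfolder` argument when loading a checkpoint.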

How to load

from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "NbAiLab/nb-gpt-gemma3-bakery-merges-2603"
SUBFOLDER = "4b-std/eng-nob__ties"  # <size>-<variant>/eng-nob__<method>

# Load the merged weights in bf16 from the chosen subfolder.
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    subfolder=SUBFOLDER,
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
# Load the tokenizer from the same subfolder as the weights.
tok = AutoTokenizer.from_pretrained(REPO_ID, subfolder=SUBFOLDER)

Recipe

For each merge, the pipeline runs five steps:

1. Bake the English LoRA into the posttrain base with PEFT merge_and_unload and save the result to disk in bf16.
2. Bake the Norwegian LoRA into the same posttrain base and save it separately.
3. Merge the two baked checkpoints with mergekit, using the per-method recipe.
4. Strip the leading "model." prefix that Gemma3ForConditionalGeneration.save_pretrained adds to the weight names. This applies only to the multimodal-wrapped variants (≥4B); 1B is not wrapped and is left alone.
5. Push the merged result to this repo at the matching subfolder.
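The bake and prefix-strip steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `bake_adapter` and `strip_model_prefix` are hypothetical helper names, `adapter_dir` stands for a local copy of one adapter, and loading the base through `AutoModelForCausalLM` is an assumption (the wrapped ≥4B bases may need a different loader class). The toy key names are illustrative only. `str.removeprefix` requires Python 3.9 or later.

```python
from typing import Dict, TypeVar

T = TypeVar("T")


def bake_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold one LoRA adapter into the base model and save a bf16 checkpoint."""
    # Heavy imports stay local so the pure-Python helper below has no dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    # merge_and_unload folds the LoRA deltas into the base weights and
    # returns a plain transformers model without the PEFT wrappers.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)


def strip_model_prefix(state_dict: Dict[str, T], prefix: str = "model.") -> Dict[str, T]:
    """Drop one leading `prefix` from every key that starts with it."""
    return {key.removeprefix(prefix): value for key, value in state_dict.items()}


# Toy state dict; values are placeholders instead of tensors.
toy = {"model.layers.0.weight": 1, "model.norm.weight": 2, "lm_head.weight": 3}
stripped = strip_model_prefix(toy)
print(stripped)
```

Only the first occurrence of the prefix is removed, so a key that contains "model." again further along keeps that inner segment.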

Method weights (mergekit defaults)

Method	Recipe
`linear`	weighted average; eng=0.5, nob=0.5
`task_arithmetic`	base + Σ λᵢ·(taskᵢ − base); λ_eng = λ_nob = 0.5
`ties`	TIES-MERGING; density=0.5, λ_eng = λ_nob = 0.5
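The recipes above map onto mergekit configurations roughly as sketched here. The paths `ENG`, `NOB` and `BASE` are hypothetical placeholders, `merge_config` is an illustrative helper, and the mapping of λ to mergekit's per-model `weight` parameter is an assumption. The sketch builds a plain dict with the config keys mergekit reads (`merge_method`, `models`, `base_model`, `dtype`); writing it to a YAML file is left out.

```python
from typing import Any, Dict

# Hypothetical local paths to the two baked checkpoints and the posttrain base.
ENG = "baked/4b-std-eng"
NOB = "baked/4b-std-nob"
BASE = "bases/4b-std-posttrain"

SUPPORTED = {"linear", "task_arithmetic", "ties"}


def merge_config(method: str) -> Dict[str, Any]:
    """Build a mergekit-style config dict for one eng+nob merge."""
    if method not in SUPPORTED:
        raise ValueError(f"unsupported method: {method}")
    params: Dict[str, Any] = {"weight": 0.5}
    if method == "ties":
        params["density"] = 0.5  # fraction of each task vector kept after trimming
    config: Dict[str, Any] = {
        "merge_method": method,
        "models": [
            {"model": ENG, "parameters": dict(params)},
            {"model": NOB, "parameters": dict(params)},
        ],
        "dtype": "bfloat16",
    }
    if method != "linear":
        # Task-vector methods subtract the shared base before merging.
        config["base_model"] = BASE
    return config


for name in sorted(SUPPORTED):
    print(name, merge_config(name))
```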

Evaluation

All 14 merges + a 4B-std LoRA-only baseline were scored on a Norwegian-leaning suite (full numbers in merge_eval_pivot.csv / merge_eval_report.md):

Task	NB/EN	Few-shot	Metrics
`norquad`	NB	0	exact_match, f1
`nrk_quiz_qa`	NB	0	acc, acc_norm
`noridiom`	NB	0	em, em_first, fscore
`norbelebele`	NB	5	acc, acc_norm
`global_mmlu_full_nb_other`	NB	5	acc
`global_mmlu_full_en_stem`	EN	0	acc
`hh_rlhf_no` (selected runs)	NB	0	acc, acc_norm (preference)

Headline rankings

Per-task absolute scores (TIES merges)

Metric	1b-std	1b-open	4b-std	4b-open	12b-std	12b-open	LoRA-only (4b-std)
MMLU-EN STEM	32.1%	33.9%	48.1%	48.9%	59.5%	59.5%	45.9%
MMLU-NB	35.1%	36.4%	54.9%	55.6%	68.0%	70.0%	55.4%
NorBelebele	36.0%	36.5%	69.9%	66.6%	82.6%	81.6%	68.0%
NorQuad F1	27.8%	27.9%	38.4%	36.2%	42.3%	44.2%	33.6%
NRK QA	33.0%	33.1%	42.7%	43.9%	55.5%	56.3%	41.3%
HH-pref	51.2%*	—	51.8%	52.3%	50.6%	50.6%	51.8%

Only the selected subset (five TIES merges plus the baseline, six runs in total) ran HH-pref. "—" marks a configuration that was not run.
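To read the table as gains over the baseline, the per-task difference between the 4b-std TIES merge and the 4b-std LoRA-only run can be computed directly from the two columns. This is a small sketch using only the rounded values shown above. The table lists one metric per task, and how the report aggregates metrics is not stated here, so these differences are not expected to reproduce the aggregate deltas in the method ranking.

```python
# Scores in percent, copied from the table above.
TIES_4B_STD = {
    "MMLU-EN STEM": 48.1,
    "MMLU-NB": 54.9,
    "NorBelebele": 69.9,
    "NorQuad F1": 38.4,
    "NRK QA": 42.7,
    "HH-pref": 51.8,
}
LORA_ONLY_4B_STD = {
    "MMLU-EN STEM": 45.9,
    "MMLU-NB": 55.4,
    "NorBelebele": 68.0,
    "NorQuad F1": 33.6,
    "NRK QA": 41.3,
    "HH-pref": 51.8,
}

# Difference in percentage points, rounded to the table's precision.
deltas = {
    task: round(TIES_4B_STD[task] - LORA_ONLY_4B_STD[task], 1)
    for task in TIES_4B_STD
}

for task, delta in deltas.items():
    print(f"{task:13s} {delta:+.1f} pp")
```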

4B method ranking (Δ vs 4b-std LoRA-only baseline)

#	Method	Mean Δ over (std, open)	std	open
1	`ties`	+1.56 pp	+1.80	+1.33
2	`task_arithmetic`	+1.51 pp	+1.82	+1.20
3	`linear`	+1.36 pp	+1.59	+1.13

All three merge methods improve over the LoRA-attached baseline. TIES leads task_arithmetic by about 0.05 pp, which is within noise, so method choice is largely interchangeable at 4B. Size effects (1B → 4B → 12B) dominate method choice by an order of magnitude.
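The ranking can be rechecked from the std and open columns. The sketch below recomputes each mean and the spread between the best and worst method from the rounded table values, so the last digit can differ slightly from the published means, which were presumably computed from unrounded inputs.

```python
from statistics import mean

# Delta vs the 4b-std LoRA-only baseline, in percentage points, from the table above.
DELTA_PP = {
    "ties": {"std": 1.80, "open": 1.33},
    "task_arithmetic": {"std": 1.82, "open": 1.20},
    "linear": {"std": 1.59, "open": 1.13},
}

# Mean over the two variants, then rank methods from best to worst.
mean_delta = {method: mean(cols.values()) for method, cols in DELTA_PP.items()}
ranking = sorted(mean_delta, key=mean_delta.get, reverse=True)
spread = max(mean_delta.values()) - min(mean_delta.values())

for method in ranking:
    print(f"{method:16s} {mean_delta[method]:+.3f} pp")
print(f"spread between best and worst method: {spread:.3f} pp")
```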

Adapters used

The bakery adapters are r=64 LoRAs with target_modules="all-linear", trained for 1 epoch on a 500-sample slice of NbAiLab/aurora-sft-2603 using the in-house "bakery" alignment-prompt distillation recipe. The Norwegian system prompt was corrected on 2026-04-27 (7 grammar, anglicism and mistranslation fixes), and all 9 nob adapters in the source repo were re-baked with the corrected prompt before being used as merge inputs.

Caveats and known issues

CPU mergekit fallback. Mergekit's first .to(cuda) fails inside the nbailab/mergekit:olivia-gh200 apptainer image with "device busy" (cgroup init issue). All merges were produced with --device cpu; ~80s/merge for 4B linear, ~6 min for 4B TIES, ~11 min for 12B TIES. Tracked as tech debt for the next container rebuild.
27B-open is missing. The 27b-open-posttrain base is not yet published upstream; only 27b-std exists. 27B merges are therefore not in this repo.
Method weights are mergekit defaults. No inner sweep (e.g. linear at 30/50/70 or TIES density 0.3/0.5/0.7) was run; based on the 4B method-vs-method delta of <0.5 pp, the expected information gain from a weight sweep was deemed low compared to the GPU cost.
hh_rlhf_no ran on a subset. Safety / preference accuracy was scored on a selected subset of merges (12b-std/open ties, 4b-std/open ties, 1b-std ties) plus the baseline, six runs in total. Preference accuracy stays at around 50-52% across the whole set, i.e. the merges did not measurably regress alignment vs the base + LoRA condition.
Standalone base baselines pending. As of 2026-04-29, 6 standalone posttrain-base evals are queued on Olivia (SLURM arrays 597645 + 597646) to provide proper per-size baselines for clean delta computation across all sizes; once they land, an updated ablation will replace the current 4B-only one in the report.

Source adapters and bases

Adapters: NbAiLab/nb-gpt-gemma3-bakery-adapters-2603
Posttrain bases: NbAiLab/nb-gpt-gemma3-{270m,1b,4b,12b,27b}-instruct-epoch-3-aurora-sft-2603[-open]-posttrain
Mergekit container: nbailab/mergekit:olivia-gh200 (Docker Hub, arm64)

