nb-gpt-gemma3-bakery-merges-2603

Mergekit-produced model merges combining the English and Norwegian Bokmål alignment-prompt LoRA adapters from NbAiLab/nb-gpt-gemma3-bakery-adapters-2603 into single standalone checkpoints, one per (size, variant, method) cell.

Subfolders

Layout: <size>-<variant>/eng-nob__<method>

| Size | Variant | Methods available |
|------|---------|-------------------|
| 1b | std | linear, task_arithmetic, ties |
| 1b | open | linear, task_arithmetic, ties |
| 4b | std | linear, task_arithmetic, ties |
| 4b | open | linear, task_arithmetic, ties |
| 12b | std | ties |
| 12b | open | ties |

Total: 14 merged checkpoints. At 12B only TIES was run, because an early eval signal at 4B showed the three methods within 0.1 pp of each other.
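The layout and availability table can be captured in a few lines of Python. This is a minimal sketch: the helper names `subfolder` and `all_subfolders` are hypothetical and not part of the repo; only the path pattern and the availability matrix come from the table above.

```python
# Availability matrix copied from the table: (size, variant) -> merge methods.
ALL_METHODS = ("linear", "task_arithmetic", "ties")
AVAILABLE = {
    ("1b", "std"): ALL_METHODS,
    ("1b", "open"): ALL_METHODS,
    ("4b", "std"): ALL_METHODS,
    ("4b", "open"): ALL_METHODS,
    ("12b", "std"): ("ties",),
    ("12b", "open"): ("ties",),
}


def subfolder(size: str, variant: str, method: str) -> str:
    """Return the repo subfolder for one merge, or raise if it was not produced."""
    methods = AVAILABLE.get((size, variant))
    if methods is None or method not in methods:
        raise ValueError(
            f"no merge for size={size!r}, variant={variant!r}, method={method!r}"
        )
    # Layout: <size>-<variant>/eng-nob__<method>
    return f"{size}-{variant}/eng-nob__{method}"


def all_subfolders() -> list:
    """List every subfolder implied by the availability matrix."""
    return [subfolder(s, v, m) for (s, v), methods in AVAILABLE.items() for m in methods]
```

Calling `subfolder("4b", "std", "ties")` yields the path used in the loading example, and `all_subfolders()` enumerates the 14 cells.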

How to load

from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "NbAiLab/nb-gpt-gemma3-bakery-merges-2603"
SUBFOLDER = "4b-std/eng-nob__ties"  # <size>-<variant>/eng-nob__<method>

# Load one merged checkpoint from its subfolder in bf16.
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    subfolder=SUBFOLDER,
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
# Load the tokenizer from the same subfolder.
tok = AutoTokenizer.from_pretrained(REPO, subfolder=SUBFOLDER)

Recipe

For each merge, the pipeline:

  1. Bake the English LoRA into the posttrain base via PEFT merge_and_unload, save to disk (bf16).
  2. Bake the Norwegian LoRA into the same posttrain base, save separately.
  3. Merge the two baked checkpoints with mergekit using the per-method recipe.
  4. Strip the leading "model." prefix that Gemma3ForConditionalGeneration.save_pretrained adds (only on the multimodal-wrapped variants, 4B and larger); leave 1B untouched, since it is not wrapped.
  5. Push the merged result to this repo at the matching subfolder.
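Steps 3 and 4 can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's actual code: `build_merge_config` and `strip_model_prefix` are hypothetical helpers, the checkpoint paths are placeholders, and the tensor names in the usage note are made up. The dictionary mirrors the general shape of a mergekit YAML config (`merge_method`, `base_model`, `models`, per-model `parameters`, `dtype`).

```python
def build_merge_config(method, base, eng, nob, weight=0.5, density=0.5):
    """Build a mergekit-style config dict for one English + Norwegian merge."""
    if method not in ("linear", "task_arithmetic", "ties"):
        raise ValueError(f"unsupported method: {method!r}")
    params = {"weight": weight}
    if method == "ties":
        params["density"] = density  # fraction of each task vector to keep
    config = {
        "merge_method": method,
        "models": [
            {"model": eng, "parameters": dict(params)},
            {"model": nob, "parameters": dict(params)},
        ],
        "dtype": "bfloat16",
    }
    # Task-vector methods need the shared base; a plain average does not.
    if method != "linear":
        config["base_model"] = base
    return config


def strip_model_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with one leading `prefix` removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

For example, `strip_model_prefix({"model.layers.0.w": 1, "lm_head.weight": 2})` returns `{"layers.0.w": 1, "lm_head.weight": 2}`; keys without the prefix pass through unchanged.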

Method weights (mergekit defaults)

| Method | Recipe |
|--------|--------|
| linear | weighted average; eng=0.5, nob=0.5 |
| task_arithmetic | base + Σ λᵢ·(taskᵢ − base); λ_eng = λ_nob = 0.5 |
| ties | TIES-MERGING; density=0.5, λ_eng = λ_nob = 0.5 |
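To make the three recipes concrete, here is a toy sketch on four-element parameter vectors in pure Python. It is a simplified illustration, not mergekit's implementation: the function names and the example vectors are made up, and the TIES variant shown (magnitude trimming, sign election by weighted sum, weight-normalized merge of agreeing entries) is one common reading of the method and assumes positive weights.

```python
def linear(models, weights):
    """Weighted average of parameter vectors."""
    total = sum(weights)
    size = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, weights)) / total for i in range(size)]


def task_arithmetic(base, models, lambdas):
    """base + sum_i lambda_i * (task_i - base)."""
    return [
        b + sum(lam * (m[i] - b) for m, lam in zip(models, lambdas))
        for i, b in enumerate(base)
    ]


def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude, zero the rest."""
    keep = max(1, int(density * len(delta)))
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(ranked[:keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]


def ties(base, models, lambdas, density):
    """Trim each task vector, elect a sign per entry, merge the agreeing entries."""
    deltas = [trim([m[i] - b for i, b in enumerate(base)], density) for m in models]
    merged = []
    for i, b in enumerate(base):
        weighted = [(lam * d[i], lam) for d, lam in zip(deltas, lambdas)]
        sign = 1.0 if sum(w for w, _ in weighted) >= 0 else -1.0
        agreeing = [(w, lam) for w, lam in weighted if w * sign > 0]
        norm = sum(lam for _, lam in agreeing)
        merged.append(b + (sum(w for w, _ in agreeing) / norm if norm else 0.0))
    return merged


# Made-up toy vectors standing in for the base and the two baked checkpoints.
BASE = [1.0, 1.0, 1.0, 1.0]
ENG = [2.0, 1.2, 0.4, 1.0]
NOB = [1.8, 0.6, 1.1, 1.5]

LINEAR = linear([ENG, NOB], [0.5, 0.5])
TASK_ARITHMETIC = task_arithmetic(BASE, [ENG, NOB], [0.5, 0.5])
TIES = ties(BASE, [ENG, NOB], [0.5, 0.5], density=0.5)
```

On this input the linear and task_arithmetic formulas give the same vector, because the two weights sum to 1. TIES differs where trimming drops a small update or where only one model contributes to an entry.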

Evaluation

All 14 merges + a 4B-std LoRA-only baseline were scored on a Norwegian-leaning suite (full numbers in merge_eval_pivot.csv / merge_eval_report.md):

| Task | Language | Few-shot | Metrics |
|------|----------|----------|---------|
| norquad | NB | 0 | exact_match, f1 |
| nrk_quiz_qa | NB | 0 | acc, acc_norm |
| noridiom | NB | 0 | em, em_first, fscore |
| norbelebele | NB | 5 | acc, acc_norm |
| global_mmlu_full_nb_other | NB | 5 | acc |
| global_mmlu_full_en_stem | EN | 0 | acc |
| hh_rlhf_no (selected runs) | NB | 0 | acc, acc_norm (preference) |

Headline rankings

Per-task absolute scores (TIES merges)

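| Metric | 1b-std | 1b-open | 4b-std | 4b-open | 12b-std | 12b-open | LoRA-only (4b-std) |
|--------|--------|---------|--------|---------|---------|----------|--------------------|
| MMLU-EN STEM | 32.1% | 33.9% | 48.1% | 48.9% | 59.5% | 59.5% | 45.9% |
| MMLU-NB | 35.1% | 36.4% | 54.9% | 55.6% | 68.0% | 70.0% | 55.4% |
| NorBelebele | 36.0% | 36.5% | 69.9% | 66.6% | 82.6% | 81.6% | 68.0% |
| NorQuad F1 | 27.8% | 27.9% | 38.4% | 36.2% | 42.3% | 44.2% | 33.6% |
| NRK QA | 33.0% | 33.1% | 42.7% | 43.9% | 55.5% | 56.3% | 41.3% |
| HH-pref | 51.2%* | — | 51.8% | 52.3% | 50.6% | 50.6% | 51.8% |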

HH-pref was run only on the selected subset of merges and on the baseline. "—" marks a configuration that was not run.
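As a quick reading aid, the per-task gap between the 4b-std TIES merge and the 4b-std LoRA-only baseline can be recomputed from the table. This is a minimal sketch using only the headline metrics shown above; the variable and function names are illustrative, and because the full evaluation covers more metrics than these five, the values need not average to the per-method Δ reported elsewhere.

```python
# Scores in percent, copied from the 4b-std and LoRA-only columns of the table.
TIES_4B_STD = {
    "MMLU-EN STEM": 48.1,
    "MMLU-NB": 54.9,
    "NorBelebele": 69.9,
    "NorQuad F1": 38.4,
    "NRK QA": 42.7,
}
LORA_ONLY_4B_STD = {
    "MMLU-EN STEM": 45.9,
    "MMLU-NB": 55.4,
    "NorBelebele": 68.0,
    "NorQuad F1": 33.6,
    "NRK QA": 41.3,
}


def deltas_pp(merged, baseline):
    """Per-task difference in percentage points, rounded to one decimal."""
    return {task: round(merged[task] - baseline[task], 1) for task in merged}


DELTAS_4B_STD = deltas_pp(TIES_4B_STD, LORA_ONLY_4B_STD)
```

On these numbers the merge is ahead on four of the five metrics (largest gap on NorQuad F1) and 0.5 pp behind on MMLU-NB.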

4B method ranking (Δ vs 4b-std LoRA-only baseline)

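| Rank | Method | Mean Δ over (std, open) | Δ std (pp) | Δ open (pp) |
|------|--------|-------------------------|------------|-------------|
| 1 | ties | +1.56 pp | +1.80 | +1.33 |
| 2 | task_arithmetic | +1.51 pp | +1.82 | +1.20 |
| 3 | linear | +1.36 pp | +1.59 | +1.13 |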

All three merge methods improve over the LoRA-attached baseline. TIES leads task_arithmetic by ~0.05 pp, which is within noise, so method choice is largely interchangeable at 4B. Size effects (1B → 4B → 12B) dominate method choice by an order of magnitude.
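The ranking follows directly from averaging the std and open deltas. A minimal sketch, using only the rounded values from the table (names are illustrative); recomputing from rounded inputs can differ from the reported mean in the last digit.

```python
# (Δ std, Δ open) in percentage points, copied from the ranking table.
DELTAS = {
    "ties": (1.80, 1.33),
    "task_arithmetic": (1.82, 1.20),
    "linear": (1.59, 1.13),
}


def mean_delta(method):
    """Average of the std and open deltas for one merge method."""
    std, open_ = DELTAS[method]
    return (std + open_) / 2


# Methods sorted from largest to smallest mean improvement.
RANKING = sorted(DELTAS, key=mean_delta, reverse=True)
GAP_TIES_VS_TA = mean_delta("ties") - mean_delta("task_arithmetic")
```

`RANKING` reproduces the order ties, task_arithmetic, linear, and `GAP_TIES_VS_TA` is about 0.05 pp.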

Adapters used

The bakery adapters are r=64 LoRAs with target_modules=[all-linear], trained for 1 epoch on a 500-sample slice of NbAiLab/aurora-sft-2603 using the in-house "bakery" alignment-prompt distillation recipe. The Norwegian system prompt was corrected on 2026-04-27 (7 grammar/anglicism/mistranslation fixes); all 9 nob adapters in the source repo were re-baked with the corrected prompt before being used as merge inputs.
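For intuition about what baking a LoRA into a base weight means, the sketch below folds a low-rank update into a dense matrix using the standard LoRA formula W' = W + (alpha / r) · B · A. It is a toy illustration in pure Python with a rank-1 adapter: the helper names and matrices are made up, and the adapters' lora_alpha is not stated here, so the value used is purely an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bake_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the dense weight with the adapter folded in."""
    scale = alpha / rank
    update = matmul(lora_b, lora_a)  # shape (out, in), same as `weight`
    return [
        [w + scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]


# Toy 2x2 weight with a rank-1 adapter; alpha=2 is an assumed value.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (rank, in)
B = [[0.5], [-1.0]]    # shape (out, rank)
BAKED = bake_lora(W, A, B, alpha=2.0, rank=1)
```

After baking, the adapter matrices are no longer needed at inference time, which is what allows two baked checkpoints to be merged as ordinary full models.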

Caveats and known issues

  • CPU mergekit fallback. Mergekit's first .to(cuda) fails inside the nbailab/mergekit:olivia-gh200 apptainer image with "device busy" (cgroup init issue). All merges were produced with --device cpu; ~80s/merge for 4B linear, ~6 min for 4B TIES, ~11 min for 12B TIES. Tracked as tech debt for the next container rebuild.
  • 27B-open is missing. The 27b-open-posttrain base is not yet published upstream; only 27b-std exists. 27B merges are therefore not in this repo.
  • Method weights are mergekit defaults. No inner sweep (e.g. linear at 30/50/70 or TIES density 0.3/0.5/0.7) was run; based on the 4B method-vs-method delta of <0.5 pp, the expected information gain from a weight sweep was deemed low compared to the GPU cost.
  • hh_rlhf_no ran on a subset. Safety / preference accuracy was scored on a selected subset of merges (12b-std/open ties, 4b-std/open ties, 1b-std ties) plus the baseline. Preference accuracy stays at roughly 50-52% across the whole set, i.e. the merges did not measurably regress alignment relative to the base + LoRA condition.
  • Standalone base baselines pending. As of 2026-04-29, 6 standalone posttrain-base evals are queued on Olivia (SLURM arrays 597645 + 597646) to provide proper per-size baselines for clean delta computation across all sizes; once they land, an updated ablation will replace the current 4B-only one in the report.

Source adapters and bases

  • Adapters: NbAiLab/nb-gpt-gemma3-bakery-adapters-2603
  • Posttrain bases: NbAiLab/nb-gpt-gemma3-{270m,1b,4b,12b,27b}-instruct-epoch-3-aurora-sft-2603[-open]-posttrain
  • Mergekit container: nbailab/mergekit:olivia-gh200 (Docker Hub, arm64)