Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)

A composed Singapore-English ASR model that connects the MERaLiON-3 speech encoder to a BF16 Gemma-4-E4B decoder through a trained projector and rank-16 speech LoRA.

This BF16 release is the recommended quality-first edition: it keeps the decoder in native bfloat16, avoids quantization artifacts, and improves on the standalone MERaLiON-3 baseline by 9.69 absolute WER points (25.78% to 16.09%) on the MNSC ASR Part 2 test set.

Important: this is a composed MLX bundle, not a vanilla transformers.pipeline checkpoint. Use the elderwise runtime (or equivalent wiring) to connect speech_encoder/, projector/, decoder/, and lora/.

Result summary

Evaluated on MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test (3000 utterance-level clips).

System	WER ↓	Notes
MERaLiON-3 baseline	25.78%	stock MERaLiON-3 encoder + native decoder
8-bit Gemma-4 + MERaLiON speech LoRA	18.86%	smaller sibling release
This BF16 release	16.09%	best-quality bundle

Absolute improvement vs. MERaLiON-3 baseline: −9.69pp
Absolute improvement vs. 8-bit sibling: −2.77pp
Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed from reference and hypothesis.
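
The normalization and scoring described above can be sketched as follows. This is a minimal standard-library illustration, not the release's actual scorer: the speaker-tag pattern (prefixes shaped like `<Speaker1>:`) and the helper names `normalize` and `wer` are assumptions, and the function scores a single utterance, whereas the headline figure is computed over the whole test set.

```python
import re
import string

# Assumption: speaker prefixes look like "<Speaker1>:". The exact pattern
# used by the release normalizer is not documented in this card.
_SPEAKER_TAG = re.compile(r"<\s*speaker\s*\d*\s*>\s*:?", re.IGNORECASE)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Drop speaker tags, lowercase, strip ASCII punctuation, collapse spaces."""
    text = _SPEAKER_TAG.sub(" ", text)  # must run before '<' and '>' are stripped
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate of one utterance: word-level edit distance / reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Under this sketch, a pair that differs only in casing and punctuation, such as `what is the distance to The Seletar Mall` against `What is the distance to The Seletar Mall?`, scores 0.0.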

Example outputs

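These are actual model outputs from artifacts/run_3000_bf16_r16mlp/eval_predictions.jsonl, selected from the held-out MNSC ASR Part 2 test set. Each row scores 0% WER under the release normalizer described above (lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed), so differences in casing and punctuation between reference and output do not count as errors.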

#	Reference	Model output	WER
1	There IS A Food Court Selling Chicken Pasta behind Delmas' House	There is a food court selling Chicken Pasta behind Delma's house.	0%
2	what is the distance to The Seletar Mall	What is the distance to The Seletar Mall?	0%
3	Number sequence IS S seven six nine Zero four one three A and Date of birth IS thirteen September nineteen seventy seven	Number sequence is S. seven, six, nine, zero, four, one, three, A, and date of birth is thirteen, September, nineteen seventy seven.	0%
4	six nine eight four four six eight three five three	Six, nine, eight, four, four, six, eight, three, five, three.	0%
5	eight five six four one seven four five	Eight, five, six, four, one, seven, four, five.	0%
6	Pita is a Traditional Local Cuisine	Pita is a traditional local cuisine.	0%
7	it is faster to take the bus to Jalan Asas	It is faster to take the bus to Jalan Asas.	0%
8	a new television show documented the lives of various people including Syed Sheikh Syed Ahmad Al Hadi and Lucien Wang	A new television show documented the lives of various people, including Syed Sheikh Syed Ahmad Al Hadi and Lucien Wang.	0%
9	Hiyashi Chuka Takikomi Gohan and Fugu	Hiyashi Chuka Takikomi Gohan and Fugu.	0%
10	where can I get cheap food in Kathmandu	Where can I get Cheap Food in Kathmandu?	0%

Across the full saved evaluation file, 1071 of 3000 utterances (35.7%) scored 0% WER, and a further 857 scored above 0% but no more than 20%, so 1928 of 3000 (64.3%) were at or below 20% WER under the same normalizer.
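
As a rough illustration of how such bucket counts can be derived, the sketch below tallies a list of per-utterance WER values. The function name `summarize` and the in-memory list are assumptions for illustration; the field layout of eval_predictions.jsonl is not documented here, so loading the file is left out.

```python
def summarize(wers, threshold=0.20):
    """Count error-free utterances and those above 0 but within the threshold."""
    total = len(wers)
    exact = sum(1 for w in wers if w == 0.0)
    near = sum(1 for w in wers if 0.0 < w <= threshold)
    return {
        "exact": exact,
        "near": near,
        "exact_share": exact / total,
        "within_share": (exact + near) / total,
    }


# Toy input, not release data: one perfect clip, two within 20%, one worse.
stats = summarize([0.0, 0.1, 0.2, 0.5])
print(stats["exact"], stats["near"], stats["within_share"])
```

Applied to the reported counts, the same shares are 1071 / 3000 for error-free clips and (1071 + 857) / 3000 for clips at or below 20% WER.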

What is inside

Path	Contents	Precision
`decoder/`	Gemma-4-E4B instruction decoder, MLX format	bfloat16
`speech_encoder/`	MERaLiON-3 acoustic encoder + frame adaptor	fp16
`projector/`	`LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm`	fp32
`lora/`	rank-16 speech-alignment LoRA adapters + `lora_config.json`	fp32
`config.json`	composition manifest	JSON
`PROVENANCE.md`	chain of custody, evaluation, license notes	Markdown

The speech path is:

audio -> Whisper-style log-mel -> MERaLiON-3 encoder/adaptor -> 3584-d speech embeddings
      -> projector -> 2560-d Gemma embedding space -> Gemma-4-E4B BF16 + speech LoRA -> text
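
To make the projector stage of this path concrete, here is a dependency-free Python sketch of the layer order `LayerNorm -> Linear -> SiLU -> Linear -> RMSNorm` applied to a single speech frame. It is illustrative only: the release runs in MLX with trained weights, whereas this sketch uses plain lists, random placeholder weights, and toy dimensions (8 -> 6 -> 4 instead of 3584 -> 3072 -> 2560), and it omits the learnable scale and shift parameters of the norm layers. All function names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalization (no learnable scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Divide by the root mean square (no learnable scale)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def silu(x):
    """SiLU activation: v * sigmoid(v)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def linear(x, weight, bias):
    """Affine map; weight is a list of out_dim rows, each of length in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_linear(in_dim, out_dim, rng):
    """Placeholder weights; the real projector ships trained parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(frame, w1, b1, w2, b2):
    """Map one encoder frame into the decoder embedding space."""
    hidden = silu(linear(layer_norm(frame), w1, b1))
    return rms_norm(linear(hidden, w2, b2))


rng = random.Random(0)
w1, b1 = random_linear(8, 6, rng)   # stands in for Linear(3584, 3072)
w2, b2 = random_linear(6, 4, rng)   # stands in for Linear(3072, 2560)
frame = [rng.gauss(0.0, 1.0) for _ in range(8)]
print(len(project(frame, w1, b1, w2, b2)))  # 4, the toy output width
```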

Quickstart

Install or clone the elderwise runtime that wires the components together:

pip install git+https://github.com/ajentik/elderwise-mlx.git
# or: git clone https://github.com/ajentik/elderwise-mlx && pip install -e elderwise-mlx

Then load the composed bundle:

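from pathlib import Path

from elderwise.inference import load_pipeline, transcribe_with_pipeline
from huggingface_hub import snapshot_download

# Download the composed bundle from the Hugging Face Hub (cached after the first call).
bundle = Path(snapshot_download("majentik/gemma-4-e4b-mlx-elderwise-MERaLiON"))

# Wire the four components together; each path is a subdirectory of the bundle.
pipeline = load_pipeline(
    meralion_dir=str(bundle / "speech_encoder"),
    gemma_id=str(bundle / "decoder"),
    projector_path=str(bundle / "projector"),
    lora_path=str(bundle / "lora"),  # the directory, not adapters.safetensors itself
    lora_rank=16,
    # Must match the modules the adapter was trained on.
    lora_target_names=(
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ),
)

# Transcribe one utterance-sized clip.
text = transcribe_with_pipeline(pipeline, "your_audio.wav", max_tokens=128)
print(text)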

Runtime notes:

lora_path should point to the directory containing adapters.safetensors (lora/), not to the file itself.
The target module list must match the adapter: q/k/v/o/gate/up/down across all 42 decoder layers.
Use the prompt `Transcribe the following audio:` unless you intentionally fine-tune or evaluate with a different prompt contract.
The speech LoRA is switchable in the runtime: enable speech mode for ASR, and disable it or set its scale to 0.0 for plain text generation.
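
Before loading, it can help to confirm that the downloaded bundle has the layout these notes assume. The following is a small sketch using only the standard library; the helper name `missing_bundle_entries` is hypothetical, and the checked paths are the ones listed in this card (the four component directories, `config.json`, and the two files in `lora/`).

```python
from pathlib import Path

# Entries named in this card; other files inside each directory are not checked.
REQUIRED_DIRS = ("speech_encoder", "projector", "decoder", "lora")
REQUIRED_FILES = ("config.json", "lora/adapters.safetensors", "lora/lora_config.json")


def missing_bundle_entries(bundle_dir):
    """Return the expected directories and files that are absent from the bundle."""
    root = Path(bundle_dir)
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```

An empty list means the layout matches; otherwise the returned names indicate what to re-download or where `lora_path` may be pointing at the wrong place.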

Intended use

Good fits:

Singapore English / Singlish automatic speech recognition
utterance-level voice notes, routing, search, and agent input
MLX-native speech-language research with a shared text decoder

Not intended for:

safety-critical or legal/medical transcription
diarization, timestamps, speaker identification, or streaming ASR
Mandarin-only ASR; a separate switchable Mandarin LoRA is planned

Limitations

The LoRA is specialized for Singapore English. Accuracy may degrade on other accents and languages.
Residual errors mostly cluster around rare or ambiguous proper nouns, especially code-switched names and places.
Long-form audio was not the optimization target; split long recordings into utterance-sized chunks.
This repo is a composed bundle. Generic hub inference widgets will not know how to run it without the elderwise runtime.
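
One simple way to split a long recording into shorter pieces is fixed-length chunking of a WAV file, sketched below with Python's standard `wave` module. This is an assumption-laden illustration: the 20-second default is arbitrary, the function names are hypothetical, and cutting at fixed offsets can split words, so splitting at silences would likely transcribe better.

```python
import wave
from pathlib import Path


def chunk_spans(total_frames, sample_rate, max_seconds=20.0):
    """Return contiguous (start, end) frame spans of at most max_seconds each."""
    frames_per_chunk = max(1, int(sample_rate * max_seconds))
    return [
        (start, min(start + frames_per_chunk, total_frames))
        for start in range(0, total_frames, frames_per_chunk)
    ]


def split_wav(path, out_dir, max_seconds=20.0):
    """Write fixed-length chunks of a WAV file and return their paths in order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        spans = chunk_spans(src.getnframes(), src.getframerate(), max_seconds)
        for index, (start, end) in enumerate(spans):
            frames = src.readframes(end - start)  # spans are contiguous, so read in order
            chunk_path = out_dir / f"{Path(path).stem}_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                dst.setparams(params)   # frame count in the header is fixed on close
                dst.writeframes(frames)
            written.append(chunk_path)
    return written
```

Each returned path can then be transcribed separately and the texts joined in order.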

Architecture details

Speech encoder output dimension: 3584
Projector hidden dimension: 3072
Decoder embedding dimension: 2560
Decoder depth: 42 layers
LoRA rank: 16
LoRA targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Speech-mode LoRA scale used by the release runtime: 20.0
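
The role of the LoRA scale can be illustrated with a tiny pure-Python linear layer. This sketch assumes the common convention in which the scale multiplies the low-rank update directly, so that the output is the base projection plus `scale` times the adapter's contribution; whether the elderwise runtime applies the scale in exactly this way is an assumption, and the 2-dimensional, rank-1 numbers are toy values.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, weight, lora_a, lora_b, scale):
    """Base projection plus a scaled low-rank update: W x + scale * B (A x)."""
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))  # down-project to rank, then up-project
    return [b + scale * u for b, u in zip(base, update)]


weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for clarity)
lora_a = [[1.0, 1.0]]               # rank 1: maps 2 inputs to 1 value
lora_b = [[0.5], [-0.5]]            # maps the rank-1 value back to 2 outputs
x = [1.0, 2.0]

print(lora_linear(x, weight, lora_a, lora_b, scale=0.0))   # [1.0, 2.0], plain base layer
print(lora_linear(x, weight, lora_a, lora_b, scale=20.0))  # [31.0, -28.0], adapter active
```

With the scale at 0.0 the adapter has no effect, which matches the card's description of disabling speech mode for plain text generation.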

Gemma-4's per-layer embedding side channel is handled in the runtime by supplying explicit per-layer inputs for speech positions instead of forcing speech embeddings through token nearest-neighbor recovery.

Provenance and licenses

See PROVENANCE.md for the full chain of custody. Summary:

Decoder: google/gemma-4-E4B-it, converted to MLX bfloat16; Gemma Terms of Use apply.
Speech tower: MERaLiON/MERaLiON-3-10B; MERaLiON release terms apply.
Training data source: MERaLiON/Multitask-National-Speech-Corpus-v1; MNSC terms apply.
Projector + LoRA: trained alignment components for this composition; distributed with the same upstream obligations.

Internal optimization recipe and hardware details are intentionally omitted from the public package.

Citation

@misc{gemma4_meralion_bf16_speech_lora_mlx_2026,
  title  = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
  author = {majentik},
  year   = {2026},
  url    = {https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}
}

Related releases

8-bit sibling: majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX — smaller, 18.86% WER.
This BF16 edition is the recommended release for best transcription quality.
