Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)

A composed Singapore-English ASR model that connects the MERaLiON-3 speech encoder to a BF16 Gemma-4-E4B decoder through a trained projector and a rank-16 speech LoRA.

This BF16 release is the recommended quality-first edition: it keeps the decoder in native bfloat16, avoids quantization artifacts, and improves on the standalone MERaLiON-3 baseline by 9.69 absolute WER points on the MNSC ASR Part 2 test set.

Important: this is a composed MLX bundle, not a vanilla transformers.pipeline checkpoint. Use the elderwise runtime (or equivalent wiring) to connect speech_encoder/, projector/, decoder/, and lora/.

Result summary

Evaluated on MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test (3000 utterance-level clips).

System                               | WER ↓  | Notes
MERaLiON-3 baseline                  | 25.78% | stock MERaLiON-3 encoder + native decoder
8-bit Gemma-4 + MERaLiON speech LoRA | 18.86% | smaller sibling release
This BF16 release                    | 16.09% | best-quality bundle
  • Absolute WER change vs. MERaLiON-3 baseline: −9.69 pp (25.78% to 16.09%)
  • Absolute WER change vs. 8-bit sibling: −2.77 pp (18.86% to 16.09%)
  • Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed from reference and hypothesis.
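
The normalization and scoring described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the release's evaluation code: the speaker-tag pattern (`<Speaker1>:` style) is an assumption, as is scoring each utterance with a word-level edit distance divided by the reference length. How the release aggregates per-utterance scores into the corpus figure is not stated here.

```python
import re
import string

# ASSUMPTION: speaker-prefix tags look like "<Speaker1>:"; the release's
# actual pattern is not documented in this card.
SPEAKER_TAG = re.compile(r"<\s*speaker\s*\d*\s*>\s*:?", re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation


def normalize(text: str) -> str:
    """Drop speaker tags, lowercase, strip ASCII punctuation, collapse whitespace."""
    text = SPEAKER_TAG.sub(" ", text)  # must run before '<' and '>' are stripped
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 or 1.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)
```

Because casing and punctuation are removed on both sides, a hypothesis that differs from its reference only in capitalization, commas, or apostrophes scores 0% under this scheme.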

Example outputs

These are actual model outputs from artifacts/run_3000_bf16_r16mlp/eval_predictions.jsonl, selected from the held-out MNSC ASR Part 2 test set. Each row scores 0% WER under the release normalizer (lowercase, punctuation removed, whitespace collapsed).

# | Reference | Model output | WER
1 | There IS A Food Court Selling Chicken Pasta behind Delmas' House | There is a food court selling Chicken Pasta behind Delma's house. | 0%
2 | what is the distance to The Seletar Mall | What is the distance to The Seletar Mall? | 0%
3 | Number sequence IS S seven six nine Zero four one three A and Date of birth IS thirteen September nineteen seventy seven | Number sequence is S. seven, six, nine, zero, four, one, three, A, and date of birth is thirteen, September, nineteen seventy seven. | 0%
4 | six nine eight four four six eight three five three | Six, nine, eight, four, four, six, eight, three, five, three. | 0%
5 | eight five six four one seven four five | Eight, five, six, four, one, seven, four, five. | 0%
6 | Pita is a Traditional Local Cuisine | Pita is a traditional local cuisine. | 0%
7 | it is faster to take the bus to Jalan Asas | It is faster to take the bus to Jalan Asas. | 0%
8 | a new television show documented the lives of various people including Syed Sheikh Syed Ahmad Al Hadi and Lucien Wang | A new television show documented the lives of various people, including Syed Sheikh Syed Ahmad Al Hadi and Lucien Wang. | 0%
9 | Hiyashi Chuka Takikomi Gohan and Fugu | Hiyashi Chuka Takikomi Gohan and Fugu. | 0%
10 | where can I get cheap food in Kathmandu | Where can I get Cheap Food in Kathmandu? | 0%

Across the full saved evaluation file, 1071 / 3000 utterances scored 0% WER, and another 857 scored ≤20% WER under the same normalizer.
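
A tally like the one above can be reproduced from per-utterance scores with a small bucketing helper. The sketch below assumes the scores are already available as fractions (0.20 for 20%); the function name and bucket labels are hypothetical, and the layout of eval_predictions.jsonl is not assumed.

```python
def bucket_wer(per_utterance_wer, threshold=0.20):
    """Count utterances with WER == 0, 0 < WER <= threshold, and WER > threshold."""
    counts = {"exact": 0, "near": 0, "rest": 0}
    for value in per_utterance_wer:
        if value == 0.0:
            counts["exact"] += 1
        elif value <= threshold:
            counts["near"] += 1
        else:
            counts["rest"] += 1
    return counts
```

The "near" bucket excludes exact matches, which matches the "another 857" wording: the two counts are disjoint and can be summed.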

What is inside

Path            | Contents                                                                  | Precision / format
decoder/        | Gemma-4-E4B instruction decoder, MLX format                               | bfloat16
speech_encoder/ | MERaLiON-3 acoustic encoder + frame adaptor                               | fp16
projector/      | LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm    | fp32
lora/           | rank-16 speech-alignment LoRA adapters + lora_config.json                 | fp32
config.json     | composition manifest                                                      | JSON
PROVENANCE.md   | chain of custody, evaluation, license notes                               | Markdown

The speech path is:

audio -> Whisper-style log-mel -> MERaLiON-3 encoder/adaptor -> 3584-d speech embeddings
      -> projector -> 2560-d Gemma embedding space -> Gemma-4-E4B BF16 + speech LoRA -> text
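
To make the projector stage of this path concrete, here is a shape-level NumPy sketch of LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm. It is not the release's MLX implementation: the weights are random, and the bias terms, epsilon values, and the absence of learned norm scales are all assumptions made for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-6):
    """Scale each frame to unit root-mean-square (no learned scale)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def init_projector(d_in=3584, d_hidden=3072, d_out=2560, seed=0):
    """Random stand-in weights; the real ones ship in projector/."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((d_in, d_hidden), dtype=np.float32) * d_in ** -0.5,
        "b1": np.zeros(d_hidden, dtype=np.float32),  # ASSUMPTION: biased linears
        "w2": rng.standard_normal((d_hidden, d_out), dtype=np.float32) * d_hidden ** -0.5,
        "b2": np.zeros(d_out, dtype=np.float32),
    }


def project(frames, params):
    """Map (num_frames, 3584) speech embeddings to (num_frames, 2560)."""
    h = layer_norm(frames)
    h = silu(h @ params["w1"] + params["b1"])
    h = h @ params["w2"] + params["b2"]
    return rms_norm(h)
```

Each speech frame is projected independently, so the number of frames is unchanged; only the feature width moves from the encoder's 3584 to the decoder's 2560.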

Quickstart

Install or clone the elderwise runtime that wires the components together:

pip install git+https://github.com/ajentik/elderwise-mlx.git
# or: git clone https://github.com/ajentik/elderwise-mlx && pip install -e elderwise-mlx

Then load the composed bundle:

from pathlib import Path

from elderwise.inference import load_pipeline, transcribe_with_pipeline
from huggingface_hub import snapshot_download

bundle = Path(snapshot_download("majentik/gemma-4-e4b-mlx-elderwise-MERaLiON"))

pipeline = load_pipeline(
    meralion_dir=str(bundle / "speech_encoder"),
    gemma_id=str(bundle / "decoder"),
    projector_path=str(bundle / "projector"),
    lora_path=str(bundle / "lora"),
    lora_rank=16,
    lora_target_names=(
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ),
)

text = transcribe_with_pipeline(pipeline, "your_audio.wav", max_tokens=128)
print(text)

Runtime notes:

  • lora_path should point to the directory containing adapters.safetensors (lora/), not to the file itself.
  • The target module list must match the adapter: q/k/v/o/gate/up/down across all 42 decoder layers.
  • Use the prompt "Transcribe the following audio:" unless you intentionally fine-tune or evaluate against a different prompt contract.
  • The speech LoRA is switchable in the runtime: enable speech mode for ASR, disable/scale to 0.0 for plain text generation.
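
The switchable behavior in the last note follows from how a LoRA update is added to a frozen weight. The NumPy sketch below shows the general technique for a single linear layer; it is not the elderwise runtime's code, and treating the scale as a direct multiplier on the low-rank term (rather than an alpha/rank ratio) is an assumption.

```python
import numpy as np


def lora_linear(x, w, a, b, scale):
    """Frozen linear layer plus a scaled low-rank update.

    x: (n, d_in) inputs      w: (d_in, d_out) frozen base weight
    a: (d_in, r) LoRA down   b: (r, d_out) LoRA up, with r the rank (16 here)
    """
    base = x @ w
    update = (x @ a) @ b  # rank-r correction, computed without forming a @ b
    return base + scale * update
```

With `scale=0.0` the layer returns exactly the base decoder output, which is why one set of decoder weights can serve both ASR and plain text generation.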

Intended use

Good fits:

  • Singapore English / Singlish automatic speech recognition
  • utterance-level voice notes, routing, search, and agent input
  • MLX-native speech-language research with a shared text decoder

Not intended for:

  • safety-critical or legal/medical transcription
  • diarization, timestamps, speaker identification, or streaming ASR
  • Mandarin-only ASR; a separate switchable Mandarin LoRA is planned

Limitations

  • The LoRA is specialized for Singapore English; accuracy on other accents and languages may degrade.
  • Residual errors mostly cluster around rare or ambiguous proper nouns, especially code-switched names and places.
  • Long-form audio was not the optimization target; split long recordings into utterance-sized chunks.
  • This repo is a composed bundle. Generic hub inference widgets will not know how to run it without the elderwise runtime.
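
For the long-form limitation, one naive way to split a recording is fixed-length windows with a small overlap, sketched below. This is a stand-in for proper utterance segmentation (for example with a voice activity detector), and the 16 kHz sample rate, 20-second window, and 1-second overlap are illustrative assumptions rather than values from the release.

```python
def chunk_spans(num_samples, sample_rate=16000, max_seconds=20.0, overlap_seconds=1.0):
    """Return (start, end) sample index pairs covering the whole recording."""
    if max_seconds <= overlap_seconds:
        raise ValueError("max_seconds must be larger than overlap_seconds")
    window = int(max_seconds * sample_rate)
    step = window - int(overlap_seconds * sample_rate)
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the final window reached the end of the audio
        start += step
    return spans
```

Each span can be written to its own file and transcribed separately; words that fall inside the overlap may appear in two adjacent transcripts and need de-duplication when the pieces are joined.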

Architecture details

  • Speech encoder output dimension: 3584
  • Projector hidden dimension: 3072
  • Decoder embedding dimension: 2560
  • Decoder depth: 42 layers
  • LoRA rank: 16
  • LoRA targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Speech-mode LoRA scale used by the release runtime: 20.0

Gemma-4's per-layer embedding side channel is handled in the runtime by supplying explicit per-layer inputs for speech positions instead of forcing speech embeddings through token nearest-neighbor recovery.

Provenance and licenses

See PROVENANCE.md for the full chain of custody. Summary:

  • Decoder: google/gemma-4-E4B-it, converted to MLX bfloat16; Gemma Terms of Use apply.
  • Speech tower: MERaLiON/MERaLiON-3-10B; MERaLiON release terms apply.
  • Training data source: MERaLiON/Multitask-National-Speech-Corpus-v1; MNSC terms apply.
  • Projector + LoRA: trained alignment components for this composition; distributed with the same upstream obligations.

Internal optimization recipe and hardware details are intentionally omitted from the public package.

Citation

@misc{gemma4_meralion_bf16_speech_lora_mlx_2026,
  title  = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
  author = {majentik},
  year   = {2026},
  url    = {https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}
}
