ModernBERT-TR

A Modern Encoder Foundation Model for Turkish

Besher Alkurdi, Himmet Toprak Kesgin, Muzaffer Kaan Yuce, Mehmet Fatih Amasyali

Web Page · Paper (soon) · Training Code · Evaluation Code

Overview

ModernBERT-TR is a 150M-parameter Turkish encoder pretrained from scratch on 144.4B tokens using the ModernBERT architecture. It uses a custom 50K WordPiece tokenizer optimized for Turkish morphology.

Architecture: 22 layers, 768 hidden size, 12 attention heads, RoPE positional embeddings, GLU activations, alternating local-global attention, Flash Attention, and sequence-packed training.
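
For reference, these hyperparameters can be checked directly against the published configuration. A minimal sketch, assuming the checkpoint ID from the Usage section and the standard ModernBERT config fields:

from transformers import AutoConfig

# Inspect the released configuration (values per the architecture summary above)
config = AutoConfig.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
print(config.num_hidden_layers)    # 22 layers
print(config.hidden_size)          # 768 hidden size
print(config.num_attention_heads)  # 12 attention heads
print(config.vocab_size)           # ~50K custom Turkish WordPiece vocabulary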

Results

Frozen Linear Probing (11 Turkish NLP tasks)

Model                  Params  Avg
ModernBERT-TR (ours)   150M    60.2
Turkish-E5-large       560M    53.2
mmBERT                 307M    54.9
TabiBERT               ~150M   49.1
BERTurk                111M    35.3

This is a 13.1% relative improvement over the next-best model and 70.3% over BERTurk; ModernBERT-TR outperforms models up to roughly 4x its size.
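
Frozen linear probing keeps the encoder weights fixed and trains only a linear classifier on pooled sentence representations. A minimal sketch of that setup; the mean pooling, logistic-regression probe, and toy examples are illustrative assumptions, not the paper's exact protocol:

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
encoder = AutoModel.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
encoder.eval()  # the encoder stays frozen; only the probe below is trained

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy labeled examples ("a great movie" / "a terrible experience")
texts, labels = ["harika bir film", "berbat bir deneyim"], [1, 0]
probe = LogisticRegression(max_iter=1000).fit(embed(texts), labels)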

TabiBench Full Fine-Tuning (28 tasks)

Model                  Params  Avg
ModernBERT-TR (ours)   150M    77.28
TabiBERT               ~150M   77.58
BERTurk                110M    75.96

ModernBERT-TR leads in 5 of 8 categories (text classification, STS, NLI, academic understanding, and information retrieval). TabiBERT, which was trained on additional code/math data, leads in code retrieval and QA.
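
In the fine-tuning setting, all encoder weights are updated together with a task-specific classification head. A minimal sketch with the Hugging Face Trainer; the two-example dataset, label count, and hyperparameters below are illustrative assumptions, not TabiBench's actual configuration:

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
model = AutoModelForSequenceClassification.from_pretrained(
    "ytu-ce-cosmos/modernbert-tr-base-1k", num_labels=2)  # classification head on top of the encoder

# Toy two-example dataset standing in for a real TabiBench task
train_ds = Dataset.from_dict(
    {"text": ["harika bir film", "berbat bir deneyim"], "label": [1, 0]}
).map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-tr-finetuned", num_train_epochs=3, learning_rate=2e-5),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()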

Usage

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
model = AutoModelForMaskedLM.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")

text = "Türkiye'nin başkenti [MASK]'dır."  # "The capital of Türkiye is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
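
Alternatively, the standard fill-mask pipeline handles mask lookup and decoding in one call; the predictions printed are simply whatever the model ranks highest:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ytu-ce-cosmos/modernbert-tr-base-1k")
# "The capital of Türkiye is [MASK]."
for prediction in fill_mask("Türkiye'nin başkenti [MASK]'dır."):
    print(prediction["token_str"], prediction["score"])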

Training Details

Data: FineWeb-2 Turkish (41.2B tokens) + BertTurk Corpus (5x repeated, 31.0B tokens) = 72.2B tokens/epoch, 2 epochs
Tokenizer: 50K WordPiece, trained on Turkish data
Optimizer: StableAdamW, peak LR 2e-4, cosine schedule
Batch size: 256 sequences (262K tokens/step)
MLM masking: 30% (train) / 15% (eval)
Hardware: 4x NVIDIA H100, 623 GPU-hours
Precision: BF16 mixed precision
Context length: 1,024 tokens
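
The 30% training-time masking rate is higher than the 15% commonly used for BERT-style pretraining. With the standard Hugging Face data collator, the two settings above would look like this sketch (the rates come from the table; everything else is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")

# 30% of tokens are masked for training batches, 15% for evaluation batches
train_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.30)
eval_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)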

Citation

@article{alkurdi2025modernberttr,
  title={ModernBERT-TR: A Modern Encoder Foundation Model for Turkish},
  author={Alkurdi, Besher and Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  year={2025}
}

Acknowledgments

Supported by Yildiz Technical University (FDK-2024-6070) and TUBITAK (124E055). Built on the ModernBERT codebase with FineWeb-2 and BertTurk Corpus data.
