# ModernBERT-TR

**A Modern Encoder Foundation Model for Turkish**

Besher Alkurdi, Himmet Toprak Kesgin, Muzaffer Kaan Yuce, Mehmet Fatih Amasyali
## Overview
ModernBERT-TR is a 150M-parameter Turkish encoder pretrained from scratch on 144.4B tokens using the ModernBERT architecture. It uses a custom 50K WordPiece tokenizer optimized for Turkish morphology.
**Architecture:** 22 layers, 768 hidden size, 12 attention heads, rotary position embeddings (RoPE), GLU activations, alternating local-global attention, Flash Attention, and sequence-packed training.
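These hyperparameters can be checked directly against the published checkpoint; a quick sketch (attribute names assume the Hugging Face `ModernBertConfig` and may vary by `transformers` version):

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters listed above.
config = AutoConfig.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 22 768 12
print(getattr(config, "global_attn_every_n_layers", None))  # local-global alternation period
```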
## Results
### Frozen Linear Probing (11 Turkish NLP tasks)
| Model | Params | Avg |
|---|---|---|
| ModernBERT-TR (ours) | 150M | 60.2 |
| Turkish-E5-large | 560M | 53.2 |
| mmBERT | 307M | 54.9 |
| TabiBERT | ~150M | 49.1 |
| BERTurk | 111M | 35.3 |
ModernBERT-TR scores +13.1% relative over the next-best model and +70.3% relative over BERTurk, outperforming models up to 4× larger.
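For reference, frozen linear probing keeps the encoder weights fixed and trains only a linear classifier on its embeddings. A minimal sketch of this setup (mean pooling and `LogisticRegression` are illustrative assumptions; the paper's exact protocol may differ):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Frozen linear probing: encode texts with the frozen encoder,
# then fit only a linear classifier on top of the embeddings.
tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
model = AutoModel.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k").eval()

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # exclude padding from the mean
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# `train_texts` / `train_labels` are placeholders for one of the 11 probing tasks:
# probe = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
```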
### TabiBench Full Fine-Tuning (28 tasks)
| Model | Params | Avg |
|---|---|---|
| ModernBERT-TR (ours) | 150M | 77.28 |
| TabiBERT | ~150M | 77.58 |
| BERTurk | 110M | 75.96 |
ModernBERT-TR leads in 5 of 8 task categories (text classification, STS, NLI, academic understanding, and information retrieval). TabiBERT leads in code retrieval and QA, consistent with its code/math pretraining data.
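Unlike the probing setup, full fine-tuning updates all encoder weights per task. A minimal sketch using the standard `Trainer` API (hyperparameters and `num_labels` are illustrative, not the TabiBench recipe):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Full fine-tuning sketch for a single classification task.
model = AutoModelForSequenceClassification.from_pretrained(
    "ytu-ce-cosmos/modernbert-tr-base-1k", num_labels=2
)
args = TrainingArguments(output_dir="out", learning_rate=5e-5,
                         num_train_epochs=3, per_device_train_batch_size=32)
# Supply tokenized datasets for the task at hand:
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```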
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
model = AutoModelForMaskedLM.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")

# "The capital of Türkiye is [MASK]."
text = "Türkiye'nin başkenti [MASK]'dır."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds MLM predictions per position
```
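One way to decode the top prediction at the masked position, continuing the snippet above (standard `transformers`/PyTorch calls):

```python
# Locate the [MASK] position and decode the highest-scoring token.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```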
## Training Details

| Setting | Value |
|---|---|
| Data | FineWeb-2 Turkish (41.2B tokens) + BertTurk Corpus ×5 (31.0B tokens) = 72.2B tokens/epoch, 2 epochs |
| Tokenizer | 50K WordPiece, trained on Turkish data |
| Optimizer | StableAdamW, peak LR 2e-4, cosine schedule |
| Batch size | 256 sequences (262K tokens/step) |
| MLM masking | 30% (train) / 15% (eval) |
| Hardware | 4× NVIDIA H100, 623 GPU-hours |
| Precision | BF16 mixed precision |
| Context length | 1,024 tokens |
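The 30% masking rate is higher than BERT's classic 15%. If you continue pretraining with Hugging Face tooling, the same rate can be reproduced with the standard MLM collator (a sketch; the original run used the ModernBERT codebase, not this collator):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Mask 30% of tokens, matching the training setting in the table above.
tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)
```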
## Citation

```bibtex
@article{alkurdi2025modernberttr,
  title={ModernBERT-TR: A Modern Encoder Foundation Model for Turkish},
  author={Alkurdi, Besher and Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  year={2025}
}
```
## Acknowledgments
Supported by Yildiz Technical University (FDK-2024-6070) and TUBITAK (124E055). Built on the ModernBERT codebase with FineWeb-2 and BertTurk Corpus data.