| --- |
| language: km |
| license: apache-2.0 |
| library_name: sentencepiece |
| tags: |
| - tokenizer |
| - khmer |
| - sentencepiece |
| - graph-regularization |
| - low-resource |
| - southeast-asian |
| - cambodia |
| pipeline_tag: feature-extraction |
| datasets: |
| - khmer-corpus-648mb |
| metrics: |
| - accuracy |
| - f1 |
| model-index: |
| - name: Tokkonizer-KM V3f |
| results: |
| - task: |
| type: tokenization |
| name: Khmer Tokenization |
| metrics: |
| - type: tokens-per-character |
| value: 0.293 |
| name: TPC (Khmer) |
| - type: accuracy |
| value: 93.33 |
| name: Sanskrit/Pali Preservation |
| - type: f1 |
| value: 99.94 |
| name: ALT Segmentation F1 |
| widget: |
| - text: "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។" |
| example_title: "Buddhism importance" |
| - text: "ព្រះរាជាណាចក្រកម្ពុជា" |
| example_title: "Kingdom of Cambodia" |
| - text: "នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា" |
| example_title: "PM delivered speech" |
| - text: "ធម៌ កម្ម និព្វាន សង្ឃ បុណ្យ" |
| example_title: "Buddhist terms (Pali/Sanskrit)" |
| - text: "សង្រ្គាមនៅមជ្ឈិមបូព៌ាបានបង្កផលប៉ះពាល់យ៉ាងធ្ងន់ធ្ងរ" |
| example_title: "Geopolitical news" |
| - text: "ស្រឡាញ់បងណាស់" |
| example_title: "Love you so much" |
| --- |
| |
| # Tokkonizer-KM V3f |
|
|
| A production-ready Khmer-native tokenizer that matches or outperforms Google's mT5 and Meta's XLM-R on every Khmer metric reported below, with a **31x smaller vocabulary** (8K vs. 250K tokens). |
|
|
| **Live Demo**: [angkor-ai.com/labs](https://angkor-ai.com/labs) |
|
|
| ## Tokenization Examples |
|
|
| ``` |
| "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។" |
| → [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។] |
| 4 tokens, TPC 0.143 ✅ |
| |
| "នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា" |
| → [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា] |
| 3 tokens, TPC 0.094 ✅ |
| |
| Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅ |
| ``` |
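The TPC (tokens per character) figures above can be reproduced with a short helper. This is a minimal sketch that assumes characters are counted as Unicode code points, which is consistent with both examples (4/28 ≈ 0.143 and 3/32 ≈ 0.094). The function name `tokens_per_char` is illustrative and not part of the tokenizer's API.

```python
def tokens_per_char(tokens, text):
    """Return tokens per character (TPC); lower means better compression.

    Assumption: "character" means one Unicode code point, so Khmer
    subscript signs and vowel signs each count separately.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokens) / len(text)


# Pieces copied from the first example above.
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
pieces = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]
print(f"{tokens_per_char(pieces, text):.3f}")  # the card reports 0.143 for this sentence
```

Counting grapheme clusters instead of code points would give a higher TPC for the same tokenization, so comparisons are only meaningful when the character definition is held fixed.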
|
|
| ## Performance |
|
|
| | Metric | **V3f (8K)** | mT5 (250K) | XLM-R (250K) | |
| |--------|:---:|:---:|:---:| |
| | TPC (Khmer) | **0.293** | 0.348 | 0.327 | |
| | Sanskrit/Pali | **93.3%** | 21.4% | 28.6% | |
| | Cultural preservation | **91.7%** | 75.0% | 91.7% | |
| | UNK rate | **0%** | 0% | 0% | |
| | Lossless round-trip | **Yes** | No | No | |
| | Speed | **15M/s** | 3.3M/s | 2.8M/s | |
| | ALT F1 (5K sentences) | **99.94%** | — | — | |
|
|
| ## Intended Uses |
|
|
| - Khmer text preprocessing for NLP pipelines |
| - Semantic search / RAG over Khmer documents |
| - Keyboard prediction engine |
| - Spell checking (with companion lexicon) |
|
|
| **Not intended for**: text generation, translation, non-Khmer languages. |
|
|
| ## How to Use |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
| tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer") |
| tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា") |
| decoded = tokenizer.decode(tokens, skip_special_tokens=True) # expected to round-trip to the input string |
| ``` |
|
|
| Or with SentencePiece directly: |
| ```python |
| import sentencepiece as spm |
| sp = spm.SentencePieceProcessor(model_file="tokenizer.model") |
| pieces = sp.encode("កម្ពុជា", out_type=str) # ["▁", "កម្ពុជា"] |
| ``` |
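The lossless round-trip property listed in the Performance table can be checked on your own text. The sketch below is tokenizer-agnostic: it accepts any `encode`/`decode` pair, and `check_round_trip` is a hypothetical helper rather than part of either library. A trivial code-point codec stands in for the real model so the block runs without `tokenizer.model`.

```python
def check_round_trip(encode, decode, texts):
    """Return the texts that do NOT survive encode -> decode unchanged."""
    return [t for t in texts if decode(encode(t)) != t]


# Stand-in codec (assumption, for illustration only): one id per code point.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


samples = ["កម្ពុជា", "ព្រះពុទ្ធសាសនា", "Phnom Penh 2026"]
failures = check_round_trip(toy_encode, toy_decode, samples)
print(failures)  # [] for this toy codec
```

With the real model, `sp.encode` and `sp.decode` from the SentencePiece snippet above can be passed in place of the toy functions.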
|
|
| ## Training |
|
|
| - **Algorithm**: SentencePiece Unigram |
| - **Vocabulary**: 8,000 tokens |
| - **Corpus**: 648MB cleaned Khmer text (957K lines) — Wikipedia, news, government, religious texts |
| - **Character coverage**: 1.0 (full Khmer Unicode) |
| - **User-defined symbols (UDS)**: 7 Sanskrit/Pali terms |
| - **Key finding**: 7 UDS outperformed 500 UDS; less manual intervention gave better results |
| - **Hardware**: Apple M3 Pro, ~30 min training |
| - **CO2**: negligible (CPU only) |
|
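The settings listed above map onto SentencePiece trainer options roughly as follows. This is a sketch under stated assumptions, not the actual training script: the corpus path, model prefix, and symbol list are placeholders, the 7 real user-defined symbols are not listed in this card, and any options beyond those named above are left at library defaults.

```python
def build_train_args(corpus_path, model_prefix, user_defined_symbols):
    """Collect SentencePieceTrainer keyword arguments matching the list above."""
    return {
        "input": corpus_path,               # one sentence per line
        "model_prefix": model_prefix,       # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 8000,
        "character_coverage": 1.0,
        "user_defined_symbols": list(user_defined_symbols),
    }


def train(corpus_path, model_prefix, user_defined_symbols):
    # Imported lazily so the argument builder works without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **build_train_args(corpus_path, model_prefix, user_defined_symbols)
    )


# Hypothetical file names and placeholder symbols, for illustration only.
args = build_train_args("khmer_corpus.txt", "tokkonizer_km", ["<term-1>", "<term-2>"])
print(args["model_type"], args["vocab_size"])
```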
|
| ## Graph Regularization (Layer 2) |
|
|
| When paired with graph-regularized GPT-2 (separate model): |
|
|
| | Metric | Baseline | Graph-Reg | |
| |--------|:---:|:---:| |
| | Coherence@10 | 0.32% | **15.5%** (48x) | |
| | Collapse | 0% | 0.2% | |
| | Perplexity cost | — | +2.8% | |
| | Retrieval MRR | 0.417 | **0.460** (+10.4%) | |
|
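For readers unfamiliar with the retrieval metric in the table, mean reciprocal rank (MRR) averages `1/rank` of the first relevant result over all queries. The helper below is a generic sketch of that definition; it is not the evaluation code behind the numbers above, and the sample ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from 1-based ranks of the first relevant hit per query.

    A rank of None means no relevant result was retrieved and contributes 0.
    """
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Four illustrative queries: hits at ranks 1, 2 and 4, plus one miss.
print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

On a 20-question set, a single query moving from rank 2 to rank 1 shifts MRR by 0.025, which is one reason small differences on such a set should be read cautiously.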
|
| ## Companion: Khmer NLP Engine (26MB SQLite) |
|
|
| A complete prediction + correction + emoji engine built on this tokenizer: |
| - 60K word-pair predictions (IDF-weighted) |
| - 28K phrase predictions |
| - 12,677 validated words (spell check) |
| - 552 romanization mappings (Latin→Khmer) |
| - 400 contextual emoji suggestions |
| - 282 consonant cluster validations |
|
|
| Demo: [angkor-ai.com/labs](https://angkor-ai.com/labs) |
|
|
| ## Limitations & Caveats |
|
|
| - **Sanskrit/Pali circularity**: 7 of 15 test terms were user-defined symbols (guaranteed preservation). True EM optimizer success rate on non-UDS terms: 87.5% (7/8). |
| - **ALT F1 is in-domain**: the 99.94% boundary F1 benefits from ZWSP segmentation conventions shared by the training data and ALT. Cross-domain word-level F1 is estimated at roughly 95-97%. |
| - **Retrieval MRR**: +10.4% on 20 questions — preliminary, not statistically significant (overlapping bootstrap CIs). |
| - Grapheme break rate: 1.08%, slightly above the 1.0% target |
| - Corpus bias: formal/news text overrepresented vs conversational |
| - Foreign names fragment into individual characters |
| - សមាធិ (samadhi) is the only Sanskrit term that still fragments |
|
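The Sanskrit/Pali caveat above comes down to simple arithmetic, spelled out here using only the counts stated in the list (15 test terms, 7 of them user-defined symbols, 1 failure).

```python
uds_terms = 7          # user-defined symbols, preserved by construction
non_uds_total = 8      # remaining test terms
non_uds_preserved = 7  # all but samadhi

# Headline figure: every test term counts, including the guaranteed ones.
overall = (uds_terms + non_uds_preserved) / (uds_terms + non_uds_total)
# Stricter figure: only terms the optimizer had to learn on its own.
non_uds_only = non_uds_preserved / non_uds_total

print(f"{overall:.1%} {non_uds_only:.1%}")  # 93.3% 87.5%
```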
|
| ## Version History |
|
|
| | Version | Vocab | TPC | Status | |
| |---------|:---:|:---:|---| |
| | V6.5 (Aug 2025) | 32K | 0.664 | Failed | |
| | V7 (Sep 2025) | 16K | 0.294 | Deployed | |
| | **V3f (Mar 2026)** | **8K** | **0.293** | **Production** | |
|
|
| ## Citation |
|
|
| ```bibtex |
| @software{delrieu2026tokkonizer, |
| author = {Delrieu, Nicolas}, |
| title = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer}, |
| year = {2026}, |
| url = {https://github.com/khopilot/tokkonizer-km} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| - [angkor-ai.com](https://angkor-ai.com) |
| - nicolasdelrieu.services@gmail.com |
|
|