	---
	language: km
	license: apache-2.0
	library_name: sentencepiece
	tags:
	- tokenizer
	- khmer
	- sentencepiece
	- graph-regularization
	- low-resource
	- southeast-asian
	- cambodia
	pipeline_tag: feature-extraction
	datasets:
	- khmer-corpus-648mb
	metrics:
	- accuracy
	- f1
	model-index:
	- name: Tokkonizer-KM V3f
	  results:
	  - task:
	      type: tokenization
	      name: Khmer Tokenization
	    metrics:
	    - type: tokens-per-character
	      value: 0.293
	      name: TPC (Khmer)
	    - type: accuracy
	      value: 93.33
	      name: Sanskrit/Pali Preservation
	    - type: f1
	      value: 99.94
	      name: ALT Segmentation F1
	widget:
	- text: "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
	  example_title: "Buddhism importance"
	- text: "ព្រះរាជាណាចក្រកម្ពុជា"
	  example_title: "Kingdom of Cambodia"
	- text: "នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
	  example_title: "PM delivered speech"
	- text: "ធម៌ កម្ម និព្វាន សង្ឃ បុណ្យ"
	  example_title: "Buddhist terms (Pali/Sanskrit)"
	- text: "សង្រ្គាមនៅមជ្ឈិមបូព៌ាបានបង្កផលប៉ះពាល់យ៉ាងធ្ងន់ធ្ងរ"
	  example_title: "Geopolitical news"
	- text: "ស្រឡាញ់បងណាស់"
	  example_title: "Love you so much"
	---

	# Tokkonizer-KM V3f

	A production-ready Khmer-native tokenizer that matches or outperforms Google's mT5 and Meta's XLM-R on the Khmer metrics reported below, with a vocabulary roughly 31x smaller (8K vs. 250K tokens).

	Live Demo: [angkor-ai.com/labs](https://angkor-ai.com/labs)

	## Tokenization Examples

	```
	"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
	→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]
	4 tokens, TPC 0.143 ✅

	"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
	→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]
	3 tokens, TPC 0.094 ✅

	Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
	```
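The TPC figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes "character" means one Unicode code point (not a grapheme cluster); the card does not state the definition, but this reading matches the numbers shown. The function name `tokens_per_character` is illustrative and not part of any library.

```python
from typing import List


def tokens_per_character(pieces: List[str], text: str) -> float:
    """Number of tokens divided by the number of Unicode code points in the text."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(pieces) / len(text)


# First example from the card: 4 tokens over 28 code points.
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
pieces = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]

tpc = tokens_per_character(pieces, text)
print(f"{len(pieces)} tokens / {len(text)} code points = {tpc:.3f}")
```

Lower is better: a TPC of 0.143 means about seven code points per token on this sentence. Note that the `▁` word-boundary marker counts as a token here, as it does in the example above.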

	## Performance

	| Metric | V3f (8K) | mT5 (250K) | XLM-R (250K) |
	|--------|:---:|:---:|:---:|
	| TPC (Khmer) | 0.293 | 0.348 | 0.327 |
	| Sanskrit/Pali | 93.3% | 21.4% | 28.6% |
	| Cultural preservation | 91.7% | 75.0% | 91.7% |
	| UNK rate | 0% | 0% | 0% |
	| Lossless round-trip | Yes | No | No |
	| Speed | 15M/s | 3.3M/s | 2.8M/s |
	| ALT F1 (5K sentences) | 99.94% | n/a | n/a |

	## Intended Uses

	- Khmer text preprocessing for NLP pipelines
	- Semantic search / RAG over Khmer documents
	- Keyboard prediction engine
	- Spell checking (with companion lexicon)

	Not intended for: text generation, translation, non-Khmer languages.

	## How to Use

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")
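	ids = tokenizer.encode("ព្រះពុទ្ធសាសនា")  # list of token IDs
	# skip_special_tokens drops any special tokens that encode() may have added
	decoded = tokenizer.decode(ids, skip_special_tokens=True)  # 100% lossless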
	```

	Or with SentencePiece directly:
	```python
	import sentencepiece as spm
	sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
	pieces = sp.encode("កម្ពុជា", out_type=str) # ["▁", "កម្ពុជា"]
	```
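To check the lossless round-trip claim on your own text, a small helper like the one below can be used. It is a sketch, not part of this repository: `round_trip_failures` is a hypothetical name, and the toy codec only stands in for the tokenizer so that the snippet runs without the model file.

```python
from typing import Callable, Iterable, List


def round_trip_failures(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[str]:
    """Return every input whose decode(encode(text)) differs from the input."""
    return [text for text in texts if decode(encode(text)) != text]


# Toy codec standing in for the tokenizer: one ID per code point.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["កម្ពុជា", "abc 123"]
failures = round_trip_failures(samples, toy_encode, toy_decode)
print(failures)  # [] because the toy codec is lossless by construction
```

With the real model, pass `sp.encode` and `sp.decode` from the SentencePiece snippet above instead of the toy functions. SentencePiece normalizes input with NFKC-based rules by default, so whether raw or pre-normalized text should be compared depends on the normalization rule this model was trained with, which the card does not state.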

	## Training

	- Algorithm: SentencePiece Unigram
	- Vocabulary: 8,000 tokens
	- Corpus: 648 MB of cleaned Khmer text (957K lines) drawn from Wikipedia, news, government, and religious texts
	- Character coverage: 1.0 (full Khmer Unicode)
	- User-defined symbols (UDS): 7 Sanskrit/Pali terms
	- Key finding: a model with 7 UDS outperformed one with 500 UDS; less manual intervention gave better results
	- Hardware: Apple M3 Pro, ~30 min training
	- CO2: negligible (CPU only)
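The settings listed above map onto SentencePiece trainer options roughly as follows. This is a sketch under stated assumptions, not the author's training script: the corpus path, model prefix, and the placeholder symbols are hypothetical, and any options not listed in this section are left at SentencePiece defaults.

```python
from typing import Dict, List


def build_trainer_args(
    corpus_path: str, model_prefix: str, user_defined_symbols: List[str]
) -> Dict[str, object]:
    """Collect the trainer options named in the Training section."""
    return {
        "input": corpus_path,            # hypothetical path to the cleaned corpus
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",         # SentencePiece Unigram
        "vocab_size": 8000,              # 8,000 tokens
        "character_coverage": 1.0,       # keep every character seen in the corpus
        "user_defined_symbols": list(user_defined_symbols),
    }


def train(args: Dict[str, object]) -> None:
    """Run training; sentencepiece is imported lazily so the sketch loads without it."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**args)


# Placeholder symbols: the card does not list which 7 terms were used.
args = build_trainer_args("khmer_corpus.txt", "tokkonizer_km", ["PALI_TERM_1", "PALI_TERM_2"])
print(args)
```

Calling `train(args)` requires the `sentencepiece` package and a corpus file at the given path. Each user-defined symbol is always kept as a single piece, which is why the number of such symbols directly affects how much of the 8K vocabulary is left for the Unigram model to learn.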

	## Graph Regularization (Layer 2)

	When paired with a graph-regularized GPT-2 (a separate model):

	| Metric | Baseline | Graph-Reg |
	|--------|:---:|:---:|
	| Coherence@10 | 0.32% | 15.5% (48x) |
	| Collapse | 0% | 0.2% |
	| Perplexity cost | n/a | +2.8% |
	| Retrieval MRR | 0.417 | 0.460 (+10.4%) |
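For readers who want to judge a retrieval gain on a small question set, the sketch below computes mean reciprocal rank (MRR) and a percentile bootstrap confidence interval. It is an illustration only: the rank lists are toy values, not the data behind the table, and the exact bootstrap procedure used for this card is not documented, so the percentile method here is an assumption.

```python
import random
from typing import List, Optional, Tuple


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """ranks holds the 1-based rank of the first relevant hit per query, or None if missed."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def bootstrap_ci(
    ranks: List[Optional[int]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for MRR, resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(ranks)
    stats = sorted(
        mean_reciprocal_rank([ranks[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy ranks for two hypothetical systems over the same six queries.
baseline_ranks = [1, 2, 4, None, 1, 3]
candidate_ranks = [1, 1, 4, 5, 1, 3]

for name, ranks in [("baseline", baseline_ranks), ("candidate", candidate_ranks)]:
    low, high = bootstrap_ci(ranks)
    print(f"{name}: MRR={mean_reciprocal_rank(ranks):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of queries the intervals are wide, so two systems whose intervals overlap heavily cannot be separated with confidence. A paired bootstrap over per-query differences is a common, more sensitive alternative.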

	## Companion: Khmer NLP Engine (26MB SQLite)

	A complete prediction + correction + emoji engine built on this tokenizer:
	- 60K word-pair predictions (IDF-weighted)
	- 28K phrase predictions
	- 12,677 validated words (spell check)
	- 552 romanization mappings (Latin→Khmer)
	- 400 contextual emoji suggestions
	- 282 consonant cluster validations

	Demo: [angkor-ai.com/labs](https://angkor-ai.com/labs)

	## Limitations & Caveats

	- Sanskrit/Pali circularity: 7 of the 15 test terms are user-defined symbols (UDS), whose preservation is guaranteed, so the headline 93.3% (14/15) is partly circular. On the 8 non-UDS terms, the true EM optimizer success rate is 87.5% (7/8).
	- ALT F1 in-domain: the 99.94% boundary F1 benefits from ZWSP (zero-width space) segmentation conventions shared between the training data and ALT. Cross-domain word-level F1 is estimated at roughly 95-97%.
	- Retrieval MRR: the +10.4% gain was measured on 20 questions; it is preliminary and not statistically significant (the bootstrap CIs overlap).
	- Grapheme break rate: 1.08%, slightly above the 1.0% target
	- Corpus bias: formal/news text overrepresented vs conversational
	- Foreign names fragment into individual characters
	- សមាធិ (samadhi) is the only Sanskrit/Pali test term that still fragments
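The ALT caveat distinguishes boundary F1 from word-level F1. The sketch below shows why the two can differ sharply on the same segmentation. It uses the standard definitions (F1 over internal boundary offsets versus F1 over exact word spans) as an assumption; the card does not spell out the scoring script, and the helper names are illustrative.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """(start, end) character offsets of each word in the concatenated text."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def internal_boundaries(words: List[str]) -> Set[int]:
    """Offsets where one word ends and the next begins (end of text excluded)."""
    total = sum(len(word) for word in words)
    return {end for _, end in word_spans(words) if end != total}


def f1(gold: set, pred: set) -> float:
    """Harmonic mean of precision and recall over two sets; 0.0 when nothing matches."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred), true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction misses one boundary.
gold = ["ab", "cd", "ef"]
pred = ["ab", "cdef"]

boundary_f1 = f1(internal_boundaries(gold), internal_boundaries(pred))
word_f1 = f1(word_spans(gold), word_spans(pred))
print(f"boundary F1 = {boundary_f1:.3f}, word-level F1 = {word_f1:.3f}")
```

One missed boundary costs a single boundary match but invalidates two gold words, so word-level F1 (0.400 here) falls faster than boundary F1 (0.667). This is consistent with a cross-domain word-level estimate that sits below the in-domain boundary score.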

	## Version History

	| Version | Vocab | TPC | Status |
	|---------|:---:|:---:|---|
	| V6.5 (Aug 2025) | 32K | 0.664 | Failed |
	| V7 (Sep 2025) | 16K | 0.294 | Deployed |
	| V3f (Mar 2026) | 8K | 0.293 | Production |

	## Citation

	```bibtex
	@software{delrieu2026tokkonizer,
	author = {Delrieu, Nicolas},
	title = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
	year = {2026},
	url = {https://github.com/khopilot/tokkonizer-km}
	}
	```

	## Contact

	- [angkor-ai.com](https://angkor-ai.com)
	- nicolasdelrieu.services@gmail.com