# ByT5 Leetspeak Decoder V2

Translates leetspeak, internet slang, and gaming abbreviations back into clean English.
Built on `google/byt5-base`. V2 was trained on real Reddit comments for improved slang handling.
## Performance
| Metric | V1 | V2 |
|---|---|---|
| Accuracy | 71% | 85% |
| Training Data | WikiText (synthetic) | Reddit (real) |
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")
tokenizer = AutoTokenizer.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")

def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(translate("idk wh4t 2 d0 tbh"))  # I don't know what to do to be honest.
print(translate("c u l8r m8"))         # See you later mate.
print(translate("brb in 10"))          # be right back in 10
print(translate("g2g l8r m8"))         # got to go later mate
print(translate("1 h4v3 2 c4ts"))      # I have 2 cats
```
## What It Handles
- **Leetspeak:** `h3ll0 w0rld` → `hello world`
- **Slang:** `tbh`, `idk`, `rn`, `ngl`, `afk`
- **Gaming:** `gg wp`, `brb`, `g2g`, `1v1`
- **Numbers:** preserves real numbers (`2 cats` stays `2 cats`)
- **Context:** `2 late` → `too late` vs. `2 cats` → `2 cats`
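A naive substitution approach illustrates why the context handling above is the hard part: fixed character and word tables cannot tell digits-used-as-letters from real numbers. A minimal sketch (the mapping tables are illustrative only, not the model's actual behavior):

```python
# Naive leetspeak decoding with fixed tables and no context.
# CHAR_MAP and WORD_MAP are illustrative; the model learns substitutions from data.
CHAR_MAP = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"}
WORD_MAP = {"2": "too", "u": "you", "l8r": "later", "m8": "mate"}

def naive_decode(text: str) -> str:
    """Apply word-level rules first, then character-level rules, ignoring context."""
    words = []
    for word in text.split():
        if word in WORD_MAP:
            words.append(WORD_MAP[word])
        else:
            words.append("".join(CHAR_MAP.get(c, c) for c in word))
    return " ".join(words)

print(naive_decode("h3ll0 w0rld"))  # hello world  -- pure leetspeak is easy
print(naive_decode("2 late"))       # too late     -- the word rule fires correctly
print(naive_decode("2 cats"))       # too cats     -- wrong: a real number got rewritten
```

A seq2seq model like ByT5 avoids this failure mode because it conditions each output on the whole input rather than on per-token rules.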
## Training
- **Base:** `google/byt5-base` (580M params)
- **V1:** WikiText + SAMSum + synthetic corruption
- **V2:** real Reddit comments (5k) + Qwen 2.5 32B translations + continued training
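The synthetic corruption used for V1-style training pairs can be sketched as randomly leetifying clean text to produce (corrupted, clean) examples. The substitution table and corruption rate below are assumptions for illustration, not the actual training recipe:

```python
import random

# Corrupt clean English into leetspeak to create (corrupted, clean) training
# pairs. The table and rate are illustrative, not the model's actual recipe.
LEETIFY = {"o": "0", "i": "1", "e": "3", "a": "4", "s": "5", "t": "7"}

def corrupt(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace each mappable character with its leet form with probability `rate`."""
    rng = random.Random(seed)
    return "".join(
        LEETIFY[c] if c in LEETIFY and rng.random() < rate else c
        for c in text
    )

clean = "hello world"
pair = (corrupt(clean), clean)  # e.g. a partially leetified copy plus its clean target
print(pair)
```

Pairs like these give the model supervision in the decoding direction: the corrupted string is the input, the clean string is the target.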