ByT5 Leetspeak Decoder V2

Translates leetspeak, internet slang, and gaming abbreviations back to clean English.

Built on google/byt5-base. V2 is trained on real Reddit comments for improved slang handling.

Performance

Metric          V1                     V2
Accuracy        71%                    85%
Training Data   WikiText (synthetic)   Reddit (real)
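
The evaluation script isn't included here; the snippet below is only a sketch, assuming accuracy means sentence-level exact match between the model output and a reference translation.

def exact_match_accuracy(predictions, references):
    # Assumption: accuracy = share of outputs that exactly match the reference
    # after trimming whitespace and lowercasing.
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)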

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")
tokenizer = AutoTokenizer.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")

def translate(text):
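    # ByT5 works directly on UTF-8 bytes, so no special preprocessing is needed.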
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(translate("idk wh4t 2 d0 tbh"))  # I don't know what to do to be honest.
print(translate("c u l8r m8"))         # See you later mate.
print(translate("brb in 10"))          # be right back in 10
print(translate("g2g l8r m8"))         # got to go later mate
print(translate("1 h4v3 2 c4ts"))      # I have 2 cats

What It Handles

  • Leetspeak: h3ll0 w0rld → hello world
  • Slang: tbh, idk, rn, ngl, afk
  • Gaming: gg wp, brb, g2g, 1v1
  • Numbers: Preserves real numbers (2 cats stays 2 cats)
  • Context: 2 late → too late vs 2 cats → 2 cats

Training

  • Base: google/byt5-base (580M params)
  • V1: WikiText + SAMSum + synthetic corruption (sketched below)
  • V2: Real Reddit comments (5k) + Qwen 2.5 32B translations + continued training
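
The exact synthetic-corruption rules used for V1 aren't documented here; the sketch below shows the general idea with a hypothetical substitution map and corruption rate.

import random

# Hypothetical substitution map and rate; the actual V1 corruption rules aren't published.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def corrupt(sentence, rate=0.5):
    # Randomly swap characters to turn clean text into a (corrupted, clean) training pair.
    chars = [
        LEET_MAP[c.lower()] if c.lower() in LEET_MAP and random.random() < rate else c
        for c in sentence
    ]
    return "".join(chars)

pair = (corrupt("see you later mate"), "see you later mate")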
