Yuriy Perezhohin PRO
yuriyvnv/Qwen3-ASR-1.7B-NL
Thanks! I just made the repo public: github.com/yuriyvnv/TTS-Augmented-ASR
This is the codebase behind a paper I wrote on Estonian and Slovenian, so you'll find the full pipeline there: not just the Parakeet fine-tuning scripts, but also the synthetic data generation (LLM text diversification + OpenAI TTS synthesis) that powers the augmentation. Everything was trained on a single NVIDIA H100.
One thing worth knowing for African languages:
Parakeet v3 is pretrained on only 25 languages, so you'd be doing cross-lingual transfer from scratch. The base model won't recognize the language zero-shot, but fine-tuning still works. Just expect a much rougher starting point than what you saw in my models.
Always evaluate zero-shot first. I had one language (Polish) where fine-tuning actually made things worse, possibly because of domain mismatch or a learning rate that was too low (I'm still analyzing why this happened).
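To make the "zero-shot first" check concrete, here is a minimal sketch of a corpus-level WER comparison in plain Python. The transcripts are made-up placeholders, the function names are hypothetical, and no text normalization is applied (casing and punctuation count as part of each word), which may differ from how the numbers in the repo were computed.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] = distance between the reference words seen so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1,      # reference word missing from hypothesis
                            curr[j - 1] + 1,  # extra word in hypothesis
                            substitution))    # match (cost 0) or substitution (cost 1)
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return edits / words


# Placeholder data, not real model output
references = ["the cat sat on the mat", "hello world"]
zero_shot = ["the cat sat on mat", "hello word"]
fine_tuned = ["the cat sat on the mat", "hello word"]

base_wer = corpus_wer(references, zero_shot)
tuned_wer = corpus_wer(references, fine_tuned)
print(f"zero-shot WER: {base_wer:.3f}, fine-tuned WER: {tuned_wer:.3f}")
if tuned_wer >= base_wer:
    print("Fine-tuning did not help on this test set; keep the base model.")
```

Running the same comparison on a held-out set from your target domain before and after training is what would catch a regression like the Polish one.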
The standard recipe worked across everything I tried: AdamW, lr=5e-5, cosine annealing, 10% warmup, bf16, batch size 32-64, early stopping on val_wer. In my runs, larger batch sizes gave better gradient flow during training, especially for Parakeet models, since the model is compact.
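For anyone unsure what "cosine annealing with 10% warmup" looks like in numbers, here is a rough sketch of such a schedule in plain Python. The function name, the linear warmup shape, and the minimum learning rate of 0 are my assumptions for illustration; the scheduler actually used in the repo may differ in its exact shape.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length
for step in (0, 50, 99, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total_steps):.2e}")
```

With these settings the learning rate climbs for the first 100 steps, peaks at 5e-5, is at half the peak midway through the decay phase, and reaches 0 at the final step.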
Happy to help if you hit anything weird.
Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian, among the first community fine-tunes of this architecture for these languages.
Results on Common Voice 17 test sets:
🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)
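The percentages in parentheses are relative changes in WER, not absolute percentage-point differences. A short sketch that reproduces them from the before/after figures listed above (the helper name is just for illustration):

```python
# (zero-shot WER %, fine-tuned WER %) as reported in the list above
results = {
    "Slovenian": (50.49, 11.56),
    "Portuguese": (15.86, 10.71),
    "Estonian": (27.15, 21.03),
    "Dutch": (5.99, 5.33),
}


def relative_change(before, after):
    """Relative change in percent; negative means WER went down."""
    return (after - before) / before * 100


for language, (before, after) in results.items():
    print(f"{language}: {relative_change(before, after):+.0f}%")
# Slovenian: -77%
# Portuguese: -32%
# Estonian: -23%
# Dutch: -11%
```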
All models output cased text with punctuation.
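import nemo.collections.asr as nemo_asr

# Load the fine-tuned Dutch checkpoint by its Hugging Face repo name
model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)
# transcribe() takes a list of audio file paths and returns one result per file
output = model.transcribe(["audio.wav"])
print(output[0].text)

Models: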
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian
Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.
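As an illustration of early stopping on validation WER, here is a minimal, framework-free sketch. The class name, the patience of 3 evaluations, and the WER values are hypothetical; in a NeMo / PyTorch Lightning setup this would more likely be handled by Lightning's `EarlyStopping` callback monitoring the validation WER metric with `mode="min"`.

```python
class WerEarlyStopping:
    """Stop once val WER has not improved for `patience` consecutive evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_wer):
        """Record one evaluation; return True when training should stop."""
        if val_wer < self.best - self.min_delta:
            # Lower WER is better: remember it and reset the counter
            self.best = val_wer
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical validation WER after each evaluation
val_wers = [0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.23]

stopper = WerEarlyStopping(patience=3)
stopped_at = None
for index, wer in enumerate(val_wers):
    if stopper.step(wer):
        stopped_at = index
        break

print(f"stopped at evaluation {stopped_at}, best val WER {stopper.best:.2f}")
# stopped at evaluation 5, best val WER 0.24
```

Note that the final 0.23 is never seen here: with a short patience, training stops before a late improvement, which is the usual trade-off when choosing that value.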
@hf-audio @NVIDIADev
#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice