Yuriy Perezhohin PRO

yuriyvnv

https://scholar.google.com/citations?user=I5uzFtwAAAAJ&hl=en

AI & ML interests

Automatic Speech Recognition, Embeddings, Code Generation, Synthetic Data Generation and Filtering

Recent Activity

updated a model 4 days ago

yuriyvnv/Qwen3-ASR-1.7B-PT

updated a model 4 days ago

yuriyvnv/Qwen3-ASR-1.7B-NL

published a model 4 days ago

yuriyvnv/Qwen3-ASR-1.7B-NL

updated 2 models 4 days ago

yuriyvnv/Qwen3-ASR-1.7B-PT

Automatic Speech Recognition • 2B • Updated 4 days ago • 143 • 1

yuriyvnv/Qwen3-ASR-1.7B-NL

Automatic Speech Recognition • 2B • Updated 4 days ago • 65

published a model 4 days ago

yuriyvnv/Qwen3-ASR-1.7B-NL

Automatic Speech Recognition • 2B • Updated 4 days ago • 65

published a model 6 days ago

yuriyvnv/Qwen3-ASR-1.7B-PT

Automatic Speech Recognition • 2B • Updated 4 days ago • 143 • 1

updated a collection 6 days ago

Best Fine-Tuned ASR Models

Collection

This collection gathers the best models fine-tuned across several Automatic Speech Recognition experiments. • 6 items • Updated 6 days ago

replied to their post 6 days ago

Thanks! Just pushed the repo public: github.com/yuriyvnv/TTS-Augmented-ASR

This is the codebase behind a paper I wrote on Estonian and Slovenian, so you'll find the full pipeline there: not just the Parakeet fine-tuning scripts, but also the synthetic data generation (LLM text diversification + OpenAI TTS synthesis) that powers the augmentation. Everything was trained on a single NVIDIA H100.

A few things worth knowing for African languages:

Parakeet v3 was pretrained on only 25 languages, so for a language outside that set you'd be relying on cross-lingual transfer alone. The base won't recognize the language zero-shot, but fine-tuning still works; just expect a much rougher starting point than what you saw in my models.
Always evaluate zero-shot first. I had one language (Polish) where fine-tuning actually made things worse, possibly because of domain mismatch or a learning rate that was too low (I'm still analyzing why).
Standard recipe worked across everything I tried: AdamW, lr=5e-5, cosine annealing, 10% warmup, bf16, batch 32-64, early stopping on val_wer. Larger batch sizes generally give more stable gradients during training, especially for Parakeet models, and the compact model leaves room for them.
Happy to help if you hit anything weird.
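
As a rough illustration of the schedule named in the recipe above, here is a minimal sketch of a 10% warmup followed by cosine annealing with a peak of 5e-5. The linear shape of the warmup, the decay to zero, and the function name `lr_at_step` are assumptions made for illustration; the training scripts in the repo may configure this differently.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine annealing.

    peak_lr and warmup_frac mirror the recipe in the post; min_lr and the
    linear warmup shape are assumptions.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, f"{lr_at_step(s, total):.2e}")
```

In a PyTorch setup the same curve would normally come from a scheduler attached to the AdamW optimizer rather than a hand-written function; the standalone version is only meant to make the shape easy to inspect.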

liked a model 7 days ago

Qwen/Qwen3-ASR-1.7B

Automatic Speech Recognition • 2B • Updated Jan 30 • 1.81M • 762

updated a collection 7 days ago

Best Fine-Tuned ASR Models

Collection

This collection gathers the best models fine-tuned across several Automatic Speech Recognition experiments. • 6 items • Updated 6 days ago

posted an update 8 days ago

🎙️ Parakeet-TDT Fine-Tuning: 4 New ASR Models

Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian, among the first community fine-tunes of this architecture for these languages.

📊 Results on Common Voice 17 test sets:

🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)
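
The percentages in parentheses appear to be relative WER reductions, that is, the drop in WER divided by the baseline WER. A small sketch of that arithmetic, using the numbers listed above (the function name is hypothetical):

```python
def relative_wer_reduction(baseline_wer, finetuned_wer):
    """Relative reduction in percent: (baseline - finetuned) / baseline * 100."""
    return (baseline_wer - finetuned_wer) / baseline_wer * 100.0


# (baseline WER %, fine-tuned WER %) as reported in the post
results = {
    "Slovenian": (50.49, 11.56),
    "Portuguese": (15.86, 10.71),
    "Estonian": (27.15, 21.03),
    "Dutch": (5.99, 5.33),
}

for language, (baseline, finetuned) in results.items():
    reduction = relative_wer_reduction(baseline, finetuned)
    print(f"{language}: {reduction:.0f}% relative reduction")
```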

All models output cased text with punctuation.
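
Because the output is cased and punctuated, the WER you measure depends on whether references and hypotheses are normalized before scoring. The post does not say which normalization, if any, was used for the numbers above, so the sketch below is only an illustration: a plain word-level edit-distance WER plus a hypothetical `normalize` helper that lowercases and strips ASCII punctuation.

```python
import string


def normalize(text):
    """Lowercase and strip ASCII punctuation (a hypothetical normalization choice)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference, hypothesis = "Hello, world.", "hello world"
    print("raw WER:", wer(reference, hypothesis))
    print("normalized WER:", wer(normalize(reference), normalize(hypothesis)))
```

The example pair differs only in casing and punctuation, so the raw score counts every word as an error while the normalized score counts none, which is why the two settings should not be compared directly.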

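import nemo.collections.asr as nemo_asr

# Load the fine-tuned Dutch checkpoint by its Hugging Face model ID
model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)
# transcribe() takes a list of audio file paths and returns one result per file
output = model.transcribe(["audio.wav"])
print(output[0].text)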

🔗 Models:
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian

🏗️ Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.
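
For readers unfamiliar with early stopping on validation WER, the idea can be sketched as below. The `EarlyStopping` class, the `patience` value, and the `min_delta` threshold are assumptions for illustration; the post does not state which settings were used, and in NeMo this is typically handled by a trainer callback rather than custom code.

```python
class EarlyStopping:
    """Stop training when validation WER has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the post
        self.min_delta = min_delta    # minimum drop in WER that counts as progress
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_wer):
        """Record one validation result; return True when training should stop."""
        if val_wer < self.best - self.min_delta:
            # Lower WER is better, so this check counts as an improvement.
            self.best = val_wer
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=2)
    # Made-up validation WER values, one per validation check
    for epoch, val_wer in enumerate([0.30, 0.25, 0.26, 0.27, 0.24]):
        if stopper.step(val_wer):
            print(f"stopping after check {epoch}, best val WER {stopper.best}")
            break
```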

@hf-audio @NVIDIADev

#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice