This model was trained for 960 steps on Korean Automatic Speech Recognition (ASR) datasets on an H100 GPU.

After that, training was continued with the CoVoST2 Dataset / CoVoST2-Ko for AST (Automatic Speech Translation).

Evaluation

Evaluation was done on the following datasets:

ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
AST (Automatic Speech Translation): evaluated with BLEU score on the fleurs ko <-> en speech translation results (270 samples).
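To make the ASR metrics concrete, here is a minimal sketch of how CER and WER can be computed from reference and hypothesis transcripts. It is not the evaluation script used for this model. The function names `edit_distance` and `error_rate` are hypothetical, and removing spaces before computing CER is an assumption; the actual script may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Corpus-level error rate in percent.

    unit="char" gives CER (assumption: spaces are dropped first),
    unit="word" gives WER (whitespace tokenization).
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "char":
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        else:
            r, h = ref.split(), hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return 100.0 * total_edits / total_len


if __name__ == "__main__":
    # One substituted character out of five.
    print(error_rate(["안녕하세요"], ["안녕하세오"], unit="char"))
    # A hypothesis much longer than the reference pushes the rate above 100.
    print(error_rate(["ab"], ["abcdef"], unit="char"))
```

Because insertions are counted against the reference length, the error rate is not capped at 100. A model that emits far more text than the reference contains can therefore score well above 100.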

The evaluation script was retrieved from here.

Model	zeroth-CER	zeroth-WER	fleurs-ko_en-BLEU	fleurs-ko_en-cot-BLEU	fleurs-en_ko-BLEU	fleurs-en_ko-cot-BLEU
original	198.32	-	5.63	2.42	6.86	4.17
daekeun-ml/Phi-4-multimodal-finetune-ko-speech	1.61	3.54	7.67	8.38	12.31	9.69
seastar105/Phi-4-mm-inst-zeroth-kor	7.02	-	7.07	9.19	13.08	9.35
ASR finetune (this model)	1.31	2.95	7.46	6.24	12.15	8.91
+ 1 epoch finetune with CoVoST2-Ko	3.88	-	8.07	10.09	18.82	15.41
AST finetuned model	1.77	2.99	8.01	9.09	17.09	11.82
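As a reading aid for the table above, the following sketch computes the change in each score between the ASR-only row and the row with one extra epoch on the CoVoST data. The numbers are copied from the table. The dictionary names and the `deltas` helper are hypothetical and only serve the illustration.

```python
# Scores copied from the table above (CER: lower is better, BLEU: higher is better).
asr_only = {
    "zeroth-CER": 1.31,
    "fleurs-ko_en-BLEU": 7.46,
    "fleurs-ko_en-cot-BLEU": 6.24,
    "fleurs-en_ko-BLEU": 12.15,
    "fleurs-en_ko-cot-BLEU": 8.91,
}
plus_covost_epoch = {
    "zeroth-CER": 3.88,
    "fleurs-ko_en-BLEU": 8.07,
    "fleurs-ko_en-cot-BLEU": 10.09,
    "fleurs-en_ko-BLEU": 18.82,
    "fleurs-en_ko-cot-BLEU": 15.41,
}


def deltas(before, after):
    """Absolute change per metric, rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}


if __name__ == "__main__":
    for name, change in deltas(asr_only, plus_covost_epoch).items():
        print(f"{name}: {change:+.2f}")
```

Under these numbers, the extra translation epoch raises every BLEU column while the zeroth CER also rises, which is the trade-off the table shows between the ASR and AST objectives.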

Model size: 6B params (Safetensors). Tensor type: F32.

Model: junnei/Phi-4-multimodal-instruct-ko-asr (fine-tuned from the base model).