fusion_gttbsc_distilbert-uncased-ft

Multi-label dialogue act classification (DAC) model that fuses ground-truth text with prosody and ASR encodings through residual cross-attention.

Model description

ASR encoder: Whisper small encoder
Prosody encoder: 2 layer transformer encoder with initial dense projection
Backbone: DistilBert uncased
Fusion: 2 residual cross attention fusion layers (F_asr x F_text and F_prosody x F_text) with dense layer on top
Pooling: Self attention
Multi-label classification head: 2 dense layers with two dropouts (0.3) and Tanh activation in between
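
The components listed above could be wired together roughly as follows. This is a minimal PyTorch sketch, not the released implementation: the class names, the choice of text features as the attention query, the LayerNorm after the residual, the concatenation before the dense layer, the score-based attention pooling, the placement of the two dropouts, and all dimensions are assumptions made for illustration. Only the layer counts, the 0.3 dropout rate, and the Tanh activation come from the description.

```python
import torch
import torch.nn as nn


class ResidualCrossAttention(nn.Module):
    """Text features attend to another modality; the result is added back to the text."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)  # assumption: normalisation after the residual

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + attended)  # residual connection


class FusionClassifier(nn.Module):
    """Hypothetical fusion module: two cross-attention branches, pooling, and a head."""

    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.asr_fusion = ResidualCrossAttention(dim)      # F_asr x F_text
        self.prosody_fusion = ResidualCrossAttention(dim)  # F_prosody x F_text
        self.merge = nn.Linear(2 * dim, dim)               # dense layer on top
        self.pool_score = nn.Linear(dim, 1)                # attention pooling scores
        self.head = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Dropout(0.3),
            nn.Linear(dim, num_labels),
        )

    def forward(self, f_text, f_asr, f_prosody):
        fused = torch.cat(
            [self.asr_fusion(f_text, f_asr), self.prosody_fusion(f_text, f_prosody)],
            dim=-1,
        )
        fused = self.merge(fused)                               # (batch, seq, dim)
        weights = torch.softmax(self.pool_score(fused), dim=1)  # (batch, seq, 1)
        pooled = (weights * fused).sum(dim=1)                   # (batch, dim)
        return self.head(pooled)                                # one logit per label


# Toy usage with random features; sizes are illustrative, not the model's real ones.
torch.manual_seed(0)
model = FusionClassifier(dim=64, num_labels=5).eval()
f_text = torch.randn(2, 12, 64)     # e.g. DistilBERT token states
f_asr = torch.randn(2, 50, 64)      # e.g. Whisper encoder frames
f_prosody = torch.randn(2, 30, 64)  # e.g. prosody encoder outputs
logits = model(f_text, f_asr, f_prosody)
probs = torch.sigmoid(logits)       # independent per-label probabilities
```

A sigmoid per label (rather than a softmax over labels) is the usual choice for multi-label classification, since several dialogue acts may apply to one utterance.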

Training and evaluation data

Trained on ground-truth (GT) transcripts.
Evaluated on GT transcripts and on normalized Whisper small transcripts (E2E).

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0007
train_batch_size: 2
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 20
mixed_precision_training: Native AMP
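
The relation between the batch-size settings and the shape of a linear schedule can be checked with a few lines of plain Python. This is a sketch under stated assumptions: a single training device, no warmup, decay to zero, and a hypothetical `total_steps` value, since the number of optimizer steps is not given in this card.

```python
train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption: consistent with the reported total of 8

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float = 7e-4, warmup_steps: int = 0) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


total_steps = 1000  # hypothetical; the real value depends on dataset size and epochs
lr_start = linear_lr(0, total_steps)
lr_middle = linear_lr(total_steps // 2, total_steps)
lr_end = linear_lr(total_steps, total_steps)
```

Under these assumptions the learning rate starts at 0.0007, is halved at the midpoint, and reaches zero at the final step.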

Framework versions

Transformers 4.41.2
Pytorch 2.3.0+cu121
Datasets 2.19.2
Tokenizers 0.19.1

Safetensors

Model size: 0.2B params
Tensor type: F32

Dataset used to train Masioki/fusion_gttbsc_distilbert-uncased-ft

Evaluation results

F1 macro E2E on asapp/slue-phase-2 (self-reported): TBA
F1 macro GT on asapp/slue-phase-2 (self-reported): TBA
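
For reference, macro F1 in a multi-label setting is the unweighted mean of the per-label F1 scores. The sketch below is a plain Python illustration on toy data, not this model's evaluation script or results; the function name `macro_f1` and the convention of scoring a label with no true or predicted positives as 0 are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over binary indicator rows."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives at all contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Toy example: 3 utterances, 3 labels (made-up values for illustration only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(y_true, y_pred)  # per-label F1: 1.0, 2/3, 0.0
```

Because every label counts equally, rare labels that the model misses pull the macro score down as much as frequent ones.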