GigaAM (Giga Acoustic Model) is a Conformer-based, wav2vec2-style foundation model with roughly 240M parameters. We trained GigaAM on about 50 thousand hours of diverse Russian-language speech audio.
We fine-tuned the GigaAM encoder for Speech Recognition with two different decoders:
- GigaAM-CTC was fine-tuned with a Connectionist Temporal Classification (CTC) loss and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with an RNN Transducer (RNN-T) loss and a subword tokenizer.
Both models were trained using the NeMo toolkit on publicly available Russian labeled data:
dataset | size, hours | weight |
---|---|---|
Golos | 1227 | 0.6 |
SOVA | 369 | 0.2 |
Russian Common Voice | 207 | 0.1 |
Russian LibriSpeech | 93 | 0.1 |
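The weight column presumably controls how often each dataset is sampled during fine-tuning, independent of its size in hours. A minimal sketch of such weighted dataset sampling (the function name is illustrative and not part of any GigaAM code):

```python
import random

# Per-dataset sampling weights from the table above (assumed interpretation).
DATASET_WEIGHTS = {
    "Golos": 0.6,
    "SOVA": 0.2,
    "Russian Common Voice": 0.1,
    "Russian LibriSpeech": 0.1,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick the dataset for the next training example, proportionally to its weight."""
    names = list(DATASET_WEIGHTS)
    weights = list(DATASET_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_dataset(rng) for _ in range(10_000)]
print(draws.count("Golos") / len(draws))  # close to 0.6
```

With this scheme, Golos dominates training batches even though SOVA and the other corpora together contribute a comparable number of hours.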
The following table summarizes the performance of different models in terms of Word Error Rate (WER, %, lower is better) on open Russian datasets:
model | parameters | Golos Crowd | Golos Farfield | OpenSTT Youtube | OpenSTT Phone calls | OpenSTT Audiobooks | Mozilla Common Voice | Russian LibriSpeech |
---|---|---|---|---|---|---|---|---|
Whisper-large-v3 | 1.5B | 17.4 | 14.5 | 21.1 | 31.2 | 17.0 | 5.3 | 9.0 |
NVIDIA Ru-FastConformer-RNNT | 115M | 2.6 | 6.6 | 23.8 | 32.9 | 16.4 | 2.7 | 11.6 |
GigaAM-CTC | 242M | 3.1 | 5.7 | 18.4 | 25.6 | 15.1 | 1.7 | 8.1 |
GigaAM-RNNT | 243M | 2.3 | 4.4 | 16.7 | 22.9 | 13.9 | 0.9 | 7.4 |
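The WER values above are word-level edit distances between hypothesis and reference transcripts, normalized by the number of reference words. A minimal pure-Python sketch of the metric (illustrative only; not the evaluation script used to produce the table):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the cat"))      # 0.333... (one deletion, three reference words)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is usually reported as a percentage rather than an accuracy.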
GigaAM-Emo is an acoustic model for Emotion Recognition, obtained by fine-tuning the GigaAM encoder on the Dusha dataset.
The following table summarizes the performance of different models on the Dusha dataset:
model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
---|---|---|---|---|---|---|
DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
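For reference, the three metrics can be computed from true and predicted labels as sketched below. Definitions of "weighted" vs. "unweighted" accuracy vary between papers; the sketch assumes one common speech-emotion-recognition convention (overall accuracy vs. mean of per-class recalls), which may not match the exact evaluation protocol used for the table:

```python
from collections import Counter

def emotion_metrics(y_true, y_pred):
    """Return (weighted accuracy, unweighted accuracy, macro F1).

    Assumed conventions (they differ across papers):
    - weighted accuracy:   overall fraction of correct predictions
    - unweighted accuracy: mean of per-class recalls
    - macro F1:            mean of per-class F1 scores
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    recalls, f1s = [], []
    for c in labels:
        support = tp[c] + fn[c]
        recall = tp[c] / support if support else 0.0
        predicted = tp[c] + fp[c]
        precision = tp[c] / predicted if predicted else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    weighted_acc = sum(tp.values()) / len(y_true)
    unweighted_acc = sum(recalls) / len(recalls)
    macro_f1 = sum(f1s) / len(f1s)
    return weighted_acc, unweighted_acc, macro_f1
```

Reporting both accuracy variants matters for Dusha because its class distribution is imbalanced: overall accuracy can stay high while rare emotions are systematically missed, which the per-class averages expose.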