The proposed SLAM-ASR provides a clean setup with little task-specific design: a frozen speech encoder and a frozen LLM are bridged by a linear projector, which is the only component trained in this recipe. Refer to the paper for more details.
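For orientation, here is a minimal PyTorch sketch of such a projector, assuming it downsamples the encoder output by stacking `k` adjacent frames before a single linear layer (the dimensions and downsample rate below are illustrative, not the exact values used in the recipe):

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Bridges a frozen speech encoder and a frozen LLM.

    Downsamples by stacking k adjacent encoder frames, then maps the
    stacked features into the LLM embedding space with one linear layer.
    """

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(encoder_dim * k, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, encoder_dim)
        b, t, d = feats.shape
        t = t - t % self.k                      # drop ragged tail frames
        feats = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(feats)                 # (batch, frames // k, llm_dim)
```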
The recipe reports WER (%) on the LibriSpeech test sets:

| Encoder | Projector | LLM | test-clean | test-other |
|---|---|---|---|---|
| WavLM-large | Linear (~18.88M) | vicuna-7b-v1.5 | 2.28 | 4.78 |
| hubert_xtralarge_ll60k_finetune_ls960 | Linear (~21.50M) | vicuna-7b-v1.5 | 1.84 | 3.39 |
You need to prepare your data as a JSONL file, one example per line, in the following format:

```
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
...
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
```
To decode with a trained checkpoint, run:

```bash
bash decode_wavlm_large_linear_vicuna_7b.sh
```

Modify the paths, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, `val_data_path` and `decode_log`, in the script when you run it.
To fine-tune with a Whisper encoder, run:

```bash
bash finetune_whisper_large_linear_vicuna_7b.sh
```

Whisper takes log-mel spectrograms as input. Pay attention to the key `dataset_config.mel_size`, which differs across versions of the Whisper model family (e.g., 80 mel bins for whisper-large-v2 and earlier, 128 for whisper-large-v3).
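If you are unsure which value a Hugging Face Whisper checkpoint expects, a quick check (a sketch; the checkpoint names are examples):

```python
from transformers import WhisperFeatureExtractor

# feature_size is the number of mel bins the checkpoint was trained with.
for ckpt in ["openai/whisper-large-v2", "openai/whisper-large-v3"]:
    fe = WhisperFeatureExtractor.from_pretrained(ckpt)
    print(ckpt, "->", fe.feature_size)  # expected: 80 for v2, 128 for v3
```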
To fine-tune with a WavLM encoder, run:

```bash
bash finetune_wavlm_large_linear_vicuna_7b.sh
```

WavLM takes the raw waveform as input. Pay attention to the keys `dataset_config.normalize` and `model_config.normalize`: different SSL models expect different settings for these keys, i.e., whether the waveform should be normalized before being fed to the encoder.
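For reference, a sketch of what such waveform normalization typically amounts to, assuming per-utterance zero-mean, unit-variance scaling (check the recipe's dataset code for the exact behavior):

```python
import torch

def normalize_waveform(wav: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Per-utterance zero-mean, unit-variance normalization of raw audio."""
    return (wav - wav.mean()) / (wav.std() + eps)

wav = torch.randn(16000)              # 1 second of dummy 16 kHz audio
print(normalize_waveform(wav).std())  # ~1.0
```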
Note:
- If you are running on a machine with multiple GPUs, please make sure only one of them is visible, using `export CUDA_VISIBLE_DEVICES=GPU:id`.
- If you want to run with FSDP, you can set `++train_config.enable_fsdp=true` and `++train_config.enable_ddp=false`.
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels, depending on the hardware being used. This can speed up the fine-tuning job. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more in the optimum documentation.
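The one-liner in question is optimum's BetterTransformer API; a sketch of how it is applied to a model (the checkpoint name is illustrative):

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
# Swap in fused / memory-efficient attention kernels where supported.
model = BetterTransformer.transform(model)
```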
If you are interested in running full-parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode (e.g., `++train_config.low_cpu_fsdp=true`, following the override style above). This option loads the model on rank 0 only before moving it to the devices to construct FSDP, which can dramatically save CPU memory when loading large models: on an 8-GPU node, it reduces CPU memory from 2+ TB to 280 GB for the 70B model. This has been tested with `BF16` on 16 x A100 80GB GPUs.
You can refer to the paper for more results.
```
@article{ma2024embarrassingly,
  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},
  journal={arXiv preprint arXiv:2402.08846},
  year={2024}
}
```