The proposed SLAM-ASR provides a clean setup with little task-specific design: a frozen speech encoder and a frozen LLM are bridged by a linear projector, which is the only component trained in this recipe. Refer to the paper for more details.
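For orientation, here is a minimal PyTorch sketch of such a projector, assuming it downsamples the encoder output by stacking `k` adjacent frames before a single linear layer (the dimensions and downsample rate below are illustrative, not the exact values used in the recipe):

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Bridges a frozen speech encoder and a frozen LLM.

    Downsamples by stacking k adjacent encoder frames, then maps the
    stacked features into the LLM embedding space with one linear layer.
    """

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(encoder_dim * k, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, encoder_dim)
        b, t, d = feats.shape
        t = t - t % self.k                      # drop ragged tail frames
        feats = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(feats)                 # (batch, frames // k, llm_dim)
```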
The recipe reports WER (%) on the LibriSpeech test sets:

| Encoder | Projector | LLM | test-clean | test-other |
|---|---|---|---|---|
| WavLM-large | Linear (~18.88M) | vicuna-7b-v1.5 | 2.28 | 4.78 |
| hubert_xtralarge_ll60k_finetune_ls960 | Linear (~21.50M) | vicuna-7b-v1.5 | 1.84 | 3.39 |
You need to prepare your data as a JSONL file, one example per line, in the following format:

```
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
...
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
```
To decode with a trained checkpoint, run:

```bash
bash decode_wavlm_large_linear_vicuna_7b.sh
```

Modify the paths, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, `val_data_path` and `decode_log`, in the script when you run it.
To fine-tune with a Whisper encoder, run:

```bash
bash finetune_whisper_large_linear_vicuna_7b.sh
```

Whisper takes log-mel spectrograms as input. Pay attention to the key `dataset_config.mel_size`, which differs across versions of the Whisper model family (e.g., 80 mel bins for whisper-large-v2 and earlier, 128 for whisper-large-v3).
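If you are unsure which value a Hugging Face Whisper checkpoint expects, a quick check (a sketch; the checkpoint names are examples):

```python
from transformers import WhisperFeatureExtractor

# feature_size is the number of mel bins the checkpoint was trained with.
for ckpt in ["openai/whisper-large-v2", "openai/whisper-large-v3"]:
    fe = WhisperFeatureExtractor.from_pretrained(ckpt)
    print(ckpt, "->", fe.feature_size)  # expected: 80 for v2, 128 for v3
```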
To fine-tune with a WavLM encoder, run:

```bash
bash finetune_wavlm_large_linear_vicuna_7b.sh
```

WavLM takes the raw waveform as input. Pay attention to the keys `dataset_config.normalize` and `model_config.normalize`: different SSL models expect different settings for these keys, i.e., whether the waveform should be normalized before being fed to the encoder.
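For reference, a sketch of what such waveform normalization typically amounts to, assuming per-utterance zero-mean, unit-variance scaling (check the recipe's dataset code for the exact behavior):

```python
import torch

def normalize_waveform(wav: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Per-utterance zero-mean, unit-variance normalization of raw audio."""
    return (wav - wav.mean()) / (wav.std() + eps)

wav = torch.randn(16000)              # 1 second of dummy 16 kHz audio
print(normalize_waveform(wav).std())  # ~1.0
```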
Note:
- If you are running on a machine with multiple GPUs, please make sure only one of them is visible, using `export CUDA_VISIBLE_DEVICES=GPU:id`.
- If you want to run with FSDP, you can set `++train_config.enable_fsdp=true` and `++train_config.enable_ddp=false`.
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels, depending on the hardware being used. This can speed up the fine-tuning job. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more in the optimum documentation.
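The one-liner in question is optimum's BetterTransformer API; a sketch of how it is applied to a model (the checkpoint name is illustrative):

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
# Swap in fused / memory-efficient attention kernels where supported.
model = BetterTransformer.transform(model)
```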
If you are interested in running full-parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode (e.g., `++train_config.low_cpu_fsdp=true`, following the override style above). This option loads the model on rank 0 only before moving it to the devices to construct FSDP, which can dramatically save CPU memory when loading large models: on an 8-GPU node, it reduces CPU memory from 2+ TB to 280 GB for the 70B model. This has been tested with `BF16` on 16 x A100 80GB GPUs.
You can refer to the paper for more results.
```
@article{ma2024embarrassingly,
  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},
  journal={arXiv preprint arXiv:2402.08846},
  year={2024}
}
```