# AAC_Audiocaps

## Performance and checkpoints

We use EAT as the audio encoder in this recipe. Be sure to set up the corresponding environments following the instructions provided in each repository. Below are checkpoints and performance for (1) training only the linear projector and (2) training the linear projector together with LoRA tuning of the LLM.

| Audio Encoder | Projector | LLM | PEFT | METEOR | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|
| EAT-base (fine-tuned) | Linear (~16.26M) | vicuna-7b-v1.5 | x | 0.2508 | 0.7532 | 0.1853 | 0.4692 |
| EAT-base (fine-tuned) | Linear (~16.26M) | vicuna-7b-v1.5 | LoRA (~4.19M) | 0.2606 | 0.7922 | 0.1852 | 0.4887 |

## Data preparation

Prepare your jsonl data in the following format:

{"key": "Y7fmOlUlwoNg_1", "source": "/root/data/AudioCaps/waveforms/test/Y7fmOlUlwoNg.wav", "target": "Constant rattling noise and sharp vibrations"}
{"key": "Y6BJ455B1aAs_1", "source": "/root/data/AudioCaps/waveforms/test/Y6BJ455B1aAs.wav", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}

Ensure your data aligns with this structure for consistent results.
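
Before training, you could sanity-check the manifest, for example with `jq`. This is only a minimal sketch, not part of the recipe itself, and the manifest path below is a hypothetical placeholder:

```bash
# Hypothetical manifest path; adjust to your setup.
jsonl=/root/data/AudioCaps/test.jsonl

# Print any line that is missing one of the expected fields (key, source, target).
jq -c 'select((has("key") and has("source") and has("target")) | not)' "$jsonl"

# Warn about entries whose audio file does not exist on disk.
jq -r '.source' "$jsonl" | while read -r wav; do
  [ -f "$wav" ] || echo "missing audio file: $wav"
done
```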

## Model Training

To train the model, run the following command:

```bash
bash scripts/finetune_eat_audiocaps.sh
```

You can modify the variables `audio_encoder_path`, `llm_path`, `output_dir`, `train_jsonl_path`, and `val_jsonl_path` in the script to fit your setup. To train only the linear projector (without LoRA or other PEFT methods), set `use_peft=false` and `freeze_llm=true`. A sketch of such an edit follows below.
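
For reference, a typical edit to `scripts/finetune_eat_audiocaps.sh` only touches the variables listed above. The paths below are placeholders, not the script's defaults:

```bash
# Placeholder paths; replace with your own checkpoints and data.
audio_encoder_path=/path/to/EAT-base_fine-tuned.pt
llm_path=/path/to/vicuna-7b-v1.5
train_jsonl_path=/root/data/AudioCaps/train.jsonl
val_jsonl_path=/root/data/AudioCaps/val.jsonl
output_dir=/path/to/exp/aac_audiocaps

# Train only the linear projector (no LoRA or other PEFT):
use_peft=false
freeze_llm=true
```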

## Inference

To perform inference with a trained model, run:

```bash
bash scripts/inference_eat_audiocaps.sh
```

Ensure your environment is set up and data paths are correct to reproduce results.
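
The inference script typically needs the same kind of paths as training. The snippet below is only an assumption that mirrors the training variables; check `scripts/inference_eat_audiocaps.sh` for the exact variable names it actually uses:

```bash
# Assumed variable names mirroring the training script; verify against the real script.
audio_encoder_path=/path/to/EAT-base_fine-tuned.pt
llm_path=/path/to/vicuna-7b-v1.5
output_dir=/path/to/exp/aac_audiocaps          # directory holding the trained projector / LoRA weights
val_jsonl_path=/root/data/AudioCaps/test.jsonl # manifest to decode
```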