Finetuning bugs #100

Merged
saattrupdan merged 55 commits into main from fix/whisper-finetuning on Oct 23, 2024

Conversation

@saattrupdan (Collaborator) commented Oct 23, 2024

This fixes several issues with the finetuning:

  1. Interleaving already-processed datasets sometimes caused issues, so the dataset processing is now applied after the interleaving instead (see the first sketch after this list).
  2. When finetuning Whisper models, WhisperConfig.max_length is now used as the upper bound for the tokenizer's model_max_length. Previously the upper bound was hard-coded to 512, but some Whisper models require a shorter context length. This context length applies to the transcriptions (i.e., the labels) only (second sketch below).
  3. In multi-GPU setups we previously forced padding to 'max_length'. I don't recall why that was necessary, and it now (for some reason) leads to issues, so I've commented out the block forcing 'max_length' padding, along with a TODO comment noting that this might change in the future (third sketch below).
  4. PyTorch dataloaders try to be clever and split up the batches across the devices in a multi-GPU setup. Since we already handle that splitting ourselves, dispatch_batches is now set to False (fourth sketch below).

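A minimal sketch of fix (1), using toy in-memory datasets and a hypothetical preprocess function purely to illustrate the new ordering:

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the real raw datasets.
raw_a = Dataset.from_dict({"text": ["Hello", "World"]})
raw_b = Dataset.from_dict({"text": ["Foo", "Bar"]})


def preprocess(example: dict) -> dict:
    """Hypothetical per-example processing function."""
    example["text"] = example["text"].lower()
    return example


# Old (problematic) order: preprocess each dataset separately, then interleave.
# New order: interleave the raw datasets first, then apply the processing once.
combined = interleave_datasets([raw_a, raw_b], probabilities=[0.5, 0.5], seed=4242)
combined = combined.map(preprocess)
```
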
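A minimal sketch of fix (2), assuming a Hugging Face Whisper checkpoint (the model ID here is just a placeholder):

```python
from transformers import WhisperConfig, WhisperTokenizer

model_id = "openai/whisper-small"  # placeholder model ID
config = WhisperConfig.from_pretrained(model_id)
tokenizer = WhisperTokenizer.from_pretrained(model_id)

# Bound the tokenizer's model_max_length by the model's own max_length
# (typically 448 for Whisper) instead of a hard-coded 512. This bound only
# affects the transcriptions (i.e., the labels).
tokenizer.model_max_length = min(tokenizer.model_max_length, config.max_length)
```
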
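A rough sketch of fix (3), shown here as a stripped-down label-padding collator (the real data collator presumably also pads the audio features; this is not its actual code):

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder


def collate_labels(features: list[dict]) -> dict:
    """Pad the tokenised transcriptions in a batch."""
    label_features = [{"input_ids": feature["labels"]} for feature in features]
    # Previously we forced padding="max_length" in multi-GPU setups:
    # batch = processor.tokenizer.pad(
    #     label_features, padding="max_length", return_tensors="pt"
    # )
    # TODO: This might change in the future.
    return processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")
```
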
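A minimal sketch of fix (4), assuming a recent transformers version where dispatch_batches is passed via accelerator_config (the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="outputs",  # placeholder
    # Don't let Accelerate split and dispatch batches across devices; the
    # per-device splitting is already handled by our own data handling.
    accelerator_config={"dispatch_batches": False},
)
```
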
@saattrupdan self-assigned this Oct 23, 2024
@saattrupdan changed the title from Fix/whisper finetuning to Finetuning bugs on Oct 23, 2024
@saattrupdan merged commit 559354b into main on Oct 23, 2024
4 checks passed
@saattrupdan deleted the fix/whisper-finetuning branch on October 23, 2024 at 15:30