Very nice repo! Thank you, authors, for your contribution.
Here is my situation: I have been trying to train SoundStream from scratch with this repo (version 1.2.7) on about 20,000 hours of open-source 16 kHz speech data. I made essentially no changes to the repo except configuring the trainer as follows:
trainer = SoundStreamTrainer(
    soundstream,
    audio_path_list = audio_path_list,
    batch_size = 12,
    grad_accum_every = 8,           # effective batch size of 12 * 8 == 96
    data_max_length_seconds = 2,    # train on 2 second audio
    num_train_steps = 1_000_000,
).cuda()
I have been running this on 4x A100 GPUs for a couple of days, and after it passed 10k steps I obtained the audio samples below. There are some signs of speech forming, but the noise is heavy. The total loss has stayed around ~20, gradually decreasing towards ~10. Based on my experience training vocoders such as HiFi-GAN/WaveGAN, I suspect the number of training steps is simply not enough yet and the high-frequency content has not been learned. However, I am a newbie in large-model training, so I'm not confident I'm on the right track. Do I just need more training steps, or has something gone wrong?
If anyone has run into (or solved) a similar problem, please share some information.
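For what it's worth, here is some rough arithmetic on how much audio the model has actually consumed after 10k optimizer steps. This is only an estimate; it assumes batch_size = 12 is per GPU under the accelerate launcher, so if it is a global batch size the numbers below should be divided by 4.

```python
# Rough estimate of audio consumed after 10k optimizer steps.
# Assumption: batch_size = 12 is per GPU (4 GPUs); if it is global, divide by 4.
seconds_per_clip = 2
batch_per_gpu    = 12
grad_accum       = 8
num_gpus         = 4
steps            = 10_000

seconds_per_update = seconds_per_clip * batch_per_gpu * grad_accum * num_gpus  # 768 s
hours_seen         = seconds_per_update * steps / 3600                         # ~2133 h
epochs_over_corpus = hours_seen / 20_000                                       # ~0.11 passes

print(f"~{hours_seen:.0f} h of audio seen, ~{epochs_over_corpus:.2f} passes over the 20k h corpus")
```

So even under the per-GPU assumption, the model has only seen roughly a tenth of the corpus, which is part of why I suspect it may just need more steps.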
8k steps: [audio sample]
9k steps: [audio sample]
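Since it is hard to judge high-frequency content by ear, here is the rough check I am running on the saved samples. The file names are placeholders for a ground-truth clip and the corresponding reconstruction that the trainer writes out; torchaudio is an extra dependency I use for convenience, not something this repo requires.

```python
import torch
import torchaudio

# Placeholder paths -- point them at a ground-truth clip and the matching
# reconstruction saved during training.
orig, sr = torchaudio.load("original.wav")
recon, _ = torchaudio.load("reconstruction_10k.wav")

spec = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256, power=2.0)

orig_db  = 10 * torch.log10(spec(orig)  + 1e-10)
recon_db = 10 * torch.log10(spec(recon) + 1e-10)

# Compare energy in the upper half of the spectrum (above ~4 kHz at 16 kHz sample rate)
half = orig_db.shape[-2] // 2
print("high-band mean dB (original):      ", orig_db[..., half:, :].mean().item())
print("high-band mean dB (reconstruction):", recon_db[..., half:, :].mean().item())
```

If the reconstruction's high-band energy stays far below the original's over many checkpoints, that would support the "high frequencies not learned yet" reading.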
And then the gradients went out of control after 10,500 steps. I think the run has definitely failed at that point, but I don't know why.
10.5k steps: [audio sample]
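To catch this kind of blow-up earlier next time, I plan to log the total gradient norm. This is just a generic PyTorch helper; where exactly to call it inside the SoundStreamTrainer loop depends on the version, so the usage shown in the comments is hypothetical.

```python
import torch

def total_grad_norm(module: torch.nn.Module) -> float:
    # L2 norm over all parameter gradients; returns 0.0 if no gradients are populated yet
    total = 0.0
    for p in module.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5

# Hypothetical usage, e.g. right after loss.backward() in the training loop:
# norm = total_grad_norm(soundstream)
# if norm > 1e3:
#     print(f"gradient norm spiked to {norm:.1f}")
```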
Additional information:
soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
    rq_groups = 2,            # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
    attn_window_size = 128,   # local attention receptive field at bottleneck
    attn_depth = 2
)
lr = 2e-4 (the default). I didn't change any of these; should I?
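One change I am considering for the next run is lowering the learning rate and enabling gradient clipping through the trainer. I have not verified that these keyword arguments exist in version 1.2.7, so treat lr, max_grad_norm and discr_max_grad_norm as assumptions and check the SoundStreamTrainer signature first.

```python
# Sketch only -- lr / max_grad_norm / discr_max_grad_norm are assumed keyword
# arguments; verify them against the SoundStreamTrainer signature of the
# installed audiolm-pytorch version before relying on this.
trainer = SoundStreamTrainer(
    soundstream,
    audio_path_list = audio_path_list,
    batch_size = 12,
    grad_accum_every = 8,
    data_max_length_seconds = 2,
    num_train_steps = 1_000_000,
    lr = 1e-4,                   # half of the 2e-4 default
    max_grad_norm = 0.5,         # assumed: clip generator gradients
    discr_max_grad_norm = 0.5,   # assumed: clip discriminator gradients
).cuda()
```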