Replies: 15 comments 81 replies
-
Thank you! I totally missed issue #80. A few follow-up questions for you:
Thanks a lot!
-
Does SoundStream generalize to longer sequences than Sample 1: https://voca.ro/18gEb4xXekNy
-
Thanks, good to know all those things! It helps me a lot!
-
Adding a follow-up question here: in my SoundStream training I noticed that the adversarial loss always increases and the discs loss stays at 1 (while the multi_spectral_recon loss is slowly dropping). Does this happen to others? Is that all right, and if not, how can I control the adversarial loss so that the audio quality improves? Attaching some TensorBoard graphs to demonstrate the results (step = 256 samples here, so this model has at least seen enough samples).
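For reference when reading such curves, here is a minimal sketch of a common hinge formulation of GAN losses. It is an assumption that the logged values use this formulation (the thread does not say), and the logit values are made up for illustration.

```python
def hinge_discriminator_loss(real_logits, fake_logits):
    """Mean of relu(1 - real) + relu(1 + fake) over paired logits."""
    terms = [max(0.0, 1.0 - r) + max(0.0, 1.0 + f)
             for r, f in zip(real_logits, fake_logits)]
    return sum(terms) / len(terms)


def hinge_generator_loss(fake_logits):
    """Adversarial loss for the generator: negative mean of the fake logits."""
    return -sum(fake_logits) / len(fake_logits)


# Hypothetical logits: an undecided discriminator vs. a confident one.
undecided = {"real": [0.1, -0.1], "fake": [0.1, -0.1]}
confident = {"real": [2.0, 3.0], "fake": [-2.0, -3.0]}

for name, logits in (("undecided", undecided), ("confident", confident)):
    d_loss = hinge_discriminator_loss(logits["real"], logits["fake"])
    g_loss = hinge_generator_loss(logits["fake"])
    print(f"{name}: discriminator={d_loss:.2f} generator={g_loss:.2f}")
```

Under this formulation the generator's adversarial loss rises whenever the discriminator scores generated audio more negatively, so the curve tracks the balance between the two networks rather than audio quality directly.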
-
Hello guys! I am trying to train a SoundStream model on 15k samples of downtempo music, and I saw that after 1-2k steps the loss stops falling and gets stuck around 20k-25k (for example, soundstream total loss: 21725.792, soundstream recon loss: 0.066). Sometimes it drops into the 18-19k range, but not often. Here are my settings:
Thank you in advance!
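The numbers quoted above imply that the reconstruction term is a negligible part of the total. A minimal sketch of how to check which term dominates, assuming the total is a weighted sum of individual loss terms. Only the recon value of 0.066 comes from the post; every other term name, value, and weight below is a hypothetical placeholder, not a library default.

```python
def loss_breakdown(terms, weights):
    """Return the weighted total and each term's share of it."""
    weighted = {name: terms[name] * weights[name] for name in terms}
    total = sum(weighted.values())
    shares = {name: value / total for name, value in weighted.items()}
    return total, shares


# Hypothetical values, except for recon (taken from the post above).
terms = {"recon": 0.066, "multi_spectral_recon": 4000.0,
         "adversarial": 2.0, "feature": 10.0}
weights = {"recon": 1.0, "multi_spectral_recon": 1.0,
           "adversarial": 1.0, "feature": 100.0}

total, shares = loss_breakdown(terms, weights)
dominant = max(shares, key=shares.get)
print(f"total={total:.3f}, dominant term: {dominant} ({shares[dominant]:.1%})")
```

Logging the weighted terms separately like this makes it easier to tell whether a plateau in the total reflects the reconstruction quality or one of the other terms.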
-
Hello,

from audiolm_pytorch import MusicLMSoundStream, SoundStreamTrainer

soundstream = MusicLMSoundStream()

trainer = SoundStreamTrainer(
    soundstream,
    folder = './fma_large',
    batch_size = 3,
    grad_accum_every = 8,
    data_max_length_seconds = 2, # train on 2 second audio
    num_train_steps = 200000,
    dl_num_workers = 16,
    results_folder="./results",
    accelerate_kwargs={
        'log_with': "tensorboard",
        'logging_dir': "./runs"
    }
).cuda()

trainer.train()

The output is,
-
@amitaie did you use fp16 for training? When I use fp16, the loss becomes NaN; with bf16, training seems to work.
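One plausible explanation (an assumption, not something confirmed in this thread) is dynamic range: fp16 overflows just above 65504, while bf16 keeps the exponent range of float32 at lower precision. The sketch below illustrates the difference with the standard library only; `to_bf16` is a hypothetical helper that emulates bfloat16 by truncating a float32 encoding.

```python
import struct


def fits_in_fp16(value):
    """True if value can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def to_bf16(value):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top_two_bytes = struct.pack(">f", value)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


for value in (65504.0, 1e5):
    print(value, "fits in fp16:", fits_in_fp16(value), "| as bf16:", to_bf16(value))
```

Hugging Face Accelerate's `Accelerator` accepts a `mixed_precision` argument, so passing it through the trainer's `accelerate_kwargs` is presumably how the precision would be selected here.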
-
I tried to train with the clean part of LibriSpeech (100 + 360) for 20k steps, and the output still sounds robotic. data_max_length is 16000 * 2, at 16 kHz. Any suggestions?
-
@rgb000000 Hi, I am running into the same problem. Did you solve it by increasing the batch size?
-
Throwing my hat into the ring here. This is obviously a preliminary result, but I'd like to share it and get some feedback.
The image below shows ~12.5k steps, or 6.4M samples. In this test the dataset is 450k samples.
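Since step counts in this thread are only comparable once converted to samples, here is a small sketch of the arithmetic. The helper names are hypothetical, and the formula for the effective batch size assumes each optimizer step consumes `batch_size * grad_accum_every` samples per process.

```python
def samples_per_step(total_samples_seen, steps):
    """Average number of samples consumed per training step."""
    return total_samples_seen / steps


def epochs_seen(total_samples_seen, dataset_size):
    """How many passes over the dataset the run has made."""
    return total_samples_seen / dataset_size


def effective_batch_size(batch_size, grad_accum_every, num_processes=1):
    """Samples per optimizer step under gradient accumulation (assumed formula)."""
    return batch_size * grad_accum_every * num_processes


# Figures from the post above: ~12.5k steps, 6.4M samples, 450k-sample dataset.
print(samples_per_step(6_400_000, 12_500))
print(round(epochs_seen(6_400_000, 450_000), 1))
# batch_size=3 with grad_accum_every=8, as in the config posted earlier in this thread.
print(effective_batch_size(3, 8))
```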
-
Hi, sorry for the long silence. I can confirm that the issues I was facing with quality degradation during inference after segment_length seconds are completely gone with the current SoundStream model implementation. The model is now able to generalize to sequences longer than those seen during training. I must reiterate, however, that in my configuration I'm using only the SoundStream model from this repo (without EMA); everything else (training code, reconstruction loss implementation, discriminators, etc.) I implemented separately. Thanks again for sharing all your efforts with the community, greatly appreciated.
-
Hi, also sorry for the long silence. I just want to share some of my results: after the CausalConv fix, I managed to train a pretty good SoundStream model. The model was trained only on LibriSpeech (no music), at a 16,000 Hz sample rate, with batches of 32 one-second samples. I started with a generator warmup (50,000 steps) and no RVQ dropout. I also changed the multi-spectral reconstruction loss weight to 0.02 so it won't be too strong. Audio results on speech (validation):
Audio results on classical music (OOD):
I also got pretty good results without attention at all, but I still need to check that more carefully because I have made a few changes since then. The loss curves still bother me because, as you can see below (losses of 32 RVQ), the commitment loss grows to a very high value, so I'll be digging into it over the next few weeks. Hopefully I'll find some way to improve it and get better results with fewer VQs. Hope it helps someone, and thanks for this repo and all the help!!
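A generator warmup like the one described above can be sketched as a step-dependent weight on the adversarial term. This is only an illustration: the function names and the way the terms are combined are hypothetical, and only the 50,000-step warmup and the 0.02 multi-spectral weight come from the post.

```python
def adversarial_weight(step, warmup_steps=50_000, weight_after=1.0):
    """Return 0 during generator warmup, then the full adversarial weight."""
    return 0.0 if step < warmup_steps else weight_after


def generator_loss(recon, multi_spectral, adversarial, step,
                   multi_spectral_weight=0.02):
    """Combine the terms; the adversarial term is switched off during warmup."""
    return (recon
            + multi_spectral_weight * multi_spectral
            + adversarial_weight(step) * adversarial)


# Hypothetical loss values, evaluated before and after the warmup ends.
print(generator_loss(1.0, 100.0, 5.0, step=0))
print(generator_loss(1.0, 100.0, 5.0, step=60_000))
```

During warmup the generator is trained on the reconstruction terms alone, so the discriminators only start shaping the output once the weight switches on.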
-
Thanks for sharing your results, @amitaie. I can also confirm that I've managed to train a very well-performing model on 24 kHz LibriTTS. I used 3 discriminators (MPD, MSD, MS-STFT) and 2 x 8 GroupedResidualVQ, similar to HiFi-Codec. Here are the training loss figures from my config: And here's a test sample after only 100K iterations (50K steps after the discriminators kicked in) [Spanish audio, even though the model was trained on English data]: https://github.com/lucidrains/audiolm-pytorch/assets/2833410/f1c8fbc8-b152-4687-b2a2-573c3778346b Thanks again for your great contributions, @lucidrains.
-
Hello, I'm trying to train SoundStream with sr=48000 on 2-second mono chunks of classical music. These are the settings. Every file in the dataset is a separate 2-second mono 48 kHz 320 kbps MP3 file.

from audiolm_pytorch import SoundStream, SoundStreamTrainer

if __name__ == "__main__":
    soundstream = SoundStream(
        codebook_size=4096,
        strides=(3, 4, 5, 8),
        target_sample_hz=48000,
        rq_num_quantizers=12,
        rq_groups=2,
        use_lookup_free_quantizer=False,
        use_finite_scalar_quantizer=False,
        attn_window_size=128,
        attn_depth=2,
    )
    trainer = SoundStreamTrainer(
        soundstream,
        folder="/chunked-2s",
        batch_size=12,
        grad_accum_every=4,
        data_max_length_seconds=2,
        num_train_steps=1_000_000,
        dl_num_workers=16,
        accelerate_kwargs={"log_with": "tensorboard", "project_dir": "./logs"},
    ).cuda()
    trainer.train()

After 10k steps I only get some form of low-amplitude uniform buzz in the FLAC files saved to the results folder, and the progress looks like this:
I noted that the
Thanks for your help /George.
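As a sanity check on a configuration like the one above, the following sketch works out the codec frame rate and nominal bitrate it implies. It assumes each stride downsamples by its factor and that every group carries its own set of `rq_num_quantizers` codebooks; both are assumptions, and the helper names are hypothetical.

```python
from math import log2, prod


def codec_frame_rate(sample_hz, strides):
    """Latent frames per second after all strided downsampling stages."""
    return sample_hz / prod(strides)


def nominal_bits_per_second(sample_hz, strides, codebook_size,
                            num_quantizers, groups=1):
    """Frame rate times codes per frame times bits per code."""
    bits_per_code = log2(codebook_size)
    codes_per_frame = num_quantizers * groups
    return codec_frame_rate(sample_hz, strides) * codes_per_frame * bits_per_code


frame_rate = codec_frame_rate(48_000, (3, 4, 5, 8))
bitrate = nominal_bits_per_second(48_000, (3, 4, 5, 8), 4096, 12, groups=2)
print(f"{frame_rate:.0f} frames/s, {bitrate / 1000:.1f} kbps, "
      f"{frame_rate * 2:.0f} frames per 2-second clip")
```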
-
Hey, I want to train SoundStream, and as I understand from some issues here, some people have managed to do so. I have some questions about the training procedure; hopefully they are not silly:
I tried to go over most of the issues before asking these questions; I hope I didn't miss the answers.
Thanks in advance,