Asking for some SoundStream training info #107

amitaie · 2023-02-22T16:06:49Z

amitaie
Feb 22, 2023

Hey, I want to train SoundStream, and as I understood from some issues here, some people managed to do so. I have some questions on the training procedure; hopefully they are not dumb:

1. What dataset should I use to get results that indicate the training is working well? I saw some people chose LibriSpeech (which is at 16 kHz, right?). Why not Libri-Light or LibriTTS?
2. With respect to 1: how many steps did the training take until it produced "not only noise"? And how many steps until it sounds good?
3. How long does each step take? (Or how many steps do you run in 24 hours?)
4. Does anybody feel like sharing some TensorBoard graphs so I'll have an idea whether the training is heading in the right direction?

I tried to go over most of the issues before asking those questions, hope I didn't miss answers to this.
Thanks in advance,

LWprogramming · 2023-02-22T20:57:54Z

LWprogramming
Feb 22, 2023

#80

1. I personally picked LibriSpeech because it was the one I saw other people using and I wanted to make sure I could replicate their work first :)
2. About 20k steps to get a voice that is kind of recognizable; probably significantly more (though probably less than an order of magnitude more) before it sounds good. This was tested on 0.11.1, so things may have changed since then.
3. About 2-3 seconds per step at batch_size=4. Step time seemed directly proportional to batch_size in the run I posted in the thread.
4. typical range of num_train_steps? #80 (comment)
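
For anyone who wants to turn those numbers into a wall-clock estimate, here is a rough sketch. The 2-3 seconds per step and the 20k steps come from the answer above; the helper names are made up for illustration, and the linear scaling with batch_size is only an extrapolation from that single observed run.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def steps_per_day(seconds_per_step):
    """How many training steps fit into 24 hours at a given step time."""
    return SECONDS_PER_DAY / seconds_per_step


def hours_for(num_steps, seconds_per_step):
    """Wall-clock hours needed to run num_steps steps."""
    return num_steps * seconds_per_step / 3600


def scaled_step_time(base_seconds, base_batch_size, new_batch_size):
    """Assumes step time grows linearly with batch size (observed in one run only)."""
    return base_seconds * new_batch_size / base_batch_size


if __name__ == "__main__":
    for sec in (2, 3):
        print(f"{sec} s/step: {steps_per_day(sec):.0f} steps per day, "
              f"{hours_for(20_000, sec):.1f} h for 20k steps")
    # Hypothetical: what batch_size=8 might cost if the scaling really is linear.
    print(scaled_step_time(2.5, base_batch_size=4, new_batch_size=8), "s/step")
```

So at this speed the "kind of recognizable" point would be roughly half a day to a day of training away, depending on hardware.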

0 replies

amitaie · 2023-02-23T09:08:21Z

amitaie
Feb 23, 2023
Author

thank you! I totally missed issue #80.

A few follow-up questions for you:

If you used LibriSpeech, did you resample the data or change the model to work with 16000 Hz (meaning strides and so on)?
I am using batch_size=4, grad_accum_every=8 and 2 sec max data length, but each step takes me 6 sec. I'm trying to speed up the training; does anything come to mind? I'm using one machine with an NVIDIA RTX 3090 and 8 workers for the data loader.

thanks a lot!

1 reply

LWprogramming Feb 23, 2023

There's some active discussion of 16 kHz at the moment but I'm not sure if this was relevant in 0.11.1
you can try replicating the work in the notebook here but note that the audiolm-pytorch version is several minor releases behind as a result. Still, you should be able to at least verify if it's a setup issue or just that the code is different.

alexdemartos · 2023-02-28T13:15:23Z

alexdemartos
Feb 28, 2023

Does SoundStream generalize to longer sequences than data_max_length? I'm experiencing a strange behaviour where the first data_max_length of the inferred audio is good, but then it gets worse as time goes by.

Sample 1: https://voca.ro/18gEb4xXekNy
Sample 2: https://voca.ro/1nOQZQ3Dsx3N

29 replies

lucidrains Mar 3, 2023
Maintainer

@ilya16 🤦‍♂️ should be resolved!

alexdemartos Mar 15, 2023

Hi. Sorry, it's been a while. I finished the experiment (Train a new SS model from scratch (w/o MultiHeadEMA, w/ Local Attn)). Disabling MultiHeadEMA generalizes much better to longer sequences. However, I can still perceive a slight quality degradation over time after the first second.

lucidrains Mar 15, 2023
Maintainer

@alexdemartos ah nice! glad to hear that things improved (and thank you Ilya for noting that there was an issue with the MHEMA)

were your most recent runs using dynamic positional bias? i'm afraid that's the biggest gun i have for length extrapolation

lucidrains Mar 15, 2023
Maintainer

@alexdemartos could you share any of your audio reconstructions? for comparison with the samples you shared previously

lucidrains Mar 15, 2023
Maintainer

@alexdemartos if you had used dynamic positional bias, i think the next best thing would be to curriculum learn greater lengths throughout training, or just finetune at the end could work as well. i've seen a few papers do this for language modeling

amitaie · 2023-03-02T10:26:05Z

amitaie
Mar 2, 2023
Author

Thanks, good to know all those things! it helps me a lot!
Is blockwise quantization dropout from EnCodec? I don't see it in the paper or in the code. Is there a reference where I can learn about it?

16 replies

exercise-book-yq Mar 29, 2023

@amitaie I can get good results without rvq, and it still shows that the discriminator is too powerful.

amitaie Mar 30, 2023
Author

@lucidrains

@amitaie @alexdemartos not entirely sure if implemented correctly, but try setting 369e8e2#diff-7e2ef88e24ccc344bf6eeeefc84a9daf25616a49d07c7bfd0473b3e039909de4R365 to 4

I think your implementation is correct and indeed uses a multiple of 4, but maybe they made a mistake in the paper, because they said:

When doing variable bandwidth training, we select randomly a number of codebooks as a multiple of 4, i.e. corresponding to a bandwidth 1.5, 3, 6, 12 or 24 kbps at 24 kHz.

and when I did calculation i got:
1.5 kbps = 75 * 10 * 2
3 kbps = 75 * 10 * 4
6 kbps = 75 * 10 * 8
12 kbps = 75 * 10 * 16
24 kbps = 75 * 10 * 32

So 2 codebooks are used, and they didn't use 12, 20 or 24. I don't see how that is a multiple of 4; it looks more like powers of 2.
Am I wrong?
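
The arithmetic above can be checked with a short sketch. It reads the factors in the calculation as 75 frames per second and 10 bits per code (a 1024-entry codebook); that reading and the helper name are assumptions for illustration, not something stated in this thread.

```python
FRAME_RATE_HZ = 75   # first factor in the calculation above (assumed: frames per second)
BITS_PER_CODE = 10   # second factor (assumed: log2 of a 1024-entry codebook)


def bitrate_kbps(num_codebooks):
    """Bitrate in kbps when num_codebooks codes are emitted per frame."""
    return FRAME_RATE_HZ * BITS_PER_CODE * num_codebooks / 1000


if __name__ == "__main__":
    # The bandwidths quoted from the paper correspond to powers of 2.
    for n in (2, 4, 8, 16, 32):
        print(f"{n:2d} codebooks -> {bitrate_kbps(n):4.1f} kbps")

    # Strict multiples of 4 would also give 9, 15, 18 and 21 kbps, and never 1.5 kbps.
    print([bitrate_kbps(n) for n in range(4, 33, 4)])
```

Under that reading, 1.5 kbps needs 2 codebooks, which is not a multiple of 4, so the quoted sentence and the listed bandwidths do not quite agree.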

pranavmalikk Apr 4, 2023

@exercise-book-yq do you have any samples with the pre-trained generator being used?

exercise-book-yq Apr 7, 2023

@pranavmalikk Even though I used a pre-trained model, the results are still not very good and differ significantly from Soundstream.

pranavmalikk Apr 23, 2023

@exercise-book-yq how were you able to conclude the discriminator is too powerful? do you have some charts/tests?

amitaie · 2023-03-15T13:44:39Z

amitaie
Mar 15, 2023
Author

Adding a follow-up question here:

In my SoundStream training I noticed that the adversarial loss always increases and the discriminator loss stays at 1 (while the multi_spectral_recon loss is slowly dropping). Does this happen to others? Is that alright, and if not, how can I control the adversarial loss so that the audio quality will be better?

attaching some tensorboard graphs to demonstrate the results:

(step = 256 samples here, so this model at least saw enough samples).

6 replies

ilya16 Mar 15, 2023

Since ComplexSTFTDiscriminator returns non-negative logits, it looks like it always returns >1 for real and 0 for fake.

audiolm-pytorch/audiolm_pytorch/soundstream.py

Lines 54 to 55 in 3c63957

    
    def hinge_discr_loss(fake, real):
        return (F.relu(1 + fake) + F.relu(1 - real)).mean()
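
To make that concrete, here is a minimal pure-Python sketch of the same hinge loss. The real code works on PyTorch tensors with F.relu; the list-based helpers and the sample logit values below are made up for illustration. If the discriminator can only output non-negative logits, the fake term relu(1 + fake) can never drop below 1, so the best achievable loss is exactly 1, which would match a discriminator loss that sits at 1.

```python
def relu(x):
    return max(0.0, x)


def hinge_discr_loss(fake, real):
    """Mean hinge loss over paired lists of fake and real logits."""
    terms = [relu(1 + f) + relu(1 - r) for f, r in zip(fake, real)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Non-negative logits: the best the discriminator can do is fake = 0, real >= 1.
    print(hinge_discr_loss(fake=[0.0, 0.0], real=[1.5, 2.0]))    # floor of 1.0

    # Signed logits: pushing fake below -1 lets the loss reach 0.
    print(hinge_discr_loss(fake=[-1.5, -2.0], real=[1.5, 2.0]))  # 0.0
```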

lucidrains Mar 16, 2023
Maintainer

@ilya16 oh yes, that is true, i'm actually not sure how best to represent the output of the complex stft discriminator

decided to make it an option, thanks!

exercise-book-yq Mar 17, 2023

Hello, I would like to ask you which release you are using to train soundstream. I am using 0.22.3 but I don't have good results at 60k steps

amitaie Mar 22, 2023
Author

@exercise-book-yq I cloned the code in version 15.9 but I don't get good results either. Nevertheless, 60k steps is not enough.

@lucidrains either way, my discriminator loss collapsed. Any idea why? I can't find a way to keep it stable.

exercise-book-yq Mar 22, 2023

I have cloned the code at many versions, such as 0.11.1, 0.7.3 and 0.23.7. I've trained for over 150k steps and still have no good results; now I'm trying 0.18.0 and turning up the batch size.

Karma-Cat · 2023-03-17T08:02:16Z

Karma-Cat
Mar 17, 2023

Hello guys!

I'm trying to train a SoundStream model on 15k samples of downtempo music, and I saw that after 1-2k steps the loss stops falling and gets stuck around 20k-25k (for example, soundstream total loss: 21725.792, soundstream recon loss: 0.066). Sometimes it falls into the 18-19k zone, but not often.
Could you please tell me whether I should use warmup + cosine annealing for the lr, or whether there are any settings to fix this?

Here my settings:
soundstream = SoundStream(
    strides = (3, 4, 5, 8),
    rq_num_quantizers = 12,
    codebook_size = 1024,
    attn_window_size = 128,
    attn_depth = 2
)
(did not use MusicLMSoundStream but added necessary parameters to SoundStream so that it was clearly visible)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path_ to_files',
    batch_size = 8,
    save_model_every = 1000,
    save_results_every = 500,
    grad_accum_every = 8,
    data_max_length_seconds = 3, # I see 3 sec audio samples in paper
    num_train_steps = 5000
)

Thank you in advance!
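
For readers unfamiliar with the schedule being asked about, this is roughly what linear warmup followed by cosine annealing looks like as a plain function. It is only a sketch: the base learning rate and warmup length are made-up values, nothing in this thread confirms that such a schedule helps SoundStream, and how to plug it into SoundStreamTrainer is not shown here.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case step runs past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values: 3e-4 peak lr, 1000 warmup steps, 5000 total steps.
    for step in (0, 500, 1000, 3000, 5000):
        print(step, f"{lr_at(step, 3e-4, 1000, 5000):.2e}")
```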

2 replies

lucidrains Mar 17, 2023
Maintainer

as long as it does not explode, just keep training. try 1 million steps

amitaie Mar 22, 2023
Author

@Karma-Cat if you got good results i would love to hear. in my training the discriminator loss collapsed and i think it causes the model to sound very robotic.

rgb000000 · 2023-03-20T14:55:34Z

rgb000000
Mar 20, 2023

Hello,
I'm trying to train a SoundStream on the FMA dataset, but after tens of thousands of steps the loss becomes very large.
I've tried several times and the result is the same.
Any advice is greatly appreciated.
Here is my training code. The version of audiolm-pytorch is 0.23.5.

    soundstream = MusicLMSoundStream()

    trainer = SoundStreamTrainer(
        soundstream,
        folder = './fma_large',
        batch_size = 3,
        grad_accum_every = 8,  
        data_max_length_seconds = 2, # train on 2 second audio
        num_train_steps = 200000,
        dl_num_workers = 16,
        results_folder="./results",
        accelerate_kwargs={
            'log_with': "tensorboard",
            'logging_dir': "./runs"
        }
    ).cuda()

    trainer.train()

The output is,

81258: soundstream total loss: 486710594540404736.000, soundstream recon loss: 1543119421440.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81259: soundstream total loss: 10458864304907091968.000, soundstream recon loss: 1539706388480.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81260: soundstream total loss: 591017856758448128.000, soundstream recon loss: 1542862290944.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81261: soundstream total loss: 474846876961603584.000, soundstream recon loss: 1543341654016.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81262: soundstream total loss: 568241035302404096.000, soundstream recon loss: 1543190216704.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81263: soundstream total loss: 534959458279751680.000, soundstream recon loss: 1543793770496.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81264: soundstream total loss: 578829336572854272.000, soundstream recon loss: 1542578733056.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81265: soundstream total loss: 1049950673002561536.000, soundstream recon loss: 1542203817984.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81266: soundstream total loss: 2597189371253751808.000, soundstream recon loss: 1540922720256.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000
81267: soundstream total loss: 1001135479170531328.000, soundstream recon loss: 1541039063040.000 | discr (scale 1) loss: 0.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.000

3 replies

lucidrains Mar 20, 2023
Maintainer

your batch size is way too small

i would recommend setting your grad_accum_every = 16

lucidrains Mar 20, 2023
Maintainer

@rgb000000 if you still have trouble, please post your full training curve as well as the last sample before it diverges

rgb000000 Mar 21, 2023

Thanks!
I will try with bigger batch size.

Liujingxiu23 · 2023-03-22T07:44:47Z

Liujingxiu23
Mar 22, 2023

@amitaie did you use fp16 for training? When I use fp16, the loss becomes NaN; with bf16, training seems to work.
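
The thread does not say why fp16 gives NaN here. One common cause (an assumption, not something confirmed above) is overflow: fp16 tops out at 65504, while bf16 keeps float32's exponent range at lower precision. The sketch below illustrates the difference using only the standard library; the helper names are made up, and bf16 is emulated by truncating a float32 to its upper 16 bits.

```python
import struct


def fits_fp16(x):
    """True if x can be stored as a finite IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)  # 'e' is the binary16 format; raises on overflow
        return True
    except OverflowError:
        return False


def to_bf16(x):
    """Emulate bf16 by keeping only the upper 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    print(fits_fp16(65504.0))  # largest finite fp16 value still fits
    print(fits_fp16(1e5))      # larger values overflow in fp16
    print(to_bf16(1e5))        # bf16 keeps the magnitude, with reduced precision
```

If overflow is indeed the cause, large intermediate loss values would become inf in fp16 and then NaN after further arithmetic, while bf16 would represent them coarsely but finitely.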

2 replies

amitaie Mar 22, 2023
Author

@Liujingxiu23 I didn't try mixed precision yet; I'm still struggling to stabilize the training, and specifically the discriminator loss.

Liujingxiu23 Mar 23, 2023

thank you for your reply

yygle · 2023-03-23T02:00:32Z

yygle
Mar 23, 2023

I tried to train with the clean part of LibriSpeech (100 + 360) for 20k steps, and the output still sounds robotic. data_max_length is 16000 * 2, at 16 kHz. Any suggestions?

1 reply

lucidrains Mar 23, 2023
Maintainer

train for a million steps

gblue1223 · 2023-03-27T00:47:19Z

gblue1223
Mar 27, 2023

@rgb000000 Hi, I am running into the same problem. Did you solve it by increasing the batch size?
My graphics card has little VRAM, so I can't increase the batch size. Is there any other way?

target_sample_hz = 16000
data_max_length_seconds = 3

save_results_every = 100
save_model_every = 1000
soundstream_train_steps = 61001

...

soundstream = SoundStream(
    codebook_size=1024,
    target_sample_hz=target_sample_hz,
    rq_num_quantizers=12,
    attn_window_size=128,  # local attention receptive field at bottleneck
    attn_depth=2,  # 2 local attention transformer blocks - the soundstream folks were not experts with attention, so i took the liberty to add some. encodec went with lstms, but attention should be better
)

trainer = SoundStreamTrainer(
    soundstream,
    folder=dataset,
    results_folder=dataset_result,
    batch_size=2,
    grad_accum_every=16,  # effective batch size of 32
    data_max_length_seconds=data_max_length_seconds,
    num_train_steps=soundstream_train_steps,
    save_results_every=save_results_every,
    save_model_every=save_model_every,
    force_clear_prev_results=force_clear_prev_results,
    accelerate_kwargs={
        'log_with': "tensorboard",
        'project_dir': "./logs"
    }
).cuda()

0 replies

hmartiro · 2023-03-28T03:44:47Z

hmartiro
Mar 28, 2023

Throwing my hat into the ring here. This is obviously a preliminary result but I'd like to share and get some feedback.

Tip of main (0.25.5) using MusicLMSoundStream parameters
Learning rate 1e-4
data_max_length_seconds = 2
batch size = 64, gradient accumulation = 8, so total batch size per step is 512
Takes about 7 seconds per step (512 samples)

The below image shows ~12.5k steps, or 6.4M samples. In this test the dataset is 450k samples.
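
As a sanity check on the numbers in this post, here is a small sketch of the bookkeeping. The figures are taken from the list above; the function names are made up for illustration.

```python
def effective_batch_size(batch_size, grad_accum_every):
    """Samples consumed per optimizer step when gradients are accumulated."""
    return batch_size * grad_accum_every


def samples_seen(steps, batch_size, grad_accum_every):
    return steps * effective_batch_size(batch_size, grad_accum_every)


if __name__ == "__main__":
    per_step = effective_batch_size(batch_size=64, grad_accum_every=8)
    total = samples_seen(steps=12_500, batch_size=64, grad_accum_every=8)
    epochs = total / 450_000  # dataset size quoted above

    print(per_step)           # samples per step
    print(total)              # samples after ~12.5k steps
    print(round(epochs, 1))   # passes over the 450k-sample dataset
```

So the curves in this post cover roughly 14 passes over the dataset.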

9 replies

lucidrains Apr 1, 2023
Maintainer

@hmartiro hey Hayk, i would say djqualia in this pull request and Alejandro are the two who have publicly stated that they have successfully trained it. there may be others who have trained it successfully in private too (usually they go radio silent)

lucidrains Apr 1, 2023
Maintainer

i'm moving onto some other work next month, but will get back to audio stuff maybe end of April

lucidrains Apr 14, 2023
Maintainer

@hmartiro would you like to give it another go given the resolution of this issue? #166

turian Apr 27, 2023

@lucidrains where are Alejandro's results?

@hmartiro What I don't see in your curves, but perhaps would expect (not sure because these curves for GANs aren't often published) is that the discriminator loss should drop, then increase and oscillate

alexdemartos May 3, 2023

Hi, sorry, it's been a while. As I mentioned, in my configuration I'm just using SoundStream model from this repo, but everything else (training code, reconstruction loss implementation, discriminators, etc.) I implemented these separately. From your training figures, I'm still very concerned about the multi-spectral reconstruction loss implementation here. I believe this loss value being orders of magnitude higher than adversarial loss (and others) might be harming the adversarial training scheme.

I'd suggest two alternatives:

Modify current multi-spectral reconstruction loss implementation
Use a very small weight for the multi-spectral reconstruction loss so that it falls into the [0.5 - 5.0] value range
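
The second alternative can be sketched as a one-line rescaling. This is only an illustration: the raw loss values used below are hypothetical, and the helper name is made up rather than part of the library.

```python
def recon_weight_for(raw_recon_loss, target_value=2.0):
    """Weight that scales an observed raw reconstruction loss to target_value."""
    if raw_recon_loss <= 0:
        raise ValueError("raw_recon_loss must be positive")
    return target_value / raw_recon_loss


if __name__ == "__main__":
    raw = 100.0  # hypothetical unweighted multi-spectral reconstruction loss
    weight = recon_weight_for(raw)
    print(weight)        # weight to pass for the reconstruction loss
    print(weight * raw)  # weighted value, inside the suggested 0.5 to 5.0 range
```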

alexdemartos · 2023-05-03T07:29:58Z

alexdemartos
May 3, 2023

Hi, sorry for the long silence. I can confirm the issues I was facing regarding quality degradation during inference after segment_length seconds are completely gone with the current SoundStream model implementation. The model is now able to generalize to longer sequences than those seen during training.

I must reiterate however in my configuration I'm just using the SoundStream model from this repo (w/o EMA), but everything else (training code, reconstruction loss implementation, discriminators, etc.) I implemented these separately.

Thanks again for sharing all your efforts with the community, greatly appreciated.

1 reply

lucidrains May 3, 2023
Maintainer

@alexdemartos thanks for sharing your experimental results Alex! this is great help to me

amitaie · 2023-05-04T22:32:36Z

amitaie
May 4, 2023
Author

Hi, also sorry for the long silence. Just want to share some of my results: after the CausalConv fix, I succeeded in training a pretty good SoundStream model.

The model was trained only on LibriSpeech (no music), at a 16000 Hz sample rate, with 32 samples of 1 sec in a batch. I started with generator warmup (50,000 steps) and no RVQ dropout. I also changed the multi-spectral reconstruction loss weight to 0.02 so it won't be too strong.

audio results on speech (validation):

audio results on classical music (OOD):

I also got pretty good results without attention at all, but I still need to check it more carefully because I made a few changes since then.

The loss curves still bother me because, as you can see below (losses of 32 RVQ), the commitment loss grows to a very high value, so I'll be digging into it for the next few weeks. Hopefully I'll find some way to improve it and get better results with fewer VQs.

hope it helps someone and thanks for this repo and all the help!!

5 replies

lucidrains May 4, 2023
Maintainer

@amitaie thank you for sharing this! also quite surprised attention didn't help. can anyone else corroborate this?

pranavmalikk May 20, 2023

Hey @amitaie i'm trying to replicate some of these results but i'm not able to configure my loss function in the same way. The following is my setup:

batch size 32 with 8 gradient accumulation
32 rq_num_quantizers
attn_dynamic_pos_bias = True
lr: float = 2e-4
EMA - true
self.rq = ResidualVQ(
    dim = codebook_dim,
    num_quantizers = rq_num_quantizers,
    codebook_size = codebook_size,
    decay = rq_ema_decay,
    commitment_weight = rq_commitment_weight,
    kmeans_init = True,
    threshold_ema_dead_code = 2,
    quantize_dropout = False,
    quantize_dropout_cutoff_index = quantize_dropout_cutoff_index
)
multi_spectral_recon_loss_weight = .02

I would love to know your thoughts on what an optimal setup is and if i'm missing some setup information here

amitaie May 21, 2023
Author

First, I'm using batch size 32 without gradient accumulation, and I do warm up the generator.

What results do you get? It's hard to give advice without some more info; if you have some graphs like the ones I posted here, that might help me to help you.

hmartiro May 22, 2023

@amitaie thanks for your post. Are you using the SoundStreamTrainer in this repo or your own code? Would be keen to see the fork if you have it. Also, is the generator warmup implemented in the code currently, or where did you add it?

pranavmalikk May 22, 2023

here are my results after ~20k steps, something seems clearly wrong with my loss function as it starts increasing even before my discriminator starts running. I'm using 32 batch size with no gradient accumulation, 32 quantizers, attn_dynamic_pos_bias as True, use_ema is True. I'm warming up generator for 50k steps. multi spectral recon weight is .02. my feature loss is 100, but i believe i need to trim this down to around 2

alexdemartos · 2023-05-10T12:15:34Z

alexdemartos
May 10, 2023

Thanks for sharing your results @amitaie . I can also confirm I've managed to train a very good performing model on 24kHz LibriTTS. I've used 3 discriminators (MPD, MSD, MS-STFT) and 2 x 8 GroupedResidualVQ similar to HiFi-codec.

Here are the training loss figures from my config:

And here's a test sample after only 100K iterations (50K steps after discriminators kicked in) [Spanish audio, even if the model was trained on English data]: https://github.com/lucidrains/audiolm-pytorch/assets/2833410/f1c8fbc8-b152-4687-b2a2-573c3778346b

Thanks again for your great contributions @lucidrains .

6 replies

lucidrains May 10, 2023
Maintainer

@alexdemartos do you have an opinion on the results of the naturalspeech2 paper? your thoughts would carry a lot of weight for me

alexdemartos May 10, 2023

Hi @lucidrains . I believe Natural Speech 2 (NS2) approach is not that different from other 2-step approaches to TTS (plus the addition of the speech prompt mechanism). There's the acoustic model that predicts an intermediate feature representation (usually mel spectrograms, latent vectors in NS2) based on the input text + some sort of speaker information (speaker embeddings, or the prompt mechanism in NS2). And then there's a vocoder model that predicts the final waveform conditioned on this intermediate feature representation (SoundStream / EnCodec in NS2).

I find very interesting the choice of this latent representation as the intermediate feature as opposed to mel spectrograms. However, I'm not sure how much NS2 benefits from this choice (if at all). Then, the other interesting part is the addition of the speech prompt mechanism to condition the variance predictors and the diffusion decoder. The results seem to be very satisfactory, and I generally find non-autoregressive approaches dealing much better with the TTS task as compared to autoregressive ones (such as VALL-E), which are well known to suffer from robustness issues during inference.

I'm not sure if this is of any help to you, or if you'd like me to comment on some other aspects of NS2 results.

lucidrains May 10, 2023
Maintainer

no that's perfect, thanks for your assessment! i think i will continue to put more effort into open sourcing on the non-autoregressive front

lucasnewman Jul 4, 2023

I was able to train a pretty reasonable 24kHz model using some of the guidance here against the current version with only minor tweaks.

I did 50k steps of generator-only warmup (lr = 2e-4, batch size = 128, grad accum = 1) and then turned on the discriminators for the next 50k steps (lr = 1e-4, batch size = 16, grad accum = 1). The dataset was the dev-clean subset of LibriTTS (data length = 1 sec), and no EMA for simplicity. It took about 17.5 hours of training on a single A100.

Here's the model spec I used:

soundstream = SoundStream(
    target_sample_hz = 24000,
    channels = 32,
    strides=(3, 4, 5, 8),
    rq_num_quantizers = 8,
    rq_groups = 2,
    multi_spectral_recon_loss_weight = 1e-2,  # 1 when generator only
    adversarial_loss_weight = 1,   # 0 when generator only
    feature_loss_weight = 100,  # 0 when generator only
)

During the generator warm up, I used a multispectral reconstruction loss weight of 1 as described in the paper, but I dropped it down to 1e-2 once I enabled the adversarial loss, because the multispectral objective seemed to be overpowering it. You can hear the robotic-sounding artifacts get resolved during adversarial training. The quality isn't quite to the level of Alex's model with the HiFi-codec discriminators, but it's not bad for only 100k steps — I'm sure it could get better with more training!
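
The two-phase weighting described here can be written down as a simple lookup. This sketch only encodes the values from this post (50k warm-up steps, then the weights in the model spec above); the function name is made up, and how the weights are actually switched inside the trainer is not shown in this thread.

```python
GENERATOR_WARMUP_STEPS = 50_000  # generator-only phase described above


def loss_weights(step):
    """Loss weights for a given training step, per the two phases in this post."""
    if step < GENERATOR_WARMUP_STEPS:
        # Generator only: reconstruction loss at full weight, no adversarial terms.
        return {
            "multi_spectral_recon_loss_weight": 1.0,
            "adversarial_loss_weight": 0.0,
            "feature_loss_weight": 0.0,
        }
    # Adversarial phase: reconstruction weight dropped so it does not dominate.
    return {
        "multi_spectral_recon_loss_weight": 1e-2,
        "adversarial_loss_weight": 1.0,
        "feature_loss_weight": 100.0,
    }


if __name__ == "__main__":
    print(loss_weights(10_000))
    print(loss_weights(60_000))
```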

@lucidrains fwiw, I experimented with local attention enabled/disabled and in my runs it seemed to help prevent codebook collapse in the RVQ earlier on in training, so I kept it on 👍

Example 1:
Original
50k steps
100k steps

Example 2:
Original
100k steps

Checkpoint:
100k steps

lucidrains Jul 4, 2023
Maintainer

@lucasnewman awesome Lucas! thank you for sharing your results (and for being a sponsor)! enjoy the fireworks tonight in SF 🎆 😄

gyohng · 2024-11-08T21:24:37Z

gyohng
Nov 8, 2024

Hello,

I'm trying to train SoundStream with sr=48000 on 2-second mono chunks of classical music. These are the settings. Every file in the dataset is a separate 2-second mono 48khz 320kbps mp3 file.

from audiolm_pytorch import SoundStream, SoundStreamTrainer

if __name__ == "__main__":
    soundstream = SoundStream(
        codebook_size=4096,
        strides=(3, 4, 5, 8),
        target_sample_hz=48000,
        rq_num_quantizers=12,
        rq_groups=2,
        use_lookup_free_quantizer=False,
        use_finite_scalar_quantizer=False,
        attn_window_size=128,
        attn_depth=2,
    )

    trainer = SoundStreamTrainer(
        soundstream,
        folder="/chunked-2s",
        batch_size=12,
        grad_accum_every=4,
        data_max_length_seconds=2,
        num_train_steps=1_000_000,
        dl_num_workers=16,
        accelerate_kwargs={"log_with": "tensorboard", "project_dir": "./logs"},
    ).cuda()

    trainer.train()

After 10k steps I only get some form of low-amplitude uniform buzz in the flac files saved to the results folder, and the progress looks like this:

...
10735: soundstream total loss: 13.240, soundstream recon loss: 0.009 | discr (scale 1) loss: 1.868 | discr (scale 0.5) loss: 1.892 | discr (scale 0.25) loss: 1.892
10736: soundstream total loss: 17.211, soundstream recon loss: 0.015 | discr (scale 1) loss: 1.865 | discr (scale 0.5) loss: 1.890 | discr (scale 0.25) loss: 1.893
10737: soundstream total loss: 14.121, soundstream recon loss: 0.010 | discr (scale 1) loss: 1.862 | discr (scale 0.5) loss: 1.888 | discr (scale 0.25) loss: 1.890
10738: soundstream total loss: 14.993, soundstream recon loss: 0.011 | discr (scale 1) loss: 1.848 | discr (scale 0.5) loss: 1.876 | discr (scale 0.25) loss: 1.879
10739: soundstream total loss: 16.238, soundstream recon loss: 0.014 | discr (scale 1) loss: 1.880 | discr (scale 0.5) loss: 1.902 | discr (scale 0.25) loss: 1.905
10740: soundstream total loss: 13.985, soundstream recon loss: 0.011 | discr (scale 1) loss: 1.888 | discr (scale 0.5) loss: 1.909 | discr (scale 0.25) loss: 1.909
10741: soundstream total loss: 12.969, soundstream recon loss: 0.011 | discr (scale 1) loss: 1.847 | discr (scale 0.5) loss: 1.877 | discr (scale 0.25) loss: 1.881
10742: soundstream total loss: 14.042, soundstream recon loss: 0.010 | discr (scale 1) loss: 1.896 | discr (scale 0.5) loss: 1.915 | discr (scale 0.25) loss: 1.917
10743: soundstream total loss: 11.491, soundstream recon loss: 0.009 | discr (scale 1) loss: 1.890 | discr (scale 0.5) loss: 1.910 | discr (scale 0.25) loss: 1.913

batch_size was chosen to be 12 (as this nearly fits 2xH100 GPU RAM entirely), and I adjusted grad_accum_every to approximately match the total value specified in the original example on the home page.

I noted that the total loss scale is very different from what people posted before. I wonder if there's any obvious mistake in my approach that I have to address for it to train.

Thanks for your help

/George.

0 replies
