Training with preprocessed txt input and mel-spectrogram input #133

Open
youuuw opened this issue Oct 1, 2021 · 1 comment

youuuw commented Oct 1, 2021

Hi,
Thank you for the great paper!
I've been having problems training a Flowtron model with my own dataset on 8 Tesla V100 GPUs.

Some information about this dataset:

  1. The text inputs are sequences of ids, each of which represents a phoneme in a provided dictionary.
  2. The mel-spectrograms are extracted offline with hyper-parameters that differ from the defaults provided in the config.json file in this repo (see the sketch after this list).
  3. The dataset is in English.
  4. The dataset has only one speaker.
  5. The dataset has around 11k sentences in the training set and 130 sentences in the validation set.
  6. The maximum frame length is 300.
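
For reference, this is a minimal sketch of how I extract the mels offline. It is my own librosa-based helper, not code from this repo, and the parameter values simply mirror the data_config in the config I attach below; the repo's own STFT code may differ in details such as normalization.

```python
# Minimal sketch of my offline mel extraction (assumption: librosa-based,
# not the repo's STFT code). Parameter values mirror the data_config below.
import librosa
import numpy as np

def extract_mel(wav_path,
                sampling_rate=24000,
                filter_length=1024,
                hop_length=256,
                win_length=1024,
                n_mel_channels=80,
                mel_fmin=0.0,
                mel_fmax=8000.0):
    # Load audio at the configured sampling rate.
    audio, _ = librosa.load(wav_path, sr=sampling_rate)
    # Compute the mel spectrogram with the same STFT settings as the config.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sampling_rate, n_fft=filter_length,
        hop_length=hop_length, win_length=win_length,
        n_mels=n_mel_channels, fmin=mel_fmin, fmax=mel_fmax)
    # Log-compress, clipping very small values for numerical stability.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```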

My problem is that the nll loss starts shaking tremendously after reaching a certain value. I've tried different combinations of learning rate and weight decay, but the shaky loss does not improve. I'm wondering if this is normal, as I didn't see a similar situation in the issues in this repo. The loss can quite often go up to over 10.

The loss curve (screenshot from 2021-10-02, attached):

I will also attach the config I used for training:
{ "train_config": { "output_directory": "output_dir", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-5, "weight_decay": 1e-7, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 32, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["speaker", "encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true, "gate_loss": true, "use_ctc_loss": true, "ctc_loss_weight": 0.01, "blank_logprob": -8, "ctc_loss_start_iter": 10000 }, "data_config": { "train_tdd": "train.tdd", "val_tdd": "val.tdd", "mf_dirs": ["mf", "mf_2.0"], "lf_dirs": ["lf", "lf_2.0"], "speaker_format": "label", "speaker_dir": "", "speaker_stream": "", "speaker_regex": ["laura"], "text_cleaners": ["flowtron_cleaners"], "randomize": false, "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 24000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 0.0, "prior_cache_path": "/attention_prior_cache", "betab_scaling_factor": 1.0, "keep_ambiguous": false, "max_frame_length": 300 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 1, "n_speaker_dim": 128, "n_text": 84, "n_text_dim": 512, "n_flows": 1, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }
Any insights would be appreciated!

@andi-808

I would say that you have a bad training example. The text may not match the clip exactly. I found that my graphs would look choppy like this when the data was bad. As soon as I cleaned up the errors, it went away. A quick way to find such examples is sketched below.
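
For example, one rough, hypothetical filter (not something from this repo) is to flag utterances whose mel-frames-per-phoneme ratio is an outlier; badly aligned text/audio pairs tend to stand out. The file layout below is assumed: one "file_id|id1 id2 ..." entry per line in train.tdd and mels stored as mf/<file_id>.npy.

```python
# Hypothetical outlier check: flag utterances whose frames-per-token ratio is
# far from the corpus median. Adjust parsing to your actual .tdd layout.
from pathlib import Path
import numpy as np

ratios = {}
for line in Path("train.tdd").read_text().splitlines():
    file_id, token_str = line.split("|", 1)
    n_tokens = len(token_str.split())
    n_frames = np.load(Path("mf") / f"{file_id}.npy").shape[1]
    ratios[file_id] = n_frames / max(n_tokens, 1)

# Print the 20 utterances furthest from the median ratio for manual review.
median = np.median(list(ratios.values()))
worst = sorted(ratios.items(), key=lambda kv: abs(kv[1] - median), reverse=True)[:20]
for file_id, ratio in worst:
    print(f"{file_id}: {ratio:.2f} frames/token (corpus median {median:.2f})")
```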
