Training with preprocessed txt input and mel-spectrogram input #133

Open
youuuw opened this issue Oct 1, 2021 · 1 comment

youuuw commented Oct 1, 2021

Hi,
Thank you for the great paper!
I've been having problems training a Flowtron model with my own dataset on 8 Tesla V100 GPUs.

Some information about this dataset:

  1. The text inputs are sequences of ids, each of which represents a phoneme in a provided dictionary.
  2. The mel-spectrograms are extracted offline with hyper-parameters that differ from the defaults provided in the config.json file in this repo (see the sketch after this list).
  3. The dataset is in English.
  4. The dataset has only one speaker.
  5. The dataset has around 11k sentences in the training set and 130 sentences in the validation set.
  6. The maximum frame length is 300.
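
For reference, this is a minimal sketch of how I extract the mels offline. It is my own librosa-based helper, not code from this repo, and the parameter values simply mirror the data_config in the config I attach below; the repo's own STFT code may differ in details such as normalization.

```python
# Minimal sketch of my offline mel extraction (assumption: librosa-based,
# not the repo's STFT code). Parameter values mirror the data_config below.
import librosa
import numpy as np

def extract_mel(wav_path,
                sampling_rate=24000,
                filter_length=1024,
                hop_length=256,
                win_length=1024,
                n_mel_channels=80,
                mel_fmin=0.0,
                mel_fmax=8000.0):
    # Load audio at the configured sampling rate.
    audio, _ = librosa.load(wav_path, sr=sampling_rate)
    # Compute the mel spectrogram with the same STFT settings as the config.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sampling_rate, n_fft=filter_length,
        hop_length=hop_length, win_length=win_length,
        n_mels=n_mel_channels, fmin=mel_fmin, fmax=mel_fmax)
    # Log-compress, clipping very small values for numerical stability.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```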

My problem is that the nll loss starts shaking tremendously after reaching a certain value. I've tried different combinations of learning rate and weight decay, but the shaky loss does not improve. I'm wondering if this is normal, as I didn't see a similar situation in the issues in this repo. The loss can quite often go up to over 10.

The loss curve (screenshot from 2021-10-02, attached):

I will also attach the config I used for training:
{ "train_config": { "output_directory": "output_dir", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-5, "weight_decay": 1e-7, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 32, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["speaker", "encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true, "gate_loss": true, "use_ctc_loss": true, "ctc_loss_weight": 0.01, "blank_logprob": -8, "ctc_loss_start_iter": 10000 }, "data_config": { "train_tdd": "train.tdd", "val_tdd": "val.tdd", "mf_dirs": ["mf", "mf_2.0"], "lf_dirs": ["lf", "lf_2.0"], "speaker_format": "label", "speaker_dir": "", "speaker_stream": "", "speaker_regex": ["laura"], "text_cleaners": ["flowtron_cleaners"], "randomize": false, "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 24000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 0.0, "prior_cache_path": "/attention_prior_cache", "betab_scaling_factor": 1.0, "keep_ambiguous": false, "max_frame_length": 300 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 1, "n_speaker_dim": 128, "n_text": 84, "n_text_dim": 512, "n_flows": 1, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }
Any insights would be appreciated!

@andi-808

I would say that you have a bad training example. The text may not match the clip exactly. I found that my graphs would look choppy like this when the data was bad. As soon as I cleaned up the errors, it went away. A quick way to find such examples is sketched below.
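
For example, one rough, hypothetical filter (not something from this repo) is to flag utterances whose mel-frames-per-phoneme ratio is an outlier; badly aligned text/audio pairs tend to stand out. The file layout below is assumed: one "file_id|id1 id2 ..." entry per line in train.tdd and mels stored as mf/<file_id>.npy.

```python
# Hypothetical outlier check: flag utterances whose frames-per-token ratio is
# far from the corpus median. Adjust parsing to your actual .tdd layout.
from pathlib import Path
import numpy as np

ratios = {}
for line in Path("train.tdd").read_text().splitlines():
    file_id, token_str = line.split("|", 1)
    n_tokens = len(token_str.split())
    n_frames = np.load(Path("mf") / f"{file_id}.npy").shape[1]
    ratios[file_id] = n_frames / max(n_tokens, 1)

# Print the 20 utterances furthest from the median ratio for manual review.
median = np.median(list(ratios.values()))
worst = sorted(ratios.items(), key=lambda kv: abs(kv[1] - median), reverse=True)[:20]
for file_id, ratio in worst:
    print(f"{file_id}: {ratio:.2f} frames/token (corpus median {median:.2f})")
```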
