Controllable and Interpretable Singing Voice Decomposition via Assem-VC #27

AK391 opened this issue Oct 26, 2021 · 9 comments

@AK391

AK391 commented Oct 26, 2021

Just saw this paper: https://arxiv.org/abs/2110.12676. When will the repo be updated for it? Thanks!

@980202006

Are there any details about the speaker embedding? For example, what model is used to generate it, whether it is pre-trained, and what dataset it is trained on.

@wookladin

@AK391
Thanks for your interest!
Currently, we don't have a specific plan to release the code for that paper.
We will add links to the paper and the demo page in the README soon.

@980202006
We just used nn.Embedding without pre-training. Thanks!
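
For anyone wondering what that looks like in practice, here is a minimal sketch of a jointly trained speaker lookup table in PyTorch. The sizes are made-up values for illustration, not the actual Assem-VC configuration.

```python
# Minimal sketch of a jointly trained speaker lookup table (illustrative sizes only).
import torch
import torch.nn as nn

n_speakers = 52      # total number of training speakers (assumption)
spk_emb_dim = 256    # speaker embedding dimension (assumption)

speaker_table = nn.Embedding(n_speakers, spk_emb_dim)

# Each speaker index selects one learnable vector; the decoder is conditioned on it
# and the vector is updated by backprop together with the rest of the model.
speaker_ids = torch.tensor([3, 17])
spk_vectors = speaker_table(speaker_ids)   # shape: (2, 256)
```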

@980202006

Thank you!

@iehppp2010

iehppp2010 commented Dec 28, 2021

@wookladin

I have tried to reproduce this paper.
My training mel MSE loss reaches 0.18, while the dev mel loss plateaus at 0.75.
[screenshot: Decoder train/dev mel loss curves]

After training the 'Decoder' model, I used it to do GTA fine-tuning of the HiFi-GAN model you provide.
Below is the HiFi-GAN fine-tuning loss:
[screenshot: HiFi-GAN fine-tuning loss curves]
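
By GTA fine-tuning I mean generating ground-truth-aligned mels with the frozen text-to-mel model (teacher forcing) and fine-tuning the vocoder on them. A rough sketch of that step, where `decoder_model` and `dataset` are placeholders rather than the actual Assem-VC classes:

```python
# Rough sketch of GTA mel generation for vocoder fine-tuning.
# `decoder_model` and `dataset` are placeholders, not the real Assem-VC objects.
import numpy as np
import torch

decoder_model.eval()
with torch.no_grad():
    for item in dataset:
        # Teacher-forced inference, so the predicted mel stays time-aligned
        # with the ground-truth waveform.
        gta_mel = decoder_model(item["text"], item["mel"], item["speaker_id"])
        # HiFi-GAN is then fine-tuned on (GTA mel, ground-truth waveform) pairs.
        np.save(item["wav_path"].replace(".wav", ".gta.npy"), gta_mel.cpu().numpy())
```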

After that, I tried to control speaker identity by simply switching the speaker embedding to the target speaker, as described in the paper.
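
A sketch of what I mean by that step, with placeholder names (`model`, `reference_text`, `reference_mel`, `TARGET_SPEAKER_INDEX`), not the repo's actual API: the content and pitch come from the reference singing voice, and only the speaker index is swapped to the target speaker.

```python
# Sketch of the conversion step (placeholder names, not the repo's actual API).
import torch

target_speaker_id = torch.tensor([TARGET_SPEAKER_INDEX])  # index of the target NUS-48E speaker

with torch.no_grad():
    converted_mel = model.inference(
        text=reference_text,       # lyrics of the CSD reference audio
        mel_source=reference_mel,  # mel-spectrogram of the CSD reference audio
        speaker_id=target_speaker_id,
    )
```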
I used a training audio clip of the CSD female speaker as the reference audio (link below):
https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used the speaker PAMR from the NUS-48E dataset as the target speaker:
https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is:
https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing

I found that the lyrics are hard to hear clearly.

My dataset config:
dev set:
CSD speaker: en48, en49, and en50 were chosen;
NUS-48E speakers: ADIZ's 13 and JLEE's 05 were chosen.
train set: the other songs in CSD and NUS-48E.

My speaker embedding dimension is 256 (perhaps 256 is too large?).

What could be the problem with my model?
Also, could you share your Decoder model's train/dev loss?
My Decoder model has a noticeably larger mel MSE loss on the dev set than on the train set.

@wookladin

@iehppp2010
Hi. Your alignment encoder, Cotatron, doesn't seem to be working properly.
As explained in the paper, we transferred Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK.
Did you transfer from those weights?
You can find the pre-trained weights in this Google Drive link.
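
For what it's worth, a minimal sketch of transferring from such a checkpoint, assuming a Lightning-style checkpoint with a `state_dict` key; `cotatron` and the file path are placeholders, not the repo's actual names:

```python
# Minimal sketch of transferring pre-trained Cotatron weights.
# `cotatron` and the checkpoint path are placeholders / assumptions.
import torch

ckpt = torch.load("cotatron_libritts_vctk.ckpt", map_location="cpu")
# strict=False tolerates layers whose shapes differ between the TTS
# pre-training setup and the singing setup (e.g. the speaker table).
missing, unexpected = cotatron.load_state_dict(ckpt["state_dict"], strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```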

@iehppp2010

iehppp2010 commented Dec 29, 2021

@wookladin
Thanks for your quick reply.
I did use the pre-trained weights.
When I train the 'Decoder' model, the 'Cotatron' aligner model is frozen.
I found the plotted alignment is not as good as in other TTS models, e.g. Tacotron 2.
[screenshot: Cotatron alignment plot]
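
(For completeness, by "frozen" I mean something like the following PyTorch sketch, with `cotatron` as a placeholder for the aligner module:)

```python
# Freeze the aligner so only the Decoder is updated.
for p in cotatron.parameters():
    p.requires_grad_(False)
cotatron.eval()  # also disables dropout and keeps normalization statistics fixed
```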

Do I need to fine-tune the 'Cotatron' model on the singing dataset to get a better alignment result?
Looking forward to your reply.

@wookladin

@iehppp2010
Yes.
You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data.
That should give better alignment and sample quality.

@iehppp2010


@wookladin
Thanks for your quick reply.
After fine-tuning the Cotatron model, train.loss_reconstruction converges to about 0.2, while val.loss_reconstruction reaches its minimum of about 0.5 at step 3893.

[screenshot: Cotatron fine-tuning train/val reconstruction loss curves]

I used that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.

I found that when testing with audio that the fine-tuned Cotatron model has never seen, I can't get good sample quality.
I guess the reason is that the Cotatron model does not produce a good alignment...

So, how can I get the Cotatron model to produce better alignment on unseen singing audio?
Also, could you provide more training details?

@betty97

betty97 commented May 11, 2022

@iehppp2010, I am also trying to reproduce the results of this paper. I have one question regarding the dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds"; did you do it manually for both CSD and NUS-48E, or some other way? Thanks!!
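
(Not an official answer, but one possible way to get segments in roughly that range automatically is to split on silence with librosa and merge neighbouring chunks; the thresholds below are arbitrary assumptions, not the authors' actual preprocessing.)

```python
# Unofficial illustration: split a song on silence, then merge adjacent chunks
# so each segment stays within roughly 1-12 seconds. Thresholds are arbitrary.
import librosa

def split_song(path, sr=22050, top_db=30, min_len=1.0, max_len=12.0):
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent (start, end) in samples

    segments = []
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif (end - cur_start) / sr <= max_len:
            cur_end = end                        # keep growing the current segment
        else:
            segments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append((cur_start, cur_end))

    # Keep only segments longer than min_len; very long non-silent stretches
    # would still need an extra hard cut.
    return [y[s:e] for s, e in segments if (e - s) / sr >= min_len]
```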
