
[Feature Request] 2 modes for optimized results/ better quality : singing or speech #76

Open
tomakorea opened this issue Jul 23, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@tomakorea

Is your feature request related to a problem? Please describe.
I mainly use RVC to voice different characters. While it works well enough most of the time, in some cases like screams, breaths, laughs, or vocal fry, the algorithm kind of bugs out and can't follow well, making the result sound really weird.

Describe the solution you'd like
I'm aware that some settings under the hood could be tweaked to get better results; however, these settings aren't exposed to the user. It would be great if we could have presets to select for inference and training that optimize the quality of the results for speech or for singing. For example: male speech, female speech, children's speech, male singing, female singing, etc. This could cover the vocal range of each character more accurately.

Describe alternatives you've considered
Right now, I've found that checkpoint fusion can help extend the vocal range a tiny bit; however, the voice is no longer faithful to the original.

Additional context
If that's not possible, could we make a pre-trained or separate breath/scream/laugh model that focuses only on those sounds? Then we could blend the "voice noises" (breaths, etc.) model with the speech model of the same character.
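For context, checkpoint fusion as mentioned above typically amounts to a weighted average of two models' parameters. A minimal sketch of that idea, assuming two checkpoints stored as name-to-array dicts with matching shapes (RVC's actual fusion tool operates on PyTorch state dicts; NumPy is used here purely for illustration, and `fuse_checkpoints` is a hypothetical helper, not an RVC API):

```python
import numpy as np

def fuse_checkpoints(ckpt_a, ckpt_b, alpha=0.5):
    """Linearly interpolate matching parameters: alpha * A + (1 - alpha) * B."""
    fused = {}
    for name, wa in ckpt_a.items():
        wb = ckpt_b[name]  # assumes both checkpoints share keys and shapes
        fused[name] = alpha * wa + (1.0 - alpha) * wb
    return fused

# Toy example with two tiny "checkpoints"
a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
b = {"layer.weight": np.array([3.0, 4.0]), "layer.bias": np.array([2.0])}
f = fuse_checkpoints(a, b, alpha=0.75)
print(f["layer.weight"])  # 0.75 of model A + 0.25 of model B -> [1.5 2.5]
```

Because the fused weights are a compromise between both source models, timbre fidelity to either original voice degrades as `alpha` moves away from 0 or 1, which matches the observation above.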

fumiama added the documentation (Improvements or additions to documentation) label on Jul 24, 2024
fumiama (Owner) commented Jul 24, 2024

Well, if you use a larger training dataset that includes those voices you mentioned, the model may be able to learn them. Theoretically, the model has the capability to learn any voice feature.

tomakorea (Author) commented Jul 24, 2024

> Well, if you use a larger training dataset that includes those voices you mentioned, the model may be able to learn them. Theoretically, the model has the capability to learn any voice feature.

In this case, does it affect training quality? Many tutorials say the dataset should contain a very coherent and stable voice. However, things like whispering, screams, and laughs are very different from common spoken voice, even when they come from the same person. So, would it be beneficial to train two models: one only for screams/breaths and one only for spoken words?

fumiama (Owner) commented Jul 24, 2024

> Does it affect training quality?

I'm not entirely sure, because I haven't tested it.

> The dataset should contain a very coherent and stable voice.

Yes, because the dataset used for training in RVC is usually quite small. By "larger" I mean a dataset on the same scale as the one used to train the pre-trained model.

> Would it be beneficial to train two models?

Maybe, but it's hard to split those parts.
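On the splitting difficulty: a crude heuristic can at least separate noise-dominated vocalizations (breaths, fry, unvoiced noise) from periodic speech/singing frames. This is my own illustrative sketch, not anything RVC provides; the thresholds are assumptions that would need tuning on real recordings. It labels frames by short-time RMS energy and zero-crossing rate (periodic voiced sound has a low ZCR, noise-like sound a high one):

```python
import numpy as np

def frame_features(x, frame_len=1024, hop=512):
    """Per-frame RMS energy and zero-crossing rate of a mono signal x."""
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        # fraction of adjacent sample pairs whose sign changes
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((rms, zcr))
    return feats

def split_frames(x, zcr_thresh=0.2, rms_thresh=0.01):
    """Label frames: 'silence', 'noisy' (breath/fry-like), or 'voiced'."""
    labels = []
    for rms, zcr in frame_features(x):
        if rms < rms_thresh:
            labels.append("silence")
        elif zcr > zcr_thresh:
            labels.append("noisy")   # noise-dominated: high zero-crossing rate
        else:
            labels.append("voiced")  # periodic: low zero-crossing rate
    return labels

# Synthetic demo: 1 s of a 200 Hz tone ("voiced") then 1 s of white noise ("noisy")
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)
labels = split_frames(np.concatenate([tone, noise]))
```

In practice the hard cases are exactly the in-between sounds (laughs, screams with pitch), which is presumably why an automatic split is difficult, as noted above.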
