
[Feature Request] 2 modes for optimized results/ better quality : singing or speech #76

Open
tomakorea opened this issue Jul 23, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@tomakorea

Is your feature request related to a problem? Please describe.
I mainly use RVC to voice different characters. While it works well enough most of the time, in some cases like screams, breaths, laughs, or vocal fry, the algorithm kind of bugs out and can't follow well, making the result sound really weird.

Describe the solution you'd like
I'm aware that some settings under the hood could be tweaked to get better results; however, these settings aren't exposed to the user. It would be great if we could have presets to select for inference and training that optimize the quality of the results for speech or for singing. For example: male speech, female speech, children's speech, male singing, female singing, etc. This could cover the vocal range of each character more accurately.

Describe alternatives you've considered
Right now, I've found that checkpoint fusion can help extend the vocal range a tiny bit; however, the voice is no longer faithful to the original.

Additional context
If that's not possible, could we make a pre-trained or separate breath/scream/laugh model that focuses only on those sounds? Then we could blend the "voice noises" (breaths, etc.) model with the speech model of the same character.
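For context, checkpoint fusion as mentioned above typically amounts to a weighted average of two models' parameters. A minimal sketch of that idea, assuming two checkpoints stored as name-to-array dicts with matching shapes (RVC's actual fusion tool operates on PyTorch state dicts; NumPy is used here purely for illustration, and `fuse_checkpoints` is a hypothetical helper, not an RVC API):

```python
import numpy as np

def fuse_checkpoints(ckpt_a, ckpt_b, alpha=0.5):
    """Linearly interpolate matching parameters: alpha * A + (1 - alpha) * B."""
    fused = {}
    for name, wa in ckpt_a.items():
        wb = ckpt_b[name]  # assumes both checkpoints share keys and shapes
        fused[name] = alpha * wa + (1.0 - alpha) * wb
    return fused

# Toy example with two tiny "checkpoints"
a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
b = {"layer.weight": np.array([3.0, 4.0]), "layer.bias": np.array([2.0])}
f = fuse_checkpoints(a, b, alpha=0.75)
print(f["layer.weight"])  # 0.75 of model A + 0.25 of model B -> [1.5 2.5]
```

Because the fused weights are a compromise between both source models, timbre fidelity to either original voice degrades as `alpha` moves away from 0 or 1, which matches the observation above.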

fumiama added the documentation (Improvements or additions to documentation) label on Jul 24, 2024
fumiama (Owner) commented Jul 24, 2024

Well, if you use a larger training dataset that includes those voices you mentioned, the model may be able to learn them. Theoretically, the model has the capability to learn any voice feature.

tomakorea (Author) commented Jul 24, 2024

> Well, if you use a larger training dataset that includes those voices you mentioned, the model may be able to learn them. Theoretically, the model has the capability to learn any voice feature.

In this case, does it affect training quality? Many tutorials say the dataset should contain a very coherent and stable voice. However, things like whispering, screams, and laughs are very different from common spoken voice, even when they come from the same person. So, would it be beneficial to train two models: one only for screams/breaths and one only for spoken words?

fumiama (Owner) commented Jul 24, 2024

> Does it affect training quality?

I'm not entirely sure, because I haven't tested it.

> The dataset should contain a very coherent and stable voice.

Yes, because the dataset used for training in RVC is usually quite small. By "larger" I mean a dataset on the same scale as the one used to train the pre-trained model.

> Would it be beneficial to train two models?

Maybe, but it's hard to split those parts.
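On the splitting difficulty: a crude heuristic can at least separate noise-dominated vocalizations (breaths, fry, unvoiced noise) from periodic speech/singing frames. This is my own illustrative sketch, not anything RVC provides; the thresholds are assumptions that would need tuning on real recordings. It labels frames by short-time RMS energy and zero-crossing rate (periodic voiced sound has a low ZCR, noise-like sound a high one):

```python
import numpy as np

def frame_features(x, frame_len=1024, hop=512):
    """Per-frame RMS energy and zero-crossing rate of a mono signal x."""
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        # fraction of adjacent sample pairs whose sign changes
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((rms, zcr))
    return feats

def split_frames(x, zcr_thresh=0.2, rms_thresh=0.01):
    """Label frames: 'silence', 'noisy' (breath/fry-like), or 'voiced'."""
    labels = []
    for rms, zcr in frame_features(x):
        if rms < rms_thresh:
            labels.append("silence")
        elif zcr > zcr_thresh:
            labels.append("noisy")   # noise-dominated: high zero-crossing rate
        else:
            labels.append("voiced")  # periodic: low zero-crossing rate
    return labels

# Synthetic demo: 1 s of a 200 Hz tone ("voiced") then 1 s of white noise ("noisy")
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)
labels = split_frames(np.concatenate([tone, noise]))
```

In practice the hard cases are exactly the in-between sounds (laughs, screams with pitch), which is presumably why an automatic split is difficult, as noted above.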
