Compare to MATCHA-TTS #8

Open
Liujingxiu23 opened this issue Mar 20, 2024 · 1 comment


Liujingxiu23 commented Mar 20, 2024

Thank you for your work and for sharing it!
It seems VoiceFlow-TTS and Matcha-TTS (https://github.com/shivammehta25/Matcha-TTS/) are very similar.
What are the main differences between the two methods?
And how do they compare in voice quality (for example, prosody) and inference speed?
Best

@cantabile-kwok
Member

Yes, the two works were developed independently and came out at almost the same time (within a week of each other).

Although they seem similar, there are indeed some differences.

  1. The flow matching criterion is slightly different. Matcha-TTS conditions the vector field on a data sample $x_1$, while VoiceFlow makes a different choice and conditions on both $x_0$ and $x_1$.
  2. Further, VoiceFlow aims to examine the algorithmic benefit of using "rectified flow" to straighten the sampling trajectories, so its model architecture is kept identical to GradTTS. Meanwhile, Matcha-TTS explores the architecture design of the vector field estimator and proposes a better one.
  3. Additionally, Matcha-TTS incorporates MAS duration learning, while the published version of VoiceFlow does not. You can find MAS-relevant code here though.
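
To make points 1 and 2 concrete, here is a minimal NumPy sketch of how the two kinds of training pairs could be built. It is an illustration only, not code from either repository: the function names, the `sigma_min` and `sigma` parameters, and the exact interpolation formulas are assumptions about how "conditioning on $x_1$" and "conditioning on both $x_0$ and $x_1$" are usually written down, so please check them against the papers before relying on them.

```python
import numpy as np


def x1_conditioned_pair(x0, x1, t, sigma_min=1e-4):
    """Training pair when the path is conditioned on the data sample x1 only.

    Assumed form: x0 is Gaussian noise, x_t moves from x0 towards x1, and the
    regression target is the time derivative of x_t.
    """
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    return xt, target


def x0_x1_conditioned_pair(x0, x1, t, sigma=1e-4, rng=None):
    """Training pair when the path is conditioned on both x0 and x1.

    Assumed form: x_t is drawn from a narrow Gaussian around the straight
    line between x0 and x1, and the target is the constant x1 - x0.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = t * x1 + (1.0 - t) * x0
    xt = mean + sigma * rng.standard_normal(np.shape(x0))
    target = x1 - x0
    return xt, target


def euler_sample(vector_field, x0, n_steps=10):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * vector_field(x, i * dt)
    return x
```

Under these assumptions, a rectified-flow style second stage would call `euler_sample` with the trained estimator to get a generated sample for each noise draw `x0`, and then retrain on those (noise, generated sample) pairs via `x0_x1_conditioned_pair`. With the noise scales set to zero, both pair constructions give the same straight-line path, which is why the two criteria look so close at first glance.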

So, although both works use flow matching for TTS, our focus is not the same. This is not to say that one model is strictly better than the other. Actually, I highly appreciate how nicely and neatly Matcha-TTS is open-sourced.

If one has to compare, you can use the two repos to train on exactly the same data. Personally, I haven't done a strict sample-to-sample comparison, but I remember that Matcha-TTS's vector field estimator architecture achieved decent performance in our code as well. Inference speed also depends on the architecture.

If anybody has run experiments on this, results are welcome!
