The difference between semantic tokens from HuBERT, wav2vec2, and Whisper #149

Open
r666ay opened this issue Aug 7, 2024 · 0 comments

r666ay commented Aug 7, 2024

Thanks for your great work on WhisperSpeech! It is very interesting to extract semantic tokens from the Whisper encoder and then use them to generate acoustic tokens. I am a newcomer to TTS and have some questions about WhisperSpeech; I would greatly appreciate it if you could answer them.

1) How should the information bottleneck for semantic tokens be designed? HuBERT semantic tokens are extracted with a k-means model, and semantic tokens from the Whisper encoder are extracted with a VQ model. However, the embeddings produced by the k-means or VQ model still contain speaker information, whereas the cluster indices retain the semantic information and drop the speaker information, which is why the cluster indices are called semantic tokens. My question is: what acts as the information bottleneck in this process, the vector quantization itself, or mapping the VQ embedding down to a single discrete token?
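
To make question 1 concrete, here is a minimal sketch of how I understand both extraction routes (my own illustration with made-up shapes and a random codebook, not WhisperSpeech's actual code), showing that only the integer cluster index survives the bottleneck:

```python
# Minimal sketch: the discrete cluster index, not the continuous
# embedding, is what passes through the bottleneck.
import torch
from sklearn.cluster import MiniBatchKMeans

# Assume `features` is a (T, D) tensor of frame-level encoder outputs
# (e.g., from a HuBERT layer or the Whisper encoder); shapes are illustrative.
features = torch.randn(1000, 768)

# k-means route (HuBERT-style): fit centroids, then keep only the indices.
kmeans = MiniBatchKMeans(n_clusters=500).fit(features.numpy())
semantic_tokens = kmeans.predict(features.numpy())  # (T,) integer ids

# VQ route (Whisper-encoder-style): nearest-codebook-entry lookup.
codebook = torch.randn(1024, 768)        # learned jointly in practice
dists = torch.cdist(features, codebook)  # (T, 1024) pairwise distances
vq_tokens = dists.argmin(dim=-1)         # (T,) integer ids

# The bottleneck: each frame is reduced from D floats to one integer in
# {0, ..., K-1}, i.e. log2(K) bits per frame. Fine-grained detail such as
# speaker identity is discarded; only the coarse cluster identity survives.
```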

2) Which semantic tokens are better: those from wav2vec2, HuBERT, w2v-BERT, or the Whisper encoder? In the semantic -> acoustic stage, which semantic tokens give the highest accuracy when predicting acoustic tokens? And in the semantic -> text stage, which semantic tokens yield the lowest WER?
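
To clarify the comparison I have in mind for question 2, here is a placeholder sketch of the WER side (not an existing WhisperSpeech script; `jiwer` is just one convenient WER implementation): hold the downstream semantic -> text model fixed and swap only the tokenizer:

```python
# Placeholder evaluation loop: swap the semantic tokenizer, hold the
# rest of the pipeline fixed, and compare corpus-level WER.
import jiwer

def evaluate_tokenizer(tokenize, semantic_to_text, test_set):
    """tokenize: waveform -> semantic tokens (HuBERT k-means, VQ-Whisper, ...).
    semantic_to_text: semantic tokens -> transcript (one fixed decoder).
    test_set: iterable of (waveform, reference_text) pairs."""
    refs, hyps = [], []
    for waveform, reference_text in test_set:
        hyps.append(semantic_to_text(tokenize(waveform)))
        refs.append(reference_text)
    return jiwer.wer(refs, hyps)  # lower WER => tokens kept more semantics
```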
