
Question on NIH syndrome: Mimi (new codec from Kyutai-labs, Moshi) + faster whisper + whisper v3 turbo #160

thiswillbeyourgithub commented Oct 5, 2024

Hi,

Mimi

A few weeks ago Kyutai-labs finally released Moshi, an LLM that also supports STT and TTS in real time. Alongside it, they also released Mimi, the speech codec designed for this. Here's the hf link to Mimi.

I was wondering whether this would be relevant to WhisperSpeech's future roadmap. Quoting their README:

Mimi builds on previous neural audio codecs such as SoundStream and EnCodec, adding a Transformer both in the encoder and decoder, and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the average frame rate of text tokens (~3-4 Hz), and limit the number of autoregressive steps in Moshi. Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match a self-supervised representation from WavLM, which allows modeling semantic and acoustic information with a single model. Interestingly, while Mimi is fully causal and streaming, it learns to match sufficiently well the non-causal representation from WavLM, without introducing any delays. Finally, and similarly to EBEN, Mimi uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality despite its low bitrate.
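To give a feel for how it could slot in, here is a minimal sketch of round-tripping audio through Mimi via the transformers integration. The checkpoint name `kyutai/mimi` and the exact call signatures are taken from the model card as I understood it, so treat them as assumptions and double-check against the current docs:

```python
# Minimal sketch of encoding/decoding audio with Mimi through transformers
# (assumes transformers >= 4.45 with Mimi support; checkpoint name and call
# signatures are taken from the kyutai/mimi model card, not verified here).
import numpy as np
import torch
from transformers import MimiModel, AutoFeatureExtractor

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sampling rate, as placeholder input.
audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=audio,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    # Encode to discrete codes at the ~12.5 Hz frame rate, then decode back.
    encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
    audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]

print(encoder_outputs.audio_codes.shape)  # (batch, num_codebooks, frames)
```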

FasterWhisper

Additionally, hardly anyone seems to use the original whisper package anymore; most projects have moved to faster-whisper, which reimplements parts of it and makes it both faster and more memory efficient. Is this relevant to WhisperSpeech? Maybe not at all, but I preferred to ask.
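For reference, transcription with faster-whisper looks roughly like this, per the project's README (the audio path and model size here are placeholders):

```python
# Sketch of faster-whisper usage, following its README.
from faster_whisper import WhisperModel

# Runs on GPU with fp16; use device="cpu", compute_type="int8" without a GPU.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```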

Whisper v3 turbo

Thirdly, OpenAI released their large-v3-turbo model this week. It seems straightforward to adopt, as I saw it usable in other projects within days of the release, so I was wondering whether you were considering using the v3 turbo version in the future.
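Loading it looks like a one-line change; a sketch with the openai-whisper package (the "turbo" alias is what I saw referenced for large-v3-turbo, so verify it against the package's current model list):

```python
# Sketch: loading the new turbo model with the openai-whisper package.
import whisper

# "turbo" is the alias reportedly added for large-v3-turbo in recent releases.
model = whisper.load_model("turbo")
result = model.transcribe("audio.wav")
print(result["text"])
```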

Thanks!
