
Question on NIH syndrome: Mimi (new codec from Kyutai-labs, Moshi) + faster whisper + whisper v3 turbo #160

thiswillbeyourgithub commented Oct 5, 2024

Hi,

Mimi

A few weeks ago Kyutai-labs finally released Moshi, an LLM that also supports STT and TTS in real time. Alongside it, they also released Mimi, the speech codec designed for this. Here's the hf link to Mimi.

I was wondering whether this would be relevant to WhisperSpeech's future roadmap. Quoting their README:

Mimi builds on previous neural audio codecs such as SoundStream and EnCodec, adding a Transformer both in the encoder and decoder, and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the average frame rate of text tokens (~3-4 Hz), and limit the number of autoregressive steps in Moshi. Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match a self-supervised representation from WavLM, which allows modeling semantic and acoustic information with a single model. Interestingly, while Mimi is fully causal and streaming, it learns to match sufficiently well the non-causal representation from WavLM, without introducing any delays. Finally, and similarly to EBEN, Mimi uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality despite its low bitrate.
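To give a feel for how it could slot in, here is a minimal sketch of round-tripping audio through Mimi via the transformers integration. The checkpoint name `kyutai/mimi` and the exact call signatures are taken from the model card as I understood it, so treat them as assumptions and double-check against the current docs:

```python
# Minimal sketch of encoding/decoding audio with Mimi through transformers
# (assumes transformers >= 4.45 with Mimi support; checkpoint name and call
# signatures are taken from the kyutai/mimi model card, not verified here).
import numpy as np
import torch
from transformers import MimiModel, AutoFeatureExtractor

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sampling rate, as placeholder input.
audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=audio,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    # Encode to discrete codes at the ~12.5 Hz frame rate, then decode back.
    encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
    audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]

print(encoder_outputs.audio_codes.shape)  # (batch, num_codebooks, frames)
```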

FasterWhisper

Additionally, hardly anyone seems to use the original whisper package anymore; most projects have moved to faster-whisper, which reimplements parts of it and makes it both faster and more memory efficient. Is this relevant to WhisperSpeech? Maybe not at all, but I preferred to ask.
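For reference, transcription with faster-whisper looks roughly like this, per the project's README (the audio path and model size here are placeholders):

```python
# Sketch of faster-whisper usage, following its README.
from faster_whisper import WhisperModel

# Runs on GPU with fp16; use device="cpu", compute_type="int8" without a GPU.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```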

Whisper v3 turbo

Thirdly, OpenAI released their large-v3-turbo model this week. It seems straightforward to adopt, as I saw it usable in other projects within days of the release, so I was wondering whether you were considering using the v3 turbo version in the future.
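Loading it looks like a one-line change; a sketch with the openai-whisper package (the "turbo" alias is what I saw referenced for large-v3-turbo, so verify it against the package's current model list):

```python
# Sketch: loading the new turbo model with the openai-whisper package.
import whisper

# "turbo" is the alias reportedly added for large-v3-turbo in recent releases.
model = whisper.load_model("turbo")
result = model.transcribe("audio.wav")
print(result["text"])
```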

Thanks!
