Preliminary results #24
eonglints started this conversation in Show and tell
Replies: 2 comments 6 replies
- Pretty encouraging! Been following to see what early results would look like. Appreciate you taking the time to share.
- @eonglints wow! thank you for sharing your results! ❤️ 🙏
Hey, so I'm getting some preliminary results with LibriSpeech (a 1k-hour dataset in comparison to the 60k-hour LibriLight dataset used in AudioLM).
I'm using just the Semantic Transformer at the moment, with features extracted using this HuBERT model and clustered using this k-means model (vocab size of 1000, the nearest I could find to AudioLM's 1024). I'm doing the feature extraction offline beforehand to give my GPU(s) a bit more wiggle room.
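In case it's useful to anyone, here is roughly what that offline extraction looks like using the HubertWithKmeans wrapper bundled with this repo. This is a sketch rather than my exact script: the checkpoint and k-means paths are placeholders for whichever pair you download, and argument names may differ slightly between versions.

```python
import torch
from audiolm_pytorch import HubertWithKmeans

# placeholder paths - substitute whichever HuBERT checkpoint and matching
# k-means model you are using (mine has a vocabulary of 1000 clusters)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert_base_ls960.pt',
    kmeans_path = './hubert_base_ls960_km1000.bin'
)

# offline step: convert a raw 16 kHz waveform into discrete semantic token ids
# and cache them to disk, so the GPU doesn't have to run HuBERT during training
wav = torch.randn(1, 16000 * 10)            # stand-in for 10 seconds of real audio
with torch.no_grad():
    semantic_token_ids = wav2vec(wav)       # (batch, num_frames) integer cluster ids
torch.save(semantic_token_ids, './sample_000001_tokens.pt')
```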
Now obviously, these "semantic tokens" can't be directly converted into a waveform, and, as discussed in the AudioLM paper, there are several approaches to tackling this. They propose training two further transformers using SoundStream tokens, resulting in the highest quality audio we've seen from the literature exploring these ideas. Given all the hard work that @lucidrains has put into implementing these steps, this is obviously something I'm keen to try out. But in the interest of just getting something out there sooner rather than later, I've opted for another approach in the interim.
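For reference, this is roughly how the full three-stage pipeline would be wired up with this repo once coarse and fine stages exist. It's only a sketch: the hyperparameters are illustrative rather than the paper's, the `codec` keyword has changed name across versions, and `wav2vec` / `semantic_transformer` refer to the objects in the other sketches in this post.

```python
from audiolm_pytorch import SoundStream, CoarseTransformer, FineTransformer, AudioLM

# illustrative hyperparameters only, not the settings used in the AudioLM paper
soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = 1000,       # matches the k-means vocab above
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6
)

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6
)

# soundstream and each transformer get trained separately first; AudioLM then
# chains semantic -> coarse -> fine generation and decodes with SoundStream.
# `wav2vec` and `semantic_transformer` are the objects from the other sketches
# in this post (semantic transformer config is given further down).
audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,              # older versions may call this `soundstream =`
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

generated_wav = audiolm(batch_size = 1)
```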
Meta AI released a GAN vocoder that uses these semantic tokens and also applies some duration modelling (to account for the fact that we remove consecutive duplicate tokens). It was trained on a single-speaker dataset and is far from perfect, but at least we get to hear something!
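For anyone unfamiliar with that de-duplication step, it's trivial on its own; the vocoder's duration model is what predicts how long each collapsed token should be held at synthesis time. A minimal illustration:

```python
from itertools import groupby

def dedup_consecutive(token_ids):
    """Collapse runs of repeated semantic tokens, e.g. [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
    return [token for token, _ in groupby(token_ids)]

assert dedup_consecutive([5, 5, 5, 9, 9, 5]) == [5, 9, 5]
```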
The following are generated outputs from the Semantic Transformer with 12 layers, 16 attention heads, a model dimension of 1024, dropout of 0.1, a batch size of 128, and gradient accumulation over 16 steps. Default settings (version 0.0.57) for everything else. Trained on a single GPU for a few days.
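For anyone wanting to reproduce this, the setup is roughly the following. It's a sketch against the current audiolm-pytorch API rather than my exact script: argument names may have shifted since 0.0.57, the dataset path is a placeholder, and applying the 0.1 dropout to both the attention and feed-forward layers is an assumption on my part.

```python
from audiolm_pytorch import SemanticTransformer, SemanticTransformerTrainer

# `wav2vec` is the HubertWithKmeans instance from the feature-extraction sketch above
semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,   # 1000 with the k-means model above
    dim = 1024,
    depth = 12,
    heads = 16,
    attn_dropout = 0.1,   # assumption: the 0.1 dropout applied to attention...
    ff_dropout = 0.1      # ...and feed-forward layers
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    folder = '/path/to/LibriSpeech',    # placeholder dataset path
    batch_size = 128,
    grad_accum_every = 16,
    num_train_steps = 1_000_000         # illustrative; I just let it run for a few days
)

trainer.train()
```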
Obviously, there are all the hacky caveats mentioned above, but at least we can see that something is working. While the generated speech doesn't yet form intelligible words, it definitely produces English-sounding phonemes with appropriate rhythm and prosody.
The first 6 seconds of each of these wave files are the "prompt" and the rest is the continuation.
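In token terms, the 6-second prompt works out to roughly 300 semantic tokens, since HuBERT features come out at about 50 frames per second (before consecutive duplicates are removed). Purely illustrative, with a random tensor standing in for the transformer's output:

```python
import torch

PROMPT_SECONDS = 6
FRAMES_PER_SECOND = 50    # assumption: HuBERT base features, 20 ms stride at 16 kHz
prompt_len = PROMPT_SECONDS * FRAMES_PER_SECOND    # = 300 tokens before de-duplication

# `generated_ids` stands in for the full token sequence coming out of the transformer
generated_ids = torch.randint(0, 1000, (1, 1500))
prompt_ids    = generated_ids[:, :prompt_len]      # the 6-second prime
continuation  = generated_ids[:, prompt_len:]      # everything the model generated itself
```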
Thanks so much to @lucidrains for all his hard work!
Audio_vocoded_generated_tokens_00000048.mp4
Audio_vocoded_generated_tokens_00000064.mp4
Audio_vocoded_generated_tokens_00000070.mp4