Preliminary results #24
eonglints started this conversation in Show and tell
Replies: 2 comments 6 replies
- Pretty encouraging! Been following to see what early results would look like. Appreciate you taking the time to share.
- @eonglints wow! thank you for sharing your results! ❤️ 🙏
Hey, so I'm getting some preliminary results with LibriSpeech (a 1k-hour dataset in comparison to the 60k-hour LibriLight dataset used in AudioLM).
I'm using just the Semantic Transformer at the moment, with features extracted using this HuBERT model and clustered using this k-means model (vocab size of 1000, the nearest I could find to AudioLM's 1024). I'm doing the feature extraction offline beforehand to give my GPU(s) a bit more wiggle room.
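In case it's useful to anyone, here is roughly what that offline extraction looks like using the HubertWithKmeans wrapper bundled with this repo. This is a sketch rather than my exact script: the checkpoint and k-means paths are placeholders for whichever pair you download, and argument names may differ slightly between versions.

```python
import torch
from audiolm_pytorch import HubertWithKmeans

# placeholder paths - substitute whichever HuBERT checkpoint and matching
# k-means model you are using (mine has a vocabulary of 1000 clusters)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert_base_ls960.pt',
    kmeans_path = './hubert_base_ls960_km1000.bin'
)

# offline step: convert a raw 16 kHz waveform into discrete semantic token ids
# and cache them to disk, so the GPU doesn't have to run HuBERT during training
wav = torch.randn(1, 16000 * 10)            # stand-in for 10 seconds of real audio
with torch.no_grad():
    semantic_token_ids = wav2vec(wav)       # (batch, num_frames) integer cluster ids
torch.save(semantic_token_ids, './sample_000001_tokens.pt')
```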
Now obviously, these "semantic tokens" can't be directly converted into a waveform, and, as discussed in the AudioLM paper, there are several approaches to tackling this. They propose training two further transformers using SoundStream tokens, resulting in the highest quality audio we've seen from the literature exploring these ideas. Given all the hard work that @lucidrains has put into implementing these steps, this is obviously something I'm keen to try out. But in the interest of just getting something out there sooner rather than later, I've opted for another approach in the interim.
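For reference, this is roughly how the full three-stage pipeline would be wired up with this repo once coarse and fine stages exist. It's only a sketch: the hyperparameters are illustrative rather than the paper's, the `codec` keyword has changed name across versions, and `wav2vec` / `semantic_transformer` refer to the objects in the other sketches in this post.

```python
from audiolm_pytorch import SoundStream, CoarseTransformer, FineTransformer, AudioLM

# illustrative hyperparameters only, not the settings used in the AudioLM paper
soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = 1000,       # matches the k-means vocab above
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6
)

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6
)

# soundstream and each transformer get trained separately first; AudioLM then
# chains semantic -> coarse -> fine generation and decodes with SoundStream.
# `wav2vec` and `semantic_transformer` are the objects from the other sketches
# in this post (semantic transformer config is given further down).
audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,              # older versions may call this `soundstream =`
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

generated_wav = audiolm(batch_size = 1)
```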
Meta AI released a GAN vocoder that uses these semantic tokens and also applies some duration modelling (to account for the fact that we remove consecutive duplicate tokens). It was trained on a single-speaker dataset and is far from perfect, but at least we get to hear something!
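For anyone unfamiliar with that de-duplication step, it's trivial on its own; the vocoder's duration model is what predicts how long each collapsed token should be held at synthesis time. A minimal illustration:

```python
from itertools import groupby

def dedup_consecutive(token_ids):
    """Collapse runs of repeated semantic tokens, e.g. [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
    return [token for token, _ in groupby(token_ids)]

assert dedup_consecutive([5, 5, 5, 9, 9, 5]) == [5, 9, 5]
```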
The following are generated outputs from the Semantic Transformer with 12 layers, 16 attention heads, a model dimension of 1024, dropout of 0.1, a batch size of 128, and gradient accumulation over 16 steps. Default settings (version 0.0.57) for everything else. Trained on a single GPU for a few days.
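For anyone wanting to reproduce this, the setup is roughly the following. It's a sketch against the current audiolm-pytorch API rather than my exact script: argument names may have shifted since 0.0.57, the dataset path is a placeholder, and applying the 0.1 dropout to both the attention and feed-forward layers is an assumption on my part.

```python
from audiolm_pytorch import SemanticTransformer, SemanticTransformerTrainer

# `wav2vec` is the HubertWithKmeans instance from the feature-extraction sketch above
semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,   # 1000 with the k-means model above
    dim = 1024,
    depth = 12,
    heads = 16,
    attn_dropout = 0.1,   # assumption: the 0.1 dropout applied to attention...
    ff_dropout = 0.1      # ...and feed-forward layers
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    folder = '/path/to/LibriSpeech',    # placeholder dataset path
    batch_size = 128,
    grad_accum_every = 16,
    num_train_steps = 1_000_000         # illustrative; I just let it run for a few days
)

trainer.train()
```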
Obviously, there are all the hacky caveats mentioned above, but at least we can see that something is working. While the generated speech doesn't yet form intelligible words, it definitely produces English-sounding phonemes with appropriate rhythm and prosody.
The first 6 seconds of each of these wave files are the "prompt" and the rest is the continuation.
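In token terms, the 6-second prompt works out to roughly 300 semantic tokens, since HuBERT features come out at about 50 frames per second (before consecutive duplicates are removed). Purely illustrative, with a random tensor standing in for the transformer's output:

```python
import torch

PROMPT_SECONDS = 6
FRAMES_PER_SECOND = 50    # assumption: HuBERT base features, 20 ms stride at 16 kHz
prompt_len = PROMPT_SECONDS * FRAMES_PER_SECOND    # = 300 tokens before de-duplication

# `generated_ids` stands in for the full token sequence coming out of the transformer
generated_ids = torch.randint(0, 1000, (1, 1500))
prompt_ids    = generated_ids[:, :prompt_len]      # the 6-second prime
continuation  = generated_ids[:, prompt_len:]      # everything the model generated itself
```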
Thanks so much to @lucidrains for all his hard work!
Audio_vocoded_generated_tokens_00000048.mp4
Audio_vocoded_generated_tokens_00000064.mp4
Audio_vocoded_generated_tokens_00000070.mp4