Replies: 3 comments 15 replies
-
I'm looking into progressively training embeddings using the addition scheme. Training progresses from coarser to finer residuals and returns to the coarser layers as needed to retain sufficient separation of the embeddings. I'm experimenting with this in the context of a recurrent network instead of a transformer network though ... Keen to explore alternatives :-)
… On 19 Apr 2023, at 11:41, Max Kraan wrote:
In AudioLM and in the code presented in this repo, the quantizer dimension of the acoustic token sequences is handled by flattening: the tokens from the different quantizer levels are offset into disjoint id ranges and concatenated before being passed into the transformer network.
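For concreteness, here is a minimal sketch of that flatten-and-offset scheme (the names, shapes, and shared embedding table are my own illustration, not the repo's actual implementation):

```python
import torch

# codes: (batch, timesteps, num_quantizers) integer RVQ indices, each in [0, codebook_size)
batch, timesteps, num_quantizers, codebook_size = 2, 100, 4, 1024
codes = torch.randint(0, codebook_size, (batch, timesteps, num_quantizers))

# offset each quantizer level into its own id range so a single shared
# embedding table can tell the levels apart
offsets = torch.arange(num_quantizers) * codebook_size        # (num_quantizers,)
offset_codes = codes + offsets                                 # broadcasts over batch and time

# flatten the quantizer dimension into the sequence dimension:
# sequence length grows from timesteps to timesteps * num_quantizers
flat_tokens = offset_codes.reshape(batch, timesteps * num_quantizers)

embedding = torch.nn.Embedding(num_quantizers * codebook_size, 512)
transformer_input = embedding(flat_tokens)                     # (batch, timesteps * num_quantizers, 512)
```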
With that in mind, I wanted to start a discussion about a potential alternative approach presented in EnCodec. Namely, the EnCodec authors mention that they pass the acoustic token sequence for each quantizer level into a separate embedding layer and then add these embeddings together before passing the resulting input sequence into the transformer network. From section 3.3 in the paper:
For a time step t, the discrete representation obtained at time t − 1 is transformed into a continuous representation using learnt embedding tables, one for each codebook, and which are summed. ... The output of the Transformer is fed into Nq linear layers with as many output channels as the cardinality of each codebook (e.g. 1024), giving us the logits of the estimated distribution over each codebook for time t.
This logic can additionally be found here in their implementation: https://github.com/facebookresearch/encodec/blob/6e8d7eda6fff5b0d589d64f063610c7f6044963e/encodec/model.py#L62
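As a rough sketch of what that scheme looks like in code (my own paraphrase of the section 3.3 description, not the actual EnCodec implementation behind the link above): one embedding table per codebook, whose outputs are summed on the input side, and one linear head per codebook on the output side:

```python
import torch
import torch.nn as nn

num_quantizers, codebook_size, dim = 8, 1024, 512

# one embedding table per codebook; their outputs are summed into a single
# continuous representation per time step
embeddings = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(num_quantizers))

# one linear head per codebook, each producing logits over that codebook
heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_quantizers))

codes = torch.randint(0, codebook_size, (2, 100, num_quantizers))  # (batch, timesteps, num_quantizers)

# sum of per-codebook embeddings -> sequence length stays at `timesteps`
x = sum(emb(codes[..., q]) for q, emb in enumerate(embeddings))    # (batch, timesteps, dim)

# x would be fed through the transformer here; its output at time t is then
# projected by all heads in parallel to predict the codes at time t + 1
logits = [head(x) for head in heads]  # num_quantizers tensors of (batch, timesteps, codebook_size)
```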
To me this seems like an interesting approach to handling multiple quantizers in AudioLM, for a few reasons:
Intuitively I would say it models the residual nature of RVQ codes more appropriately by performing addition instead of concatenation.
The context length is reduced significantly; this becomes especially important for high-fidelity acoustic token modeling, since num_quantizers can be up to 16 (flattening multiplies the sequence length by the number of quantizers, so 16 quantizers over T time steps yield a 16·T-token sequence, versus T when the embeddings are summed).
It seems to work well for EnCodec, where it is used at inference time to predict the acoustic tokens for every quantizer at each time step instead of passing them through the more expensive RVQ clustering stack. The excerpt below, also from section 3.3, is of particular interest here.
We thus neglect potential mutual information between the codebooks at a single time step. This allows to speedup inference (as opposed to having one time step per codebook, or a multi-stage prediction) with a limited impact over the final cross entropy.
Personally I'm planning to start running some experiments to see if this might work when modeling coarse acoustic tokens; however, I'd be interested to get some opinions / feedback on this hypothesis from @lucidrains or anyone else :)
-
@mxkrn i'm actually partial towards an architecture i've seen tailored to residual vq codes in this paper. they do addition of the embeddings, but then also perform attention across the quantization dimension at the end (like axial attention; rough sketch below) and were able to autoregress out the quantization ids in order of coarse to fine. in Encodec's LM scheme, does each token predict the next time step's codes for all quantizers in parallel?
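to spell out what i mean by attention across the quantization dimension, here is a simplified toy sketch (my own illustration, not the paper's actual architecture): fold time into the batch and run a small causal attention layer over the quantizer axis, so quantizer q only attends to the coarser quantizers before it

```python
import torch
import torch.nn as nn

batch, timesteps, num_quantizers, dim = 2, 100, 4, 512

# per-(time step, quantizer) embeddings, e.g. from per-codebook embedding tables
x = torch.randn(batch, timesteps, num_quantizers, dim)

# fold time into the batch so attention runs only across the quantizer axis
x = x.reshape(batch * timesteps, num_quantizers, dim)

# causal mask: quantizer q may not attend to finer quantizers > q
causal_mask = torch.triu(torch.ones(num_quantizers, num_quantizers), diagonal=1).bool()

depth_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
out, _ = depth_attention(x, x, x, attn_mask=causal_mask)

out = out.reshape(batch, timesteps, num_quantizers, dim)
```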
-
It does, yes. If you look at this line of code you can see that the transformer output for one time step is projected into Nq separate linear layers, one per codebook, giving the logits for each quantizer's code at the next time step.
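(To spell that out with a toy sketch using hypothetical names, rather than the actual EnCodec code: at inference time the single transformer output vector for time step t is passed through all the heads at once, and one code per codebook is sampled independently.)

```python
import torch
import torch.nn as nn

num_quantizers, codebook_size, dim = 8, 1024, 512
heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_quantizers))

hidden = torch.randn(2, dim)  # transformer output for a single time step, (batch, dim)

next_codes = []
for head in heads:                          # one independent prediction per codebook
    probs = head(hidden).softmax(dim=-1)    # (batch, codebook_size)
    next_codes.append(torch.multinomial(probs, num_samples=1))

next_codes = torch.cat(next_codes, dim=-1)  # (batch, num_quantizers) codes for the next time step
```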
Looks interesting, will have to look at this in some more detail. From a first glance it looks like a lot of complexity (both the spatial and the depth transformer) for something that is being modeled in a much simpler fashion by AudioLM / EnCodec. Nonetheless interesting if it can be shown to work better, of course.