Replies: 3 comments 15 replies
-
I'm looking into progressively training embeddings using the addition scheme. Training progresses from coarser to finer residuals and returns to the coarser layers as needed to retain sufficient separation of the embeddings. I'm experimenting with this in the context of a recurrent network instead of a transformer network though ... Keen to explore alternatives :-)
… On 19 Apr 2023, at 11:41, Max Kraan wrote:
In AudioLM and in the code presented in this repo, the quantizer dimension of the acoustic token sequences is handled by flattening: the tokens from the different quantizer levels are offset into disjoint id ranges and concatenated before being passed into the transformer network.
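For concreteness, here is a minimal sketch of that flatten-and-offset scheme (the names, shapes, and shared embedding table are my own illustration, not the repo's actual implementation):

```python
import torch

# codes: (batch, timesteps, num_quantizers) integer RVQ indices, each in [0, codebook_size)
batch, timesteps, num_quantizers, codebook_size = 2, 100, 4, 1024
codes = torch.randint(0, codebook_size, (batch, timesteps, num_quantizers))

# offset each quantizer level into its own id range so a single shared
# embedding table can tell the levels apart
offsets = torch.arange(num_quantizers) * codebook_size        # (num_quantizers,)
offset_codes = codes + offsets                                 # broadcasts over batch and time

# flatten the quantizer dimension into the sequence dimension:
# sequence length grows from timesteps to timesteps * num_quantizers
flat_tokens = offset_codes.reshape(batch, timesteps * num_quantizers)

embedding = torch.nn.Embedding(num_quantizers * codebook_size, 512)
transformer_input = embedding(flat_tokens)                     # (batch, timesteps * num_quantizers, 512)
```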
With that in mind, I wanted to start a discussion about a potential alternative approach presented in EnCodec. Namely, the EnCodec authors mention that they pass the acoustic token sequence for each quantizer level into a separate embedding layer and then add these embeddings together before passing the resulting input sequence into the transformer network. From section 3.3 in the paper:
For a time step t, the discrete representation obtained at time t − 1 is transformed into a continuous representation using learnt embedding tables, one for each codebook, and which are summed. ... The output of the Transformer is fed into Nq linear layers with as many output channels as the cardinality of each codebook (e.g. 1024), giving us the logits of the estimated distribution over each codebook for time t.
This logic can additionally be found here in their implementation: https://github.com/facebookresearch/encodec/blob/6e8d7eda6fff5b0d589d64f063610c7f6044963e/encodec/model.py#L62
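As a rough sketch of what that scheme looks like in code (my own paraphrase of the section 3.3 description, not the actual EnCodec implementation behind the link above): one embedding table per codebook, whose outputs are summed on the input side, and one linear head per codebook on the output side:

```python
import torch
import torch.nn as nn

num_quantizers, codebook_size, dim = 8, 1024, 512

# one embedding table per codebook; their outputs are summed into a single
# continuous representation per time step
embeddings = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(num_quantizers))

# one linear head per codebook, each producing logits over that codebook
heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_quantizers))

codes = torch.randint(0, codebook_size, (2, 100, num_quantizers))  # (batch, timesteps, num_quantizers)

# sum of per-codebook embeddings -> sequence length stays at `timesteps`
x = sum(emb(codes[..., q]) for q, emb in enumerate(embeddings))    # (batch, timesteps, dim)

# x would be fed through the transformer here; its output at time t is then
# projected by all heads in parallel to predict the codes at time t + 1
logits = [head(x) for head in heads]  # num_quantizers tensors of (batch, timesteps, codebook_size)
```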
To me this seems like an interesting approach to handling multiple quantizers in AudioLM, for a few reasons:
Intuitively I would say it models the residual nature of RVQ codes more appropriately by performing addition instead of concatenation.
The context length is reduced significantly; this becomes especially important for high-fidelity acoustic token modeling, since num_quantizers can be up to 16 (flattening multiplies the sequence length by the number of quantizers, so 16 quantizers over T time steps yield a 16·T-token sequence, versus T when the embeddings are summed).
It seems to work well for EnCodec, where it is used at inference time to predict the acoustic tokens for every quantizer at each time step instead of passing them through the more expensive RVQ clustering stack. The excerpt below, also from section 3.3, is of particular interest here.
We thus neglect potential mutual information between the codebooks at a single time step. This allows to speedup inference (as opposed to having one time step per codebook, or a multi-stage prediction) with a limited impact over the final cross entropy.
Personally I'm planning to start running some experiments to see if this might work when modeling coarse acoustic tokens; however, I'd be interested to get some opinions / feedback on this hypothesis from @lucidrains or anyone else :)
-
@mxkrn i'm actually partial towards an architecture i've seen tailored to residual vq codes in this paper. they do addition of the embeddings, but then also perform attention across the quantization dimension at the end (like axial attention; rough sketch below) and were able to autoregress out the quantization ids in order of coarse to fine. in Encodec's LM scheme, does each token predict the next time step's codes for all quantizers in parallel?
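to spell out what i mean by attention across the quantization dimension, here is a simplified toy sketch (my own illustration, not the paper's actual architecture): fold time into the batch and run a small causal attention layer over the quantizer axis, so quantizer q only attends to the coarser quantizers before it

```python
import torch
import torch.nn as nn

batch, timesteps, num_quantizers, dim = 2, 100, 4, 512

# per-(time step, quantizer) embeddings, e.g. from per-codebook embedding tables
x = torch.randn(batch, timesteps, num_quantizers, dim)

# fold time into the batch so attention runs only across the quantizer axis
x = x.reshape(batch * timesteps, num_quantizers, dim)

# causal mask: quantizer q may not attend to finer quantizers > q
causal_mask = torch.triu(torch.ones(num_quantizers, num_quantizers), diagonal=1).bool()

depth_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
out, _ = depth_attention(x, x, x, attn_mask=causal_mask)

out = out.reshape(batch, timesteps, num_quantizers, dim)
```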
-
It does, yes. If you look at this line of code you can see that the transformer output for one time step is projected into Nq separate linear layers, one per codebook, giving the logits for each quantizer's code at the next time step.
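(To spell that out with a toy sketch using hypothetical names, rather than the actual EnCodec code: at inference time the single transformer output vector for time step t is passed through all the heads at once, and one code per codebook is sampled independently.)

```python
import torch
import torch.nn as nn

num_quantizers, codebook_size, dim = 8, 1024, 512
heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_quantizers))

hidden = torch.randn(2, dim)  # transformer output for a single time step, (batch, dim)

next_codes = []
for head in heads:                          # one independent prediction per codebook
    probs = head(hidden).softmax(dim=-1)    # (batch, codebook_size)
    next_codes.append(torch.multinomial(probs, num_samples=1))

next_codes = torch.cat(next_codes, dim=-1)  # (batch, num_quantizers) codes for the next time step
```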
Looks interesting, will have to look at this in some more detail. From a first glance it looks like a lot of complexity (both the spatial and the depth transformer) for something that is being modeled in a much simpler fashion by AudioLM / EnCodec. Nonetheless interesting if it can be shown to work better, of course.