Hello,
I am working on implementing attention rollout for the Hiera model, but I have run into challenges due to Hiera’s hierarchical pooling (q_stride), Mask Unit Attention (MU), and the Unroll/Reroll operations.
In standard Vision Transformers (ViTs), attention rollout assumes that the identities of the input tokens are mixed linearly through the layers according to the attention weights, so the per-layer (head-averaged) attention matrices, together with the residual connection, can simply be multiplied together. A minimal sketch of this baseline is included just below.
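For reference, this is the baseline I am starting from. The sketch assumes `attentions` is a list of per-layer attention tensors of shape `(num_heads, num_tokens, num_tokens)` collected from a plain ViT; the shapes and the way the maps are collected are my assumption, not part of any particular API.

```python
import torch

def vit_attention_rollout(attentions):
    """Standard attention rollout for a plain ViT (Abnar & Zuidema, 2020)."""
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                           # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)   # account for the residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)      # re-normalize rows
        rollout = attn @ rollout                          # accumulate mixing across layers
    return rollout                                        # (num_tokens, num_tokens)
```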
However, Hiera differs in key ways:

Hierarchical Pooling (q_stride)
In the blocks where q_stride > 1, spatial pooling is applied, so attention produces fewer output tokens than it consumes.
This means input tokens are no longer mixed by a square attention matrix from layer to layer, which breaks the standard rollout assumption; one idea for handling this is sketched just below.
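The idea (an approximation, since Hiera's pooling is not a plain average) is to represent each pooling step as a linear aggregation matrix and fold it into the rollout product. The `window_index` mapping below, which tells me which pooled token each input token falls into, is something I would have to reconstruct from Hiera's pooling/unroll logic; it is not an existing API.

```python
import torch

def pooling_matrix(window_index):
    """Build an (n_out, n_in) aggregation matrix from a token-to-window mapping.

    window_index[i] = index of the pooled output token that input token i falls into.
    Each row spreads a pooled token's identity uniformly over its source tokens.
    """
    window_index = torch.as_tensor(window_index, dtype=torch.long)
    n_in = window_index.numel()
    n_out = int(window_index.max()) + 1
    P = torch.zeros(n_out, n_in)
    P[window_index, torch.arange(n_in)] = 1.0    # mark which inputs feed each pooled token
    return P / P.sum(dim=-1, keepdim=True)       # rows sum to 1

# Across a pooling step, contributions could then be carried as:
#   rollout_after_pool = P @ rollout_before_pool
```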
Mask Unit Attention (MU)
In earlier stages, attention is confined to local mask units, so tokens in different mask units never attend to each other directly within those layers.
This contrasts with standard ViTs, where every token can attend to every other token at every layer. A sketch of how I am thinking about this case follows.
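Within a mask-unit-attention block, the effective token-to-token mixing is block-diagonal: tokens only mix inside their own unit. So one option is to expand the per-unit attention maps into one large block-diagonal matrix and then apply the usual rollout update. The input shape below (head-averaged, `(num_units, tokens_per_unit, tokens_per_unit)`) is my assumption about how the per-unit maps would be collected.

```python
import torch

def block_diagonal_attention(per_unit_attn):
    """Expand per-mask-unit attention maps into a full (N, N) matrix.

    per_unit_attn: (num_units, t, t), already averaged over heads.
    Tokens are assumed to be ordered unit-by-unit, matching the block layout.
    """
    return torch.block_diag(*per_unit_attn)
```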
Unroll and Reroll Transformations
Intermediate token representations are spatially reshaped and reordered multiple times using Unroll and Reroll.
This makes it difficult to track token dependencies consistently through the layers. A small permutation-tracking sketch of what I have in mind follows.
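Concretely, one could record for each Unroll/Reroll step the index permutation it applies to the flattened token order, and map rollout matrices back to spatial order with the inverse permutation. The `perm` vector below is a hypothetical quantity I would have to reconstruct from Hiera's unroll logic; this is only a sketch of the bookkeeping.

```python
import torch

def to_spatial_order(rollout_unrolled, perm):
    """Map a rollout matrix computed in unrolled token order back to spatial order.

    Assumes perm is the index vector with unrolled_tokens = spatial_tokens[perm].
    """
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(perm.numel())     # inverse permutation
    return rollout_unrolled[inv][:, inv]       # reorder both rows and columns
```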
Questions:
Is attention rollout applicable to Hiera given these hierarchical operations?
Since q_stride pools tokens, how should token contributions be propagated correctly?
Can attention rollout be modified to properly handle hierarchical token aggregation?
How can we adapt attention rollout to respect Unroll/Reroll transformations?
Should we first "undo" unrolling before applying attention rollout?
Are there any internal functions in Hiera that can help track token mappings across hierarchy levels?
Would an alternative approach like hierarchical attention flow be more appropriate?
Instead of naive attention propagation, should we track token aggregation across pooling layers before applying rollout?
Any recommendations for how to do this efficiently? (A rough sketch of what I have in mind is included after this list.)
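To make the last question concrete, here is the rough shape of the aggregation-aware rollout I have in mind. Everything about the interface is an assumption: I would need to extract, for every block, a head-averaged attention matrix `attn` of shape `(n_out, n_in)` in a consistent global token order (block-diagonal for mask-unit attention, with pooled queries when q_stride > 1), plus a matrix `pool` of the same shape describing the pooled shortcut path (the identity when there is no pooling).

```python
import torch

def hierarchical_rollout(blocks):
    """Rollout that folds token aggregation into the per-layer product.

    blocks: list of (attn, pool) pairs as described above.
    Returns an (n_final, n_input) matrix of input-token contributions.
    """
    rollout = None
    for attn, pool in blocks:
        mix = 0.5 * attn + 0.5 * pool                 # attention branch + (pooled) residual branch
        mix = mix / mix.sum(dim=-1, keepdim=True)     # keep rows normalized
        rollout = mix if rollout is None else mix @ rollout
    return rollout
```

Does this match how you would expect token identities to propagate through q_stride pooling, or would something like attention flow be better suited to the non-square layers?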
I appreciate any insights or guidance on this.
refs:
Quantifying Attention Flow in Transformers: https://arxiv.org/pdf/2005.00928
Example of visualizing ViT attention maps: https://www.kaggle.com/code/piantic/vision-transformer-vit-visualize-attention-map