Hello,
I am working on implementing attention rollout for the Hiera model, but I have run into challenges due to Hiera’s hierarchical pooling (q_stride), Mask Unit Attention (MU), and the Unroll/Reroll operations.
In standard Vision Transformers (ViTs), attention rollout assumes that the identities of the input tokens are mixed linearly through the layers according to the attention weights, so the per-layer (head-averaged) attention matrices, together with the residual connection, can simply be multiplied together. A minimal sketch of this baseline is included just below.
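For reference, this is the baseline I am starting from. The sketch assumes `attentions` is a list of per-layer attention tensors of shape `(num_heads, num_tokens, num_tokens)` collected from a plain ViT; the shapes and the way the maps are collected are my assumption, not part of any particular API.

```python
import torch

def vit_attention_rollout(attentions):
    """Standard attention rollout for a plain ViT (Abnar & Zuidema, 2020)."""
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                           # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)   # account for the residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)      # re-normalize rows
        rollout = attn @ rollout                          # accumulate mixing across layers
    return rollout                                        # (num_tokens, num_tokens)
```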
However, Hiera differs in key ways:

Hierarchical Pooling (q_stride)
In the blocks where q_stride > 1, spatial pooling is applied, so attention produces fewer output tokens than it consumes.
This means input tokens are no longer mixed by a square attention matrix from layer to layer, which breaks the standard rollout assumption; one idea for handling this is sketched just below.
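The idea (an approximation, since Hiera's pooling is not a plain average) is to represent each pooling step as a linear aggregation matrix and fold it into the rollout product. The `window_index` mapping below, which tells me which pooled token each input token falls into, is something I would have to reconstruct from Hiera's pooling/unroll logic; it is not an existing API.

```python
import torch

def pooling_matrix(window_index):
    """Build an (n_out, n_in) aggregation matrix from a token-to-window mapping.

    window_index[i] = index of the pooled output token that input token i falls into.
    Each row spreads a pooled token's identity uniformly over its source tokens.
    """
    window_index = torch.as_tensor(window_index, dtype=torch.long)
    n_in = window_index.numel()
    n_out = int(window_index.max()) + 1
    P = torch.zeros(n_out, n_in)
    P[window_index, torch.arange(n_in)] = 1.0    # mark which inputs feed each pooled token
    return P / P.sum(dim=-1, keepdim=True)       # rows sum to 1

# Across a pooling step, contributions could then be carried as:
#   rollout_after_pool = P @ rollout_before_pool
```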
Mask Unit Attention (MU)
In earlier stages, attention is confined to local mask units, so tokens in different mask units never attend to each other directly within those layers.
This contrasts with standard ViTs, where every token can attend to every other token at every layer. A sketch of how I am thinking about this case follows.
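Within a mask-unit-attention block, the effective token-to-token mixing is block-diagonal: tokens only mix inside their own unit. So one option is to expand the per-unit attention maps into one large block-diagonal matrix and then apply the usual rollout update. The input shape below (head-averaged, `(num_units, tokens_per_unit, tokens_per_unit)`) is my assumption about how the per-unit maps would be collected.

```python
import torch

def block_diagonal_attention(per_unit_attn):
    """Expand per-mask-unit attention maps into a full (N, N) matrix.

    per_unit_attn: (num_units, t, t), already averaged over heads.
    Tokens are assumed to be ordered unit-by-unit, matching the block layout.
    """
    return torch.block_diag(*per_unit_attn)
```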
Unroll and Reroll Transformations
Intermediate token representations are spatially reshaped and reordered multiple times using Unroll and Reroll.
This makes it difficult to track token dependencies consistently through the layers. A small permutation-tracking sketch of what I have in mind follows.
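Concretely, one could record for each Unroll/Reroll step the index permutation it applies to the flattened token order, and map rollout matrices back to spatial order with the inverse permutation. The `perm` vector below is a hypothetical quantity I would have to reconstruct from Hiera's unroll logic; this is only a sketch of the bookkeeping.

```python
import torch

def to_spatial_order(rollout_unrolled, perm):
    """Map a rollout matrix computed in unrolled token order back to spatial order.

    Assumes perm is the index vector with unrolled_tokens = spatial_tokens[perm].
    """
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(perm.numel())     # inverse permutation
    return rollout_unrolled[inv][:, inv]       # reorder both rows and columns
```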
Questions:
Is attention rollout applicable to Hiera given these hierarchical operations?
Since q_stride pools tokens, how should token contributions be propagated correctly?
Can attention rollout be modified to properly handle hierarchical token aggregation?
How can we adapt attention rollout to respect Unroll/Reroll transformations?
Should we first "undo" unrolling before applying attention rollout?
Are there any internal functions in Hiera that can help track token mappings across hierarchy levels?
Would an alternative approach like hierarchical attention flow be more appropriate?
Instead of naive attention propagation, should we track token aggregation across pooling layers before applying rollout?
Any recommendations for how to do this efficiently? (A rough sketch of what I have in mind is included after this list.)
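To make the last question concrete, here is the rough shape of the aggregation-aware rollout I have in mind. Everything about the interface is an assumption: I would need to extract, for every block, a head-averaged attention matrix `attn` of shape `(n_out, n_in)` in a consistent global token order (block-diagonal for mask-unit attention, with pooled queries when q_stride > 1), plus a matrix `pool` of the same shape describing the pooled shortcut path (the identity when there is no pooling).

```python
import torch

def hierarchical_rollout(blocks):
    """Rollout that folds token aggregation into the per-layer product.

    blocks: list of (attn, pool) pairs as described above.
    Returns an (n_final, n_input) matrix of input-token contributions.
    """
    rollout = None
    for attn, pool in blocks:
        mix = 0.5 * attn + 0.5 * pool                 # attention branch + (pooled) residual branch
        mix = mix / mix.sum(dim=-1, keepdim=True)     # keep rows normalized
        rollout = mix if rollout is None else mix @ rollout
    return rollout
```

Does this match how you would expect token identities to propagate through q_stride pooling, or would something like attention flow be better suited to the non-square layers?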
I appreciate any insights or guidance on this.
refs:
Quantifying Attention Flow in Transformers: https://arxiv.org/pdf/2005.00928
Example of visualizing ViT attention maps: https://www.kaggle.com/code/piantic/vision-transformer-vit-visualize-attention-map