
Is it possible to apply attention rollout in Hiera? #43

Open
jefflai0412 opened this issue Feb 19, 2025 · 0 comments

Comments


jefflai0412 commented Feb 19, 2025

Hello,
I am working on implementing attention rollout for the Hiera model, but I encountered challenges due to Hiera’s hierarchical pooling (q_stride), Mask Unit Attention (MU), and Unroll/Reroll operations.

In standard Vision Transformers (ViTs), attention rollout assumes that the identities of the input tokens are combined linearly from layer to layer according to the attention weights, so the per-layer attention matrices can simply be multiplied together (a minimal sketch of that computation is included after the list below). However, Hiera differs in key ways:

  1. Hierarchical Pooling (q_stride)

    • In the blocks where q_stride > 1, the queries are spatially pooled, so a layer emits fewer tokens than it receives and its attention matrix is no longer square.
    • As a result, input tokens are not mixed by a simple square attention matrix at every layer, which breaks the standard assumption of attention rollout.
  2. Mask Unit Attention (MU)

    • In earlier stages, attention is confined to local mask units, meaning that some tokens never directly interact with others.
    • This contrasts with standard ViTs, where every token can eventually attend to every other token.
  3. Unroll and Reroll Transformations

    • Intermediate token representations are spatially reshaped and reordered multiple times using Unroll and Reroll.
    • This makes it difficult to track token dependencies consistently through layers.
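
For reference, this is the standard rollout computation I am starting from (a minimal sketch following Abnar & Zuidema's formulation; it assumes a plain ViT, where `attentions` is a list of per-layer, head-averaged attention matrices that are all square and defined over the same set of tokens):

```python
import torch

def attention_rollout(attentions):
    """Standard attention rollout for a plain ViT.

    attentions: list of per-layer attention maps, each of shape
    (num_tokens, num_tokens), already averaged over heads.
    Returns a (num_tokens, num_tokens) matrix whose entry (i, j) estimates
    how much input token j contributes to output token i.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        # Account for the residual connection, then renormalize rows.
        attn = attn + torch.eye(num_tokens)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        # Compose this layer's mixing with everything below it.
        rollout = attn @ rollout
    return rollout
```

Every step here relies on the attention matrices being square over a fixed token set, which is exactly what q_stride, Mask Unit Attention, and Unroll/Reroll break in Hiera.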

Questions:

  1. Is attention rollout applicable to Hiera given these hierarchical operations?

    • Since q_stride pools the tokens, how should token contributions be propagated through those layers?
    • Can attention rollout be modified to properly handle this hierarchical token aggregation? (A rough sketch of what I have in mind is included after the questions.)
  2. How can we adapt attention rollout to respect Unroll/Reroll transformations?

    • Should we first "undo" unrolling before applying attention rollout?
    • Are there any internal functions in Hiera that can help track token mappings across hierarchy levels?
  3. Would an alternative approach like hierarchical attention flow be more appropriate?

    • Instead of naive attention propagation, should we track token aggregation across pooling layers before applying rollout?
    • Any recommendations for how to do this efficiently?
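
For question 1, the direction I am currently considering is to make the pooling explicit in the rollout product: treat each q_stride step as a linear mixing matrix and let the running rollout matrix become rectangular. This is only a sketch; `layer_attentions` and `pool_matrices` are placeholders for whatever I manage to extract from the model, and the uniform-weight pooling matrix is just an approximation of Hiera's max-pooling, which is not actually linear:

```python
import torch

def hierarchical_rollout(layer_attentions, pool_matrices):
    """Rollout with explicit pooling steps interleaved (sketch).

    layer_attentions[i]: head-averaged attention of layer i with shape
        (tokens_out_i, tokens_in_i); square when q_stride == 1,
        rectangular when the queries are pooled.
    pool_matrices[i]: (tokens_out_i, tokens_in_i) matrix describing how
        layer i's pooling mixes tokens (e.g. uniform weights over each
        pooled window as a stand-in for max-pooling), or None when the
        layer does not pool.
    Returns a (final_tokens, input_tokens) contribution matrix.
    """
    num_inputs = layer_attentions[0].shape[-1]
    rollout = torch.eye(num_inputs)
    for attn, pool in zip(layer_attentions, pool_matrices):
        # Residual branch: plain identity when the token count is
        # unchanged; otherwise the pooling map stands in for the
        # (pooled) skip connection.
        residual = torch.eye(attn.shape[-1]) if pool is None else pool
        mix = attn + residual
        mix = mix / mix.sum(dim=-1, keepdim=True)
        rollout = mix @ rollout
    return rollout
```

Under this framing, Mask Unit Attention in the early stages would simply show up as block-sparsity in each attention matrix once the tokens are put back into a consistent spatial order, which is where the Unroll/Reroll mapping from question 2 becomes unavoidable.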

I appreciate any insights or guidance on this.

Refs:
Quantifying Attention Flow in Transformers: https://arxiv.org/pdf/2005.00928
Example of visualizing a ViT's attention map: https://www.kaggle.com/code/piantic/vision-transformer-vit-visualize-attention-map
