Do Attention Heads Compete or Cooperate during Counting?
Published Date
2025-02-10
Source
arXiv
Head Name
Mixed Head
Summary
Innovation: The paper investigates how attention heads in transformers behave on a counting task. It finds that the heads act as a pseudo-ensemble: each head solves the same subtask (counting), but their outputs must be aggregated non-uniformly to conform to the task syntax.
Tasks: The study involves training small transformers on the Count01 language, a simple counting task where the model determines if a string has more '1's than '0's, with '2's as noise. The analysis focuses on understanding whether attention heads compete or cooperate during this task.
Significant Result: The research shows that while attention heads individually learn to solve the counting task, they do so as a pseudo-ensemble. The output layer must aggregate their outputs non-uniformly to address both the main task (counting) and an auxiliary syntactic task (ending the sentence).
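To make the Count01 task concrete, the following is a minimal sketch of the labeling rule as the summary describes it: a string is positive when it contains more '1's than '0's, and '2's are ignored as noise. The function name `count01_label`, the strict inequality on ties, and the rejection of other symbols are assumptions made for illustration; the summary does not specify the paper's exact data format or how the end-of-sentence token is represented.

```python
from collections import Counter


def count01_label(s: str) -> bool:
    """Return True if `s` has strictly more '1's than '0's.

    '2' symbols are treated as noise and do not affect the label.
    Treating ties as False is an assumption, not taken from the paper.
    """
    invalid = set(s) - {"0", "1", "2"}
    if invalid:
        raise ValueError(f"unexpected symbols: {sorted(invalid)}")
    counts = Counter(s)  # missing symbols count as 0
    return counts["1"] > counts["0"]


if __name__ == "__main__":
    for example in ["1201", "0021", "22", "10"]:
        print(example, count01_label(example))
```

A rule like this could serve as ground truth when checking whether an individual head's output, taken alone, already tracks the count.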