ViTGaze 👀

Gaze Following with Interaction Features in Vision Transformers

Yuehao Song¹, Xinggang Wang¹ 📧, Jingfeng Yao¹, Wenyu Liu¹, Jinglin Zhang², Xiangmin Xu³

¹ Huazhong University of Science and Technology, ² Shandong University, ³ South China University of Technology

(📧) corresponding author.

ArXiv Preprint (arXiv 2403.12778)

[Demo animations: Demo0, Demo1]

News

  • Mar. 25th, 2024: We released an initial version of ViTGaze.
  • Mar. 19th, 2024: We released our paper on arXiv. Code and models are coming soon. Please stay tuned! ☕️

Introduction

A plain Vision Transformer can also perform gaze following with the simple ViTGaze framework!

[Figure: ViTGaze framework]

Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce ViTGaze, a novel single-modality gaze following framework. In contrast to previous methods, ViTGaze relies mainly on a powerful encoder: the decoder accounts for less than 1% of the parameters. Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in AUC, 5.1% improvement in AP) and performance comparable to multi-modality methods with 59% fewer parameters.
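To make the idea of reading human-scene interactions out of self-attention more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the released ViTGaze code: the function names `attention_weights` and `person_scene_interaction`, the toy token values, and the binary `head_mask` are all assumptions made for this example. It computes scaled dot-product attention weights between patch tokens, then averages the attention rows of the patches covered by a person's head to obtain one interaction weight per scene patch.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one softmax row per query token."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def person_scene_interaction(attn, head_mask):
    """Weighted average of the attention rows selected by the head mask.

    attn[i][j] is how strongly token i attends to token j; head_mask[i] is
    the (hypothetical) weight of token i inside the person's head region.
    """
    total = sum(head_mask)
    n_tokens = len(attn[0])
    return [
        sum(w * attn[i][j] for i, w in enumerate(head_mask)) / total
        for j in range(n_tokens)
    ]


# Toy example: four patch tokens (a 2x2 grid) with 2-dimensional features.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
attn = attention_weights(tokens, tokens)

# Assume the person's head covers only the first patch.
head_mask = [1.0, 0.0, 0.0, 0.0]
interaction = person_scene_interaction(attn, head_mask)
print(interaction)
```

Each entry of `interaction` is a weight over scene patches, and the weights sum to 1. In a real ViT the same readout would be applied per attention head and per layer, with features produced by the pre-trained encoder rather than hand-written tokens.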

Results

Results from the ViTGaze paper

[Figure: comparison]

Results on GazeFollow

| AUC | Avg. Dist. | Min. Dist. |
| --- | --- | --- |
| 0.949 | 0.105 | 0.047 |

Results on VideoAttentionTarget

| AUC | Dist. | AP |
| --- | --- | --- |
| 0.938 | 0.102 | 0.905 |
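As a reading aid for the distance columns, the sketch below shows one common way such metrics are computed in gaze-following evaluation. The definitions are assumptions, not taken from this repository: the predicted gaze point is the heatmap peak in normalized image coordinates, "Avg. Dist." is the L2 distance to the mean of the annotators' points, and "Min. Dist." is the smallest L2 distance to any single annotation. The helper names and the pixel-center normalization are hypothetical choices for this example.

```python
import math


def l2(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def heatmap_argmax(heatmap):
    """Return the heatmap peak as (x, y) in normalized [0, 1] coordinates.

    Uses pixel centers, so a peak in column c of a width-w map gives
    x = (c + 0.5) / w. This normalization is an assumption of the sketch.
    """
    height, width = len(heatmap), len(heatmap[0])
    row, col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return ((col + 0.5) / width, (row + 0.5) / height)


def gaze_distances(pred, annotations):
    """Return (avg_dist, min_dist) for one prediction and several annotations."""
    n = len(annotations)
    mean_point = (
        sum(a[0] for a in annotations) / n,
        sum(a[1] for a in annotations) / n,
    )
    avg_dist = l2(pred, mean_point)
    min_dist = min(l2(pred, a) for a in annotations)
    return avg_dist, min_dist


# Toy 2x2 heatmap whose peak is the bottom-right cell.
pred = heatmap_argmax([[0.0, 0.0], [0.0, 1.0]])
avg_dist, min_dist = gaze_distances(pred, [(0.75, 0.75), (0.25, 0.75)])
print(pred, avg_dist, min_dist)
```

Lower is better for both distances, while AUC and AP are higher-is-better scores.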

Corresponding checkpoints are released:

Getting Started

Acknowledgements

ViTGaze is based on detectron2. We use the efficient multi-head attention implemented in the xFormers library.

Citation

If you find ViTGaze useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{vitgaze,
    title={ViTGaze: Gaze Following with Interaction Features in Vision Transformers},
    author={Yuehao Song and Xinggang Wang and Jingfeng Yao and Wenyu Liu and Jinglin Zhang and Xiangmin Xu},
    journal={arXiv preprint arXiv:2403.12778},
    year={2024}
}