Ultimate-Awesome-Transformer-Attention

This repo contains a comprehensive list of papers on Vision Transformer & Attention, including papers, code, and related websites.
This list is maintained by Min-Hung Chen. (Actively updated)

If you find any missing papers, feel free to create pull requests, open issues, or email me.
Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list.
Feel free to share this list with others!

[Update: January, 2024] Added all the related papers from NeurIPS 2023!
[Update: December, 2023] Added all the related papers from ICCV 2023!
[Update: September, 2023] Split the multi-modal paper list to README_multimodal.md
[Update: June, 2023] Added all the related papers from ICML 2023!
[Update: June, 2023] Added all the related papers from CVPR 2023!
[Update: February, 2023] Added all the related papers from ICLR 2023!
[Update: December, 2022] Added attention-free papers from Networks Beyond Attention (GitHub) made by Jianwei Yang
[Update: November, 2022] Added all the related papers from NeurIPS 2022!
[Update: October, 2022] Split the 2nd half of the paper list to README_2.md
[Update: October, 2022] Added all the related papers from ECCV 2022!
[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer!
[Update: June, 2022] Added all the related papers from CVPR 2022!


Overview

------ (The following papers are moved to README_multimodal.md) ------

------ (The following papers are moved to README_2.md) ------


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Survey

  • "A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 (Purdue). [Paper][GitHub]
  • "Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 (Tencent). [Paper][GitHub]
  • "From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 (Newcastle University, UK). [Paper][GitHub]
  • "When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 (Oxford). [Paper][GitHub]
  • "Foundation Models for Video Understanding: A Survey", arXiv, 2024 (Aalborg University, Denmark). [Paper][GitHub]
  • "Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 (Chongqing University). [Paper][GitHub]
  • "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 (GigaAI, China). [Paper][GitHub]
  • "Video Diffusion Models: A Survey", arXiv, 2024 (Bielefeld University, Germany). [Paper][GitHub]
  • "Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 (Lehigh + UPenn). [Paper]
  • "Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 (NUS). [Paper][GitHub]
  • "A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 (HKUST). [Paper][GitHub]
  • "State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 (Anhui University). [Paper][GitHub]
  • "Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 (IIT Patna). [Paper]
  • "From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 (UIUC). [Paper][GitHub]
  • "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 (Northeastern). [Paper]
  • "Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 (Kyung Hee University). [Paper]
  • "Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 (Beijing University of Posts and Telecommunications). [Paper][GitHub]
  • "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 (Lehigh University, Pennsylvania). [Paper][GitHub]
  • "Large Multimodal Agents: A Survey", arXiv, 2024 (CUHK). [Paper][GitHub]
  • "Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 (BIGAI). [Paper][GitHub]
  • "Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 (Qufu Normal University, China). [Paper]
  • "The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper]
  • "Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 (Westlake University, China). [Paper][GitHub]
  • "Transformer for Object Re-Identification: A Survey", arXiv, 2024 (Wuhan University). [Paper]
  • "Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 (Huawei). [Paper][GitHub]
  • "MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 (Tencent). [Paper]
  • "From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 (Shanghai AI Lab). [Paper]
  • "A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 (Huawei). [Paper]
  • "A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 (Motional, Massachusetts). [Paper]
  • "A Survey on Transformer Compression", arXiv, 2024 (Huawei). [Paper]
  • "Vision + Language Applications: A Survey", CVPRW, 2023 (Ritsumeikan University, Japan). [Paper][GitHub]
  • "Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (Tsinghua & Oxford). [Paper]
  • "A Survey of Visual Transformers", TNNLS, 2023 (CAS). [Paper][GitHub]
  • "Video Understanding with Large Language Models: A Survey", arXiv, 2023 (University of Rochester). [Paper][GitHub]
  • "Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 (NTU, Singapore). [Paper]
  • "A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 (Huawei). [Paper][GitHub]
  • "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 (Tencent). [Paper][GitHub]
  • "Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 (JHU). [Paper]
  • "Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 (Institute for Research in Fundamental Sciences (IPM), Iran). [Paper]
  • "Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 (Tencent). [Paper][GitHub (in construction)]
  • "Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 (York University). [Paper]
  • "Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 (valeo.ai, France). [Paper][GitHub]
  • "A Survey on Video Diffusion Models", arXiv, 2023 (Fudan). [Paper][GitHub]
  • "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 (Microsoft). [Paper]
  • "Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 (Microsoft). [Paper]
  • "Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 (University of Western Australia). [Paper]
  • "RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (University of Sydney). [Paper]
  • "A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (The University of Sydney). [Paper]
  • "From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (UESTC). [Paper]
  • "Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (MBZUAI). [Paper][GitHub]
  • "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (Oxford). [Paper]
  • "Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (Xi'an Jiaotong University). [Paper]
  • "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (HKUST). [Paper]
  • "Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (Mila). [Paper]
  • "Vision Language Transformers: A Survey", arXiv, 2023 (Boise State University, Idaho). [Paper]
  • "Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (Peking). [Paper][GitHub]
  • "Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (Microsoft). [Paper]
  • "A Survey on Multimodal Large Language Models", arXiv, 2023 (USTC). [Paper][GitHub]
  • "2D Object Detection with Transformers: A Review", arXiv, 2023 (German Research Center for Artificial Intelligence, Germany). [Paper]
  • "Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (Eldorado’s Institute of Technology, Brazil). [Paper]
  • "Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (NYU). [Paper]
  • "Visual Tuning", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
  • "Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (Fudan University). [Paper]
  • "Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (University of Peradeniya, Sri Lanka). [Paper]
  • "A Review of Deep Learning for Video Captioning", arXiv, 2023 (Deakin University, Australia). [Paper]
  • "Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (NTU, Singapore). [Paper][GitHub]
  • "Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (?). [Paper][GitHub (in construction)]
  • "Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (KAIST). [Paper]
  • "Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (Berkeley + Google). [Paper]
  • "Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (RWTH Aachen University, Germany). [Paper][GitHub]
  • "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (IBM). [Paper][GitHub]
  • "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (Indian Institute of Information Technology). [Paper]
  • "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (Pengcheng Laboratory). [Paper][GitHub]
  • "A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
  • "Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 (Tsinghua University, China). [Paper][Springer][GitHub]
  • "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
  • "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (Covenant University, Nigeria). [Paper]
  • "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (Sejong University). [Paper]
  • "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
  • "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
  • "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
  • "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
  • "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
  • "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
  • "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
  • "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
  • "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
  • "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
  • "Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
  • "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
  • "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
  • "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
  • "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
  • "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
  • "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
  • "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
  • "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
  • "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
  • "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
  • "A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
  • "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]

[Back to Overview]

Image Classification / Backbone

Replace Conv w/ Attention

Pure Attention

Conv-stem + Attention

  • GSA-Net: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (lucidrains)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper][PyTorch (lucidrains)]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][PyTorch (in construction)]

Conv + Attention

[Back to Overview]

Vision Transformer

General Vision Transformer

  • ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)][JAX (conceptofmind)]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
  • VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
  • PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
  • T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
  • FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
  • DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
  • Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
  • Twins: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
  • ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
  • DVT: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
  • Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
  • TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
  • ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
  • DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
  • So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
  • NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
  • Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
  • CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
  • V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
  • P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
  • PVTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
  • LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
  • ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
  • ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
  • ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
  • Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]
  • PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]
  • LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][PyTorch]
  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]
  • ?: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]
  • LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]
  • TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
  • INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]
  • GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]
  • GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][PyTorch]
  • CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code (in construction)]
  • LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper][Code (in construction)]
  • BiFormer: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (CUHK). [Paper][PyTorch]
  • AbSViT: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (Berkeley). [Paper][PyTorch][Website]
  • DependencyViT: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (MIT). [Paper][Code (in construction)]
  • ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (Fudan). [Paper][PyTorch (in construction)]
  • SViT: "Vision Transformer with Super Token Sampling", CVPR, 2023 (CAS). [Paper]
  • PaCa-ViT: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (NC State). [Paper][PyTorch]
  • GC-ViT: "Global Context Vision Transformers", ICML, 2023 (NVIDIA). [Paper][PyTorch]
  • MAGNETO: "MAGNETO: A Foundation Transformer", ICML, 2023 (Microsoft). [Paper]
  • Fcaformer: "Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023 (Intellifusion, China). [Paper][PyTorch]
  • SMT: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • FLatten-Transformer: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • Path-Ensemble: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 (Alibaba). [Paper]
  • SG-Former: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 (NUS). [Paper][PyTorch]
  • SimPool: "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023 (National Technical University of Athens). [Paper]
  • LaPE: "LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023 (Peking). [Paper][PyTorch]
  • CB: "Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023 (NAVER). [Paper]
  • STL: "Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
  • ClusterFormer: "ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023 (Rochester Institute of Technology (RIT)). [Paper]
  • SVT: "Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023 (Microsoft). [Paper][PyTorch][Website]
  • CrossFormer++: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (Zhejiang University). [Paper][PyTorch]
  • QFormer: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (The University of Sydney). [Paper][Code (in construction)]
  • ViT-Calibrator: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (Zhejiang University). [Paper]
  • SpectFormer: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • UniNeXt: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (Alibaba). [Paper]
  • CageViT: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (Southern University of Science and Technology). [Paper]
  • ?: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (UIUC). [Paper]
  • 2-D-SSM: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (Tel Aviv). [Paper][PyTorch]
  • NaViT: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023 (DeepMind). [Paper]
  • DAT++: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
  • ?: "Replacing softmax with ReLU in Vision Transformers", arXiv, 2023 (DeepMind). [Paper]
  • RMT: "RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023 (CAS). [Paper]
  • reg: "Vision Transformers Need Registers", arXiv, 2023 (Meta). [Paper]
  • ChannelViT: "Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023 (Insitro, CA). [Paper]
  • EViT: "EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023 (Nankai University). [Paper]
  • ViR: "ViR: Vision Retention Networks", arXiv, 2023 (NVIDIA). [Paper]
  • abs-win: "Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023 (Meta). [Paper]
  • FMViT: "FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
  • GroupMixFormer: "Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023 (HKU). [Paper][PyTorch]
  • PGT: "Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023 (DeepMind). [Paper]
  • SCHEME: "SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023 (UCSD). [Paper]
  • Agent-Attention: "Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
  • ViTamin: "ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024 (ByteDance). [Paper][PyTorch]
  • HIRI-ViT: "HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024 (HiDream.ai, China). [Paper]
  • SPFormer: "SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024 (JHU). [Paper]
  • manifold-K: "A Manifold Representation of the Key in Vision Transformers", arXiv, 2024 (University of Oslo, Norway). [Paper]
  • BiXT: "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024 (University of Melbourne). [Paper]
  • VisionLLaMA: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024 (Meituan). [Paper][Code (in construction)]
  • xT: "xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024 (Berkeley). [Paper]
  • ACC-ViT: "ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024 (Purdue). [Paper]
  • ViTAR: "ViTAR: Vision Transformer with Any Resolution", arXiv, 2024 (CAS). [Paper]
  • iLLaMA: "Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024 (Shanghai AI Lab). [Paper]

Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel Aviv). [Paper][JAX]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch-1][PyTorch-2]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
  • M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][PyTorch]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][PyTorch]
  • DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][PyTorch]
  • GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][PyTorch]
  • ?: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
  • ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
  • Token-Pooling: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (Apple). [Paper]
  • Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code (in construction)]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
  • ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Rice University). [Paper]
  • HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Northeastern University). [Paper]
  • ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][PyTorch]
  • HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper][PyTorch]
  • STViT: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (Alibaba). [Paper][PyTorch]
  • SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (MIT). [Paper][Website]
  • Slide-Transformer: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (Tsinghua University). [Paper][Code (in construction)]
  • RIFormer: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
  • EfficientViT: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (Meta). [Paper]
  • ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (UMich). [Paper]
  • Sparsifiner: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (University of Toronto). [Paper]
  • ?: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (Baidu). [Paper]
  • LTMP: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 (Ghent University, Belgium). [Paper][PyTorch][Website]
  • ReViT: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 (Midea Group, China). [Paper]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 (MIT). [Paper][PyTorch]
  • MPCViT: "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 (Peking). [Paper][PyTorch]
  • MST: "Masked Spiking Transformer", ICCV, 2023 (HKUST). [Paper]
  • EfficientFormerV2: "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 (Snap). [Paper][PyTorch]
  • DiffRate: "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • ElasticViT: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 (Microsoft). [Paper]
  • FastViT: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 (Apple). [Paper][PyTorch]
  • SeiT: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 (NAVER). [Paper][PyTorch]
  • TokenReduction: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 (Aalborg University, Denmark). [Paper][PyTorch][Website]
  • LGViT: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 (Beijing Institute of Technology). [Paper]
  • LBP-WHT: "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 (UT Austin). [Paper]
  • FAT: "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 (CAS). [Paper][PyTorch]
  • MCUFormer: "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
  • SoViT: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 (DeepMind). [Paper]
  • CloFormer: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (CAS). [Paper]
  • Quadformer: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (Tel Aviv). [Paper][Code (in construction)]
  • SparseFormer: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (NUS). [Paper][Code (in construction)]
  • EMO: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (Tencent). [Paper][PyTorch]
  • ByteFormer: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (Apple). [Paper]
  • ?: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (Jilin University). [Paper]
  • FasterViT: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (NVIDIA). [Paper]
  • NextViT: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (Baidu). [Paper]
  • SkipAt: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (Qualcomm). [Paper]
  • MSViT: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 (Qualcomm). [Paper]
  • DiT: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
  • ?: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 (German Research Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
  • Mobile-V-MoEs: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 (Apple). [Paper]
  • PPT: "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 (Huawei). [Paper]
  • MatFormer: "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 (Google). [Paper]
  • SparseFormer: "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 (NUS). [Paper][PyTorch]
  • GTP-ViT: "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 (CSIRO Data61, Australia). [Paper][PyTorch]
  • ToFu: "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 (Samsung). [Paper]
  • Cached-Transformer: "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 (CUHK). [Paper]
  • LF-ViT: "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 (Harbin Institute of Technology). [Paper][PyTorch]
  • EfficientMod: "Efficient Modulation for Vision Networks", ICLR, 2024 (Microsoft). [Paper][PyTorch]
  • NOSE: "MLP Can Be A Good Transformer Learner", CVPR, 2024 (MBZUAI). [Paper][PyTorch]
  • SLAB: "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 (Huawei). [Paper][PyTorch]
  • S2: "When Do We Not Need Larger Vision Models?", arXiv, 2024 (Berkeley). [Paper][PyTorch]

Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][PyTorch]
  • DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code (in construction)]
  • ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][PyTorch (in construction)]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]
  • SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (ByteDance). [Paper][PyTorch]
  • MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
  • InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (Shanghai AI Laboratory). [Paper][PyTorch]
  • SwiftFormer: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 (MBZUAI). [Paper][PyTorch]
  • SCSC: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 (Megvii). [Paper]
  • PSLT: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (Sun Yat-sen University). [Paper][Website]
  • RepViT: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
  • ?: "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 (NTU, Singapore). [Paper]
  • UPDP: "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 (AMD). [Paper]

Training + Transformer

  • iGPT: "Generative Pretraining From Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
  • CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
  • MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
  • DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
  • CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
  • MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
  • SiT: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
  • MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
  • EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
  • iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)]
  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
  • data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][PyTorch]
  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
  • ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
  • SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][PyTorch]
  • mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
  • SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]
  • GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][PyTorch]
  • DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]
  • ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
  • ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
  • UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
  • ?: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]
  • LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]
  • FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper][PyTorch (in construction)]
  • ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][PyTorch]
  • ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]
  • CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]
  • MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]
  • Mask3D: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (Meta). [Paper]
  • VisualAtom: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch][Website]
  • MixedAE: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (Huawei). [Paper]
  • TBM: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
  • LGSimCLR: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (UMich). [Paper][PyTorch]
  • DisCo-CLIP: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (IDEA). [Paper][PyTorch (in construction)]
  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
  • MAGE: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (Google). [Paper][PyTorch]
  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (SenseTime). [Paper][PyTorch]
  • iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (CAS). [Paper][PyTorch]
  • DropKey: "DropKey for Vision Transformer", CVPR, 2023 (Meitu). [Paper]
  • FlexiViT: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (Google). [Paper][Tensorflow]
  • RA-CLIP: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (Alibaba). [Paper]
  • CLIPPO: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (Google). [Paper][JAX]
  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • HPM: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (CAS). [Paper][PyTorch]
  • LocalMIM: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (Peking University). [Paper]
  • MaskAlign: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • RILS: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
  • RelaxMIM: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (Megvii). [Paper]
  • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
  • ?: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (Google). [Paper]
  • OpenCLIP: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (LAION). [Paper][PyTorch]
  • DiHT: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (Meta). [Paper][PyTorch]
  • M3I-Pretraining: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • SN-Net: "Stitchable Neural Networks", CVPR, 2023 (Monash University). [Paper][PyTorch]
  • MAE-Lite: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (Megvii). [Paper][PyTorch]
  • ViT-22B: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (Google). [Paper]
  • GHN-3: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (Samsung). [Paper][PyTorch]
  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (Westlake University, China). [Paper][PyTorch]
  • PQCL: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (Alibaba). [Paper][PyTorch]
  • DreamTeacher: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 (NVIDIA). [Paper][Website]
  • OFDB: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch]
  • MFF: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • TL-Align: "Token-Label Alignment for Vision Transformers", ICCV, 2023 (Tsinghua University). [Paper][PyTorch]
  • SMMix: "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 (Xiamen University). [Paper][PyTorch]
  • DiffMAE: "Diffusion Models as Masked Autoencoders", ICCV, 2023 (Meta). [Paper][Website]
  • MAWS: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 (Meta). [Paper][PyTorch]
  • CountBench: "Teaching CLIP to Count to Ten", ICCV, 2023 (Google). [Paper]
  • CLIPpy: "Perceptual Grouping in Vision-Language Models", ICCV, 2023 (Apple). [Paper]
  • CiT: "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 (Meta). [Paper][PyTorch]
  • I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 (Meta). [Paper]
  • EfficientTrain: "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • StableRep: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 (Google). [Paper][PyTorch]
  • LaCLIP: "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 (Google). [Paper][PyTorch]
  • DesCo: "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 (UCLA). [Paper]
  • ?: "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 (UW). [Paper]
  • CapPa: "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 (DeepMind). [Paper][JAX]
  • IV-CL: "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 (Google). [Paper]
  • CLIPA: "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 (UC Santa Cruz). [Paper][PyTorch]
  • Hummingbird: "Towards In-context Scene Understanding", NeurIPS, 2023 (DeepMind). [Paper]
  • RevColV2: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 (Megvii). [Paper][PyTorch]
  • ALIA: "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 (Berkeley). [Paper][PyTorch]
  • ?: "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 (UW). [Paper]
  • CCViT: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (Wuhan University). [Paper]
  • SoftCLIP: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (Tencent). [Paper]
  • RECLIP: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (Google). [Paper]
  • DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (Meta). [Paper]
  • ?: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (Meta). [Paper]
  • Filter: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (Apple). [Paper]
  • ?: "Improved baselines for vision-language pre-training", arXiv, 2023 (Meta). [Paper]
  • 3T: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (Google). [Paper]
  • ADDP: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (CUHK + Tsinghua). [Paper]
  • MOFI: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (Apple). [Paper]
  • MaPeT: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (UniMoRE, Italy). [Paper][PyTorch]
  • RECO: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (Google). [Paper]
  • CLIPA-v2: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
  • PatchMixing: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (Boston). [Paper][Website]
  • SN-Netv2: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (Monash University). [Paper][PyTorch (in construction)]
  • CLIP-GPT: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 (Dublin City University, Ireland). [Paper]
  • FlexPredict: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 (Meta). [Paper]
  • Soft-MoE: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 (DeepMind). [Paper]
  • DropPos: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 (CAS). [Paper][PyTorch]
  • MIRL: "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 (Baidu). [Paper]
  • CMM: "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 (Nanjing University). [Paper]
  • LC-MAE: "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 (NAVER). [Paper]
  • SILC: "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 (ETHZ). [Paper]
  • CLIPTex: "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 (Apple). [Paper]
  • NxTP: "Object Recognition as Next Token Prediction", arXiv, 2023 (Meta). [Paper][PyTorch]
  • ?: "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 (Google). [Paper][PyTorch]
  • SynCLR: "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 (Google). [Paper][PyTorch]
  • EWA: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
  • DTM: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (NAVER). [Paper]
  • SSAT: "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 (UNC Charlotte). [Paper][Code (in construction)]
  • FEC: "Neural Clustering based Visual Representation Learning", CVPR, 2024 (Zhejiang). [Paper]
  • EfficientTrain++: "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 (Tsinghua). [Paper][PyTorch]
  • DVT: "Denoising Vision Transformers", arXiv, 2024 (USC). [Paper][PyTorch][Website]
  • AIM: "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 (Apple). [Paper][PyTorch]
  • DDM: "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 (Meta). [Paper]
  • CrossMAE: "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 (Berkeley). [Paper][PyTorch][Website]
  • IWM: "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 (Meta). [Paper]
  • ?: "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 (Vector Institute). [Paper]

Robustness + Transformer

  • ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
  • SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper]
  • ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
  • ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
  • Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
  • ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
  • ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
  • ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
  • Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
  • ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
  • ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
  • ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
  • ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
  • ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
  • PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]
  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel Aviv). [Paper][PyTorch]
  • ?: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]
  • NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]
  • ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
  • MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
  • ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
  • ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • ?: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
  • ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
  • ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
  • ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]
  • model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
  • ?: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
  • RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][PyTorch]
  • DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][PyTorch]
  • TGR: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (CUHK). [Paper][PyTorch]
  • TrojViT: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (Indiana University Bloomington). [Paper]
  • RSPC: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (MPI). [Paper]
  • TORA-ViT: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (The University of Sydney). [Paper]
  • BadViT: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
  • ?: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (University of Pittsburgh). [Paper]
  • RobustMAE: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 (USTC). [Paper][PyTorch (in construction)]
  • ?: "Efficiently Robustify Pre-trained Models", ICCV, 2023 (IIT Roorkee, India). [Paper]
  • ?: "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 (Tsinghua). [Paper]
  • CleanCLIP: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 (UCLA). [Paper][PyTorch]
  • QBBA: "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 (Oxford). [Paper]
  • RBFormer: "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 (HKUST). [Paper]
  • PreLayerNorm: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (POSTECH). [Paper]
  • CertViT: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (INRIA). [Paper][PyTorch]
  • RoCLIP: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (UCLA). [Paper]
  • DeepMIM: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
  • TAP-ADL: "Robustifying Token Attention for Vision Transformers", ICCV, 2023 (MPI). [Paper][PyTorch]
  • EWA: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
  • SlowFormer: "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 (UC Davis). [Paper][PyTorch]
  • DTM: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (NAVER). [Paper]
  • SWARM: "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 (Zhejiang). [Paper][Code (in construction)]
  • ?: "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 (Shanghai AI Lab). [Paper]

Model Compression + Transformer

  • ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
  • MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
  • FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
  • APQ-ViT: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 (Beihang University). [Paper]
  • SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
  • PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
  • EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
  • Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
  • SAViT: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (Hikvision). [Paper]
  • VTC-LFC: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
  • Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
  • VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
  • VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
  • SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
  • PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
  • AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
  • SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
  • oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]
  • CPT-V: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (UT Austin). [Paper]
  • TPS: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 (Megvii). [Paper][PyTorch]
  • GPUSQ-ViT: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 (Fudan). [Paper]
  • X-Pruner: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 (James Cook University, Australia). [Paper][PyTorch (in construction)]
  • NoisyQuant: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 (Nanjing University). [Paper]
  • NViT: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 (NVIDIA). [Paper]
  • BinaryViT: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 (Huawei). [Paper][PyTorch]
  • OFQ: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 (HKUST). [Paper][PyTorch]
  • UPop: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • COMCAT: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 (Rutgers). [Paper][PyTorch]
  • Evol-Q: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 (UT Austin). [Paper][Code (in construction)]
  • BiViT: "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 (Zhejiang University). [Paper]
  • I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 (CAS). [Paper][PyTorch]
  • RepQ-ViT: "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 (CAS). [Paper][PyTorch]
  • LLM-FP4: "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 (HKUST). [Paper][Code (in construction)]
  • Q-HyViT: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 (Electronics and Telecommunications Research Institute (ETRI), Korea). [Paper]
  • Bi-ViT: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 (Beihang University). [Paper]
  • BinaryViT: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 (CAS). [Paper]
  • Zero-TP: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 (Princeton). [Paper]
  • ?: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 (Qualcomm). [Paper]
  • VVTQ: "Variation-aware Vision Transformer Quantization", arXiv, 2023 (HKUST). [Paper][PyTorch]
  • DIMAP: "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 (A*STAR). [Paper][Code (in construction)]
  • MADTP: "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 (Fudan). [Paper][Code (in construction)]
  • DC-ViT: "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 (Nanjing University). [Paper]

[Back to Overview]

Attention-Free

MLP-Series

  • RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
  • Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
  • ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
  • ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
  • ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
  • S2-MLPv2: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
  • RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
  • Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
  • Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
  • ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
  • MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)]
  • gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
  • S2-MLP: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
  • STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
  • AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHA Technology, Japan). [Paper]
  • ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]
  • AFFNet: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 (Microsoft). [Paper]
  • Strip-MLP: "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 (Southern University of Science and Technology). [Paper][PyTorch]

Other Attention-Free

  • DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][PyTorch]
  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch]
  • ConvNeXt: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][PyTorch]
  • RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][PyTorch]
  • FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
  • HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch][Website]
  • S4ND: "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 (Stanford). [Paper]
  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]
  • MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]
  • Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]
  • CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][PyTorch]
  • SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][PyTorch]
  • ConvNeXt-V2: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (Meta). [Paper][PyTorch]
  • SPANet: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 (Korea Institute of Science and Technology). [Paper][Code (in construction)][Website]
  • DFFormer: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (Rikkyo University, Japan). [Paper][Code (in construction)]
  • ?: "ConvNets Match Vision Transformers at Scale", arXiv, 2023 (DeepMind). [Paper]
  • VMamba: "VMamba: Visual State Space Model", arXiv, 2024 (CAS). [Paper][PyTorch]
  • Vim: "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • VRWKV: "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
  • LocalMamba: "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 (University of Sydney). [Paper][PyTorch]
  • SiMBA: "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 (Microsoft). [Paper][PyTorch]
  • PlainMamba: "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 (University of Edinburgh, Scotland). [Paper][PyTorch]
  • EfficientVMamba: "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 (The University of Sydney). [Paper][PyTorch]
  • RDNet: "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 (NAVER). [Paper]
  • MambaOut: "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 (NUS). [Paper][PyTorch]

[Back to Overview]

Analysis for Transformer

  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch]
  • ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
  • ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
  • ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
  • ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
  • FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
  • ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
  • ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
  • ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotics). [Paper]
  • ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
  • FDSL: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
  • AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
  • ?: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
  • ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
  • ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
  • ?: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (Princeton). [Paper]
  • AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
  • ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
  • MJP: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 (Tencent). [Paper][PyTorch]
  • ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
  • ?: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion Israel Institute Of Technology). [Paper]
  • ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
  • ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
  • ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]
  • ViT-CX: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (HKUST). [Paper]
  • ?: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
  • IAV: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (KAIST). [Paper]
  • ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (UW). [Paper][PyTorch]
  • ImageNet-X: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (Meta). [Paper]
  • ?: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ?: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (NAVER). [Paper][PyTorch (in construction)]
  • ?: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (Stanford). [Paper]
  • CLIP-Dissect: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (UCSD). [Paper]
  • ?: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 (CMU). [Paper]
  • ?: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 (Maryland). [Paper][PyTorch][Website]
  • ?: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 (Apple). [Paper]
  • ?: "On Data Scaling in Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • ?: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper]
  • Vision-DiffMask: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 (University of Amsterdam). [Paper][PyTorch]
  • ?: "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 (University of Mannheim, Germany). [Paper][PyTorch]
  • ?: "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 (Goethe University Frankfurt, Germany). [Paper]
  • BoB: "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 (NYU). [Paper][PyTorch]
  • ViT-CoT: "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 (Indiana University Bloomington, Indiana). [Paper]
  • AtMan: "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 (Aleph Alpha, Germany). [Paper][PyTorch]
  • AttentionViz: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 (Harvard). [Paper][Website]
  • ?: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 (POSTECH). [Paper]
  • ?: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 (Maryland). [Paper]
  • ViT-ReciproCAM: "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 (Intel). [Paper]
  • Eureka-moment: "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 (Bosch). [Paper]
  • INTR: "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 (OSU). [Paper][PyTorch]
  • ?: "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 (Korea Institute of Science and Technology (KIST)). [Paper][PyTorch]
  • RelatiViT: "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 (Tsinghua). [Paper][Code (in construction)][Website]
  • TokenTM: "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 (Illinois Institute of Technology). [Paper]
  • SaCo: "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 (Illinois Institute of Technology). [Paper]
  • ?: "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 (Meta). [Paper][Code (in construction)]
  • LeGrad: "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 (University of Bonn, Germany). [Paper][PyTorch]

[Back to Overview]

Detection

Object Detection

  • General:
    • detrex: "detrex: Benchmarking Detection Transformers", arXiv, 2023 (IDEA). [Paper][PyTorch]
  • CNN-based backbone:
    • DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch]
    • Deformable DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
    • UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
    • SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
    • Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
    • PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
    • TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
    • Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
    • ViT-YOLO: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
    • ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
    • DIL-ViT: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
    • Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
    • CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
    • DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
    • GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
    • Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
    • Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
    • DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
    • DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
    • SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
    • DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
    • REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
    • DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
    • DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
    • Cornerformer: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (Huawei). [Paper]
    • ?: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (Microsoft). [Paper][Code (in construction)]
    • Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (CAS). [Paper][PyTorch]
    • KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
    • TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
    • Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
    • SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
    • Pair-DETR: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (Amazon). [Paper]
    • Group-DETR-v2: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (Baidu). [Paper]
    • KD-DETR: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (Baidu). [Paper]
    • D3ETR: "D3ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • Teach-DETR: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (CUHK). [Paper][Code (in construction)]
    • DETA: "NMS Strikes Back", arXiv, 2022 (UT Austin). [Paper][PyTorch]
    • ViT-Adapter: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • DDQ: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SiameseDETR: "Siamese DETR", CVPR, 2023 (SenseTime). [Paper][PyTorch]
    • SAP-DETR: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 (CAS). [Paper]
    • Q-DETR: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 (Beihang University). [Paper][Code (in construction)]
    • Lite-DETR: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 (IDEA). [Paper][PyTorch]
    • H-DETR: "DETRs with Hybrid Matching", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 (IDEA, China). [Paper][PyTorch]
    • IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • SQR: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 (CMU). [Paper][PyTorch]
    • DQ-Det: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 (ByteDance). [Paper]
    • SpeedDETR: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 (Northeastern University). [Paper]
    • AlignDet: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 (ByteDance). [Paper][PyTorch][Website]
    • Focus-DETR: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 (Huawei). [Paper][PyTorch][MindSpore]
    • Plain-DETR: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 (Microsoft). [Paper][Code (in construction)]
    • ASAG: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 (Tencent). [Paper][PyTorch]
    • Stable-DINO: "Detection Transformer with Stable Matching", ICCV, 2023 (IDEA). [Paper][Code (in construction)]
    • imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 (CAS). [Paper][PyTorch]
    • Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 (Baidu). [Paper][Code (in construction)]
    • Co-DETR: "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 (SenseTime). [Paper][PyTorch]
    • DETRDistill: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 (USTC). [Paper]
    • Decoupled-DETR: "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 (SenseTime). [Paper]
    • StageInteractor: "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 (Nanjing University). [Paper]
    • Rank-DETR: "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
    • Cal-DETR: "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 (MBZUAI). [Paper][PyTorch]
    • KS-DETR: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 (Toyota Technological Institute). [Paper][PyTorch]
    • FeatAug-DETR: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 (CUHK). [Paper][Code (in construction)]
    • RT-DETR: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 (Baidu). [Paper]
    • Align-DETR: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 (Megvii). [Paper][PyTorch]
    • Box-DETR: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch (in construction)]
    • RefineBox: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • ?: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 (Toronto). [Paper]
    • Gen2Det: "Gen2Det: Generate to Detect", arXiv, 2023 (Meta). [Paper]
    • ViT-CoMer: "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 (Baidu). [Paper][PyTorch]
    • Salience-DETR: "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 (Xi'an Jiaotong University). [Paper][PyTorch]
    • MS-DETR: "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 (Baidu). [Paper][Code (in construction)]
  • Transformer-based backbone:
    • ViT-FRCNN: "Toward Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
    • WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
    • YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
    • ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
    • ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
    • FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
    • DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
    • ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
    • UViT: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (Google). [Paper]
    • CFDT: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (Huawei). [Paper]
    • D2ETR: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper]
    • DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (IDEA, China). [Paper][PyTorch]
    • SimPLR: "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 (UvA). [Paper]

[Back to Overview]

3D Object Detection

  • AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
  • Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
  • CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
  • Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
  • 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
  • M3DETR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
  • MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
  • VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
  • CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
  • TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
  • SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
  • LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
  • BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
  • BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
  • VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
  • STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
  • MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
  • CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
  • BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
  • SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
  • CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
  • SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (Waymo). [Paper]
  • EMMF-Det: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (Hikvision). [Paper]
  • UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
  • MsSVT: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
  • DeepInteraction: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (Fudan). [Paper][PyTorch]
  • PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
  • Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
  • PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
  • AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
  • CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
  • CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]
  • Focal-PETR: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • Li3DeTr: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (University of Coimbra, Portugal). [Paper]
  • PiMAE: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • OcTr: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 (Beihang University). [Paper]
  • MonoATT: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
  • PVT-SSD: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • ConQueR: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 (CUHK). [Paper][PyTorch][Website]
  • FrustumFormer: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 (CAS). [Paper][PyTorch (in construction)]
  • DSVT: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • AShapeFormer: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 (Hunan University). [Paper][Code (in construction)]
  • MV-JAR: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • FocalFormer3D: "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
  • 3DPPE: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 (Houmo AI, China). [Paper][PyTorch]
  • PARQ: "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 (Northeastern). [Paper][PyTorch][Website]
  • CMT: "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 (Megvii). [Paper][PyTorch]
  • MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 (Shanghai AI Laboratory). [Paper][PyTorch]
  • DTH: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 (Cruise). [Paper]
  • PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 (Megvii). [Paper][PyTorch]
  • MV2D: "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 (Beihang University). [Paper]
  • ?: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 (CMU). [Paper]
  • Uni3DETR: "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
  • Diffusion-SS3D: "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 (NYCU). [Paper][PyTorch]
  • STEMD: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 (CUHK). [Paper][Code (in construction)](https://github.com/Eaphan/STEMD)
  • V-DETR: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
  • 3DiffTection: "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]
  • PTT: "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 (UC Merced). [Paper][Code (in construction)]
  • Point-DETR3D: "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 (USTC). [Paper]
  • MixSup: "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 (CAS). [Paper][PyTorch]
  • QAF2D: "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 (Nullmax, China). [Paper]
  • ScatterFormer: "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 (The Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • MsSVT++: "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 (Beijing Institute of Technology). [Paper][PyTorch]

[Back to Overview]

Multi-Modal Detection

  • OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
  • MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
  • FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
  • MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
  • MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper][JAX][Hugging Face]
  • X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
  • simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
  • ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
  • YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
  • OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
  • ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]
  • DQ-DETR: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (International Digital Economy Academy (IDEA)). [Paper][Code (in construction)]
  • F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 (Google). [Paper][Website]
  • OV-3DET: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 (Fudan + Microsoft). [Paper]
  • OmniLabel: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 (NEC). [Paper][GitHub][Website]
  • MM-OVOD: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 (Oxford). [Paper][Code (in construction)][Website]
  • CoDA: "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 (HKUST). [Paper][PyTorch][Website]
  • ContextDET: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
  • Object2Scene: "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 (Shanghai AI Lab). [Paper]

[Back to Overview]

HOI Detection

  • HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
  • HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
  • MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
  • SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
  • CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
  • DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
  • STIP: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
  • DOQ: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
  • UPT: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
  • CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
  • GEN-VLKT: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
  • Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
  • RLIP: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
  • TUTOR: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper]
  • ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ?: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (KU Leuven). [Paper]
  • HOICLIP: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 (ShanghaiTech). [Paper][Code (in construction)]
  • ViPLO: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 (mAy-I, Korea). [Paper][PyTorch]
  • OpenCat: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 (Renmin University of China). [Paper]
  • CQL: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 (Megvii). [Paper][Code (in construction)]
  • RmLR: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 (Southeast University, China). [Paper]
  • PViC: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 (Microsoft). [Paper][PyTorch]
  • AGER: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
  • RLIPv2: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • EgoPCA: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Website]
  • UniHOI: "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 (Southeast University). [Paper][Code (in construction)]
  • LogicHOI: "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 (University of Technology Sydney). [Paper][Code (in construction)]
  • ?: "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 (KU Leuven). [Paper]
  • DP-HOI: "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 (South China University of Technology). [Paper][Code (in construction)]
  • HOI-Ref: "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 (University of Bristol, UK). [Paper][PyTorch][Website]

[Back to Overview]

Salient Object Detection

  • VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
  • ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
  • SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
  • SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
  • GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
  • TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
  • AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
  • TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
  • DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
  • GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai University). [Paper]
  • SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
  • DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
  • MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]
  • PSFormer: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
  • RMFormer: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 (Dalian University of Technology). [Paper]

[Back to Overview]

Other Detection Tasks

  • X-supervised:
    • LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
    • Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
    • TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
    • TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • Semi-DETR: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 (Baidu). [Paper][Paddle (in construction)][PyTorch (JCZ404)]
    • MoTok: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 (Toyota). [Paper][PyTorch][Website]
    • CutLER: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • ISA-TS: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 (Google). [Paper]
    • MOST: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
    • GenPromp: "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 (CAS). [Paper][PyTorch]
    • SAT: "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 (USTC). [Paper][PyTorch]
    • ALWOD: "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 (Rutgers). [Paper][Code (in construction)]
    • HASSOD: "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 (UIUC). [Paper][PyTorch][Website]
    • SeqCo-DETR: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 (SenseTime). [Paper]
    • R-MAE: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 (Meta). [Paper]
    • SimDETR: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 (Samsung). [Paper]
    • U2Seg: "Unsupervised Universal Image Segmentation", arXiv, 2023 (Berkeley). [Paper][PyTorch]
    • CuVLER: "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 (Technion - Israel Institute of Technology). [Paper][PyTorch]
    • Sparse-Semi-DETR: "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 (DFKI, Germany). [Paper]
  • X-Shot Object Detection:
    • AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
    • Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU, Singapore). [Paper][PyTorch]
    • CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
    • FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
    • SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
    • TENET: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (ANU). [Paper][PyTorch]
    • Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
    • Incremental-DETR: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
    • FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 (Samsung). [Paper]
    • Meta-ZSDETR: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 (Fudan). [Paper]
    • ?: "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 (CMU). [Paper]
  • Open-World/Vocabulary:
    • OW-DETR: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
    • DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • RegionCLIP: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
    • OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
    • VL-PLM: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (Rutgers University). [Paper][PyTorch][Website]
    • DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
    • WWbL: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch][Demo]
    • P3OVD: "P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (Sun Yat-sen University). [Paper]
    • Open-World-DETR: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (NUS). [Paper]
    • BARON: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch]
    • CapDet: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • CORA: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 (CUHK). [Paper][PyTorch]
    • UniDetector: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 (Tsinghua University). [Paper][PyTorch]
    • DetCLIPv2: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 (Huawei). [Paper]
    • RO-ViT: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 (Google). [Paper]
    • CAT: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 (Northeast University, China). [Paper][PyTorch]
    • CondHead: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 (Sichuan University). [Paper]
    • OADP: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • OVAD: "Open-vocabulary Attribute Detection", CVPR, 2023 (University of Freiburg, Germany). [Paper][Website]
    • OvarNet: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 (Xiaohongshu). [Paper][Website][PyTorch]
    • ALLOW: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • PROB: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 (Stanford). [Paper][PyTorch][Website]
    • RandBox: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
    • Cascade-DETR: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 (ETHZ + HKUST). [Paper][PyTorch]
    • EdaDet: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 (ShanghaiTech). [Paper][Website]
    • V3Det: "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 (Shanghai AI Lab). [Paper][GitHub][Website]
    • CoDet: "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
    • DAMEX: "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 (Georgia Tech). [Paper][Code (in construction)]
    • OWL-ST: "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 (DeepMind). [Paper]
    • MQ-Det: "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 (Tencent). [Paper][PyTorch]
    • Grounding-DINO: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 (IDEA). [Paper]
    • GridCLIP: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 (Queen Mary University of London). [Paper]
    • ?: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 (DeepMind). [Paper]
    • PCL: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 (Kakao). [Paper]
    • Prompt-OVD: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 (NAVER). [Paper]
    • LOWA: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 (Mineral, California). [Paper]
    • SGDN: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 (Monash University). [Paper]
    • SAS-Det: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 (NEC). [Paper]
    • DE-ViT: "Detect Every Thing with Few Examples", arXiv, 2023 (Rutgers). [Paper][PyTorch]
    • CLIPSelf: "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • DST-Det: "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • DITO: "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 (DeepMind). [Paper]
    • RegionSpot: "Recognize Any Regions", arXiv, 2023 (University of Surrey, England). [Paper][Code (in construction)]
    • DECOLA: "Language-conditioned Detection Transformer", arXiv, 2023 (UT Austin). [Paper][PyTorch]
    • PLAC: "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 (Kakao). [Paper]
    • FOMO: "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 (Stanford). [Paper][Website]
    • LP-OVOD: "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 (VinAI, Vietnam). [Paper]
    • ProxyDet: "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 (NAVER). [Paper]
    • WSOVOD: "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 (Xiamen University). [Paper][Code (in construction)]
    • CLIM: "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 (NTU, Singapore). [Paper][PyTorch]
    • SS-OWFormer: "Semi-supervised Open-World Object Detection", AAAI, 2024 (MBZUAI). [Paper][PyTorch]
    • DVDet: "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 (NTU, Singapore). [Paper]
    • GenerateU: "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 (Monash University). [Paper][PyTorch]
    • DetCLIPv3: "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 (Huawei). [Paper]
    • RALF: "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 (Korea University). [Paper][Code (in construction)]
    • SHiNe: "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 (NAVER). [Paper]
    • MM-Grounding-DINO: "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • YOLO-World: "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 (Tencent). [Paper][Code (in construction)]
    • T-Rex2: "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 (IDEA). [Paper][PyTorch][Website]
    • Grounding-DINO-1.5: "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 (IDEA). [Paper][Code]
  • Pedestrian Detection:
    • PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
    • ?: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (ICL). [Paper]
    • Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
    • VLPD: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 (University of Science and Technology Beijing). [Paper][PyTorch]
  • Lane Detection:
    • LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
    • LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
    • TLC: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
    • PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
    • MHVA: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (Beihang University). [Paper]
    • PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
    • CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (NullMax, China). [Paper]
    • LATR: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • O2SFormer: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 (Southeast University, China). [Paper][PyTorch]
    • Lane2Seq: "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 (Southeast University, China). [Paper]
  • Object Localization:
    • TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
    • LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
    • ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
    • SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
    • CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
    • CoW: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 (Columbia). [Paper][PyTorch][Website]
    • ESC: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 (UCSC). [Paper]
  • Relation Detection:
    • PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
    • PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
    • TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
    • RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
    • UniVRD: "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 (Google). [Paper][Code (in construction)]
    • RECODE: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 (Zhejiang University). [Paper]
    • SG-ViT: "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 (DeepMind). [Paper]
  • Anomaly Detection:
    • VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
    • InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
    • AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
    • WinCLIP: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 (Amazon). [Paper]
    • M3DM: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 (Tencent). [Paper][PyTorch]
  • Cross-Domain:
    • SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
    • OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
    • SSTA: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
    • DETR-GA: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 (Beihang University). [Paper]
    • DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 (NTU, Singapore). [Paper]
    • ?: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 (EPFL). [Paper][PyTorch]
    • PM-DETR: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 (Peking). [Paper]
  • Co-Salient Object Detection:
    • CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
  • Oriented Object Detection:
    • O2DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
    • AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • ARS-DETR: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • RHINO: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 (SI Analytics). [Paper]
  • Multiview Detection:
    • MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
  • Polygon Detection:
    • ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
  • Drone-view:
    • TPH: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
    • TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Infrared:
    • ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
    • MiPa: "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 (ETS Montreal). [Paper][Code (in construction)]
  • Text Detection:
    • SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
    • TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
    • TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
    • oCLIP: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (ByteDance). [Paper]
    • TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
    • ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
    • DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
    • ATTR: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (Fudan). [Paper]
    • DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 (JD). [Paper][PyTorch]
    • TCM: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • DeepSolo: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 (JD). [Paper][PyTorch]
    • ESTextSpotter: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 (South China University of Technology). [Paper][PyTorch]
    • PBFormer: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 (Huawei). [Paper]
    • DeepSolo++: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 (JD). [Paper][PyTorch]
    • FastTCM: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • SRFormer: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • TGA: "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 (Microsoft). [Paper]
    • SwinTextSpotter-v2: "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 (South China University of Technology). [Paper]
  • Change Detection:
    • ChangeFormer: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (JHU). [Paper][PyTorch]
    • IDET: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (Civil Aviation University of China). [Paper]
  • Edge Detection:
    • EDTER: "EDTER: Edge Detection with Transformer", CVPR, 2022 (Beijing Jiaotong University). [Paper][Code (in construction)]
    • HEAT: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (Simon Fraser). [Paper][PyTorch][Website]
  • Person Search:
    • COAT: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (Kitware). [Paper][PyTorch]
    • PSTR: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (Tianjin University). [Paper][PyTorch]
  • Manipulation Detection:
    • ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
  • Mirror Detection:
    • SATNet: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • Shadow Detection:
    • SCOTCH-SODA: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 (University of Cambridge). [Paper]
  • Keypoint Detection:
    • SalViT: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 (ANU). [Paper]
  • Continual Learning:
    • CL-DETR: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 (MPI). [Paper]
  • Visual Query Detection/Localization:
    • CocoFormer: "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 (Meta). [Paper][PyTorch]
    • VQLoC: "Single-Stage Visual Query Localization in Egocentric Videos", NeurIPS, 2023 (UT Austin). [Paper][PyTorch][Website]
  • Task-Driven Object Detection:
    • CoTDet: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 (ShanghaiTech). [Paper]
  • Diffusion:
    • DiffusionEngine: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
    • TADP: "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 (CalTech). [Paper][Website]
    • InstaGen: "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 (Meituan). [Paper][Code (in construction)][Website]

[Back to Overview]

Segmentation

Semantic Segmentation

  • SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
  • TrSeg: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
  • CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
  • UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
  • FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
  • SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 (UIUC + Facebook). [Paper][Website]
  • OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (IISER, India). [Paper]
  • TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
  • Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
  • VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
  • SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
  • TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
  • GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
  • SegDeformer: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (Shanghai Jiao Tong + Huawei). [Paper][PyTorch]
  • PAUMER: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (Idiap, Switzerland). [Paper]
  • SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper][PyTorch]
  • RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
  • SegNeXt: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (Tsinghua University). [Paper]
  • Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
  • DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
  • FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
  • StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
  • HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
  • SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]
  • IncepFormer: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • SeaFormer: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (Tencent). [Paper]
  • PPL: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 (Yonsei). [Paper]
  • AFF: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 (Apple). [Paper]
  • CTS: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 (Eindhoven University of Technology, Netherlands). [Paper][PyTorch][Website]
  • TSG: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper]
  • FASeg: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper][PyTorch]
  • HFD-BSD: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 (HKUST). [Paper]
  • DToP: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 (South China University of Technology + The University of Adelaide). [Paper]
  • FreeMask: "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 (HKU). [Paper][PyTorch]
  • AiluRus: "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 (Huawei). [Paper][Code (in construction)]
  • SegViTv2: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
  • DoViT: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
  • CFT: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 (Huawei). [Paper]
  • ICPC: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
  • Superpixel-Association: "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 (Google). [Paper]
  • PlainSeg: "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
  • SCTNet: "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 (Meituan). [Paper][Code (in construction)]
  • ?: "Region-Based Representations Revisited", arXiv, 2024 (UIUC). [Paper]

[Back to Overview]

Depth Estimation

  • DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch]
  • TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
  • ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
  • MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
  • GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
  • SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
  • DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
  • MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
  • Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • ?: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (IIT Madras). [Paper]
  • GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
  • DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • SideRT: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
  • MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
  • Depthformer: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
  • TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]
  • ObjCAViT: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (ICL). [Paper]
  • ROIFormer: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (OPPO). [Paper]
  • TST: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 (KAIST). [Paper]
  • CompletionFormer: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 (University of Bologna, Italy). [Paper][PyTorch][Website]
  • Lite-Mono: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 (University of Twente, Netherlands). [Paper][PyTorch]
  • EGformer: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 (SNU). [Paper]
  • ZeroDepth: "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 (Toyota). [Paper][PyTorch][Website]
  • Win-Win: "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 (NAVER). [Paper]
  • ?: "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 (Southern University of Science and Technology). [Paper]
  • DeCoTR: "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 (Qualcomm). [Paper]
  • Depth-Anything: "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 (TikTok). [Paper][PyTorch][Website]

[Back to Overview]

Object Segmentation

  • SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
  • Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • Trans2Seg: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
  • SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
  • CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]
  • ?: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (Aalto University, Finland). [Paper]
  • MSMFormer: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (UT Dallas). [Paper][PyTorch]

[Back to Overview]

Other Segmentation Tasks

  • Any-X/Every-X:
    • SAM: "Segment Anything", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
    • SEEM: "Segment Everything Everywhere All at Once", NeurIPS, 2023 (Microsoft). [Paper][PyTorch]
    • HQ-SAM: "Segment Anything in High Quality", NeurIPS, 2023 (ETHZ). [Paper][PyTorch]
    • ?: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 (UCSB). [Paper]
    • ?: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 (HKUST). [Paper]
    • SAD: "SAD: Segment Any RGBD", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • ?: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 (Kyung Hee University, Korea). [Paper]
    • ?: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 (Kyung Hee University). [Paper]
    • FastSAM: "Fast Segment Anything", arXiv, 2023 (CAS). [Paper][PyTorch]
    • MobileSAM: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 (Kyung Hee University). [Paper][PyTorch]
    • Semantic-SAM: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
    • Follow-Anything: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 (MIT). [Paper]
    • DINOv: "Visual In-Context Prompting", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
    • Stable-SAM: "Stable Segment Anything Model", arXiv, 2023 (Kuaishou). [Paper][Code (in construction)]
    • EfficientSAM: "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 (Meta). [Paper]
    • EdgeSAM: "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • RepViT-SAM: "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
    • SlimSAM: "0.1% Data Makes Segment Anything Slim", arXiv, 2023 (NUS). [Paper][PyTorch]
    • FIND: "Interfacing Foundation Models' Embeddings", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
    • SqueezeSAM: "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 (Meta). [Paper]
    • TAP: "Tokenize Anything via Prompting", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • MobileSAMv2: "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 (Kyung Hee University). [Paper][PyTorch]
    • TinySAM: "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 (Huawei). [Paper][PyTorch]
    • Conv-LoRA: "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 (Amazon). [Paper][PyTorch]
    • PerSAM: "Personalize Segment Anything Model with One Shot", ICLR, 2024 (CUHK). [Paper][PyTorch]
    • VRP-SAM: "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 (Baidu). [Paper]
    • UAD: "Unsegment Anything by Simulating Deformation", CVPR, 2024 (NUS). [Paper][PyTorch]
    • ASAM: "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 (vivo). [Paper][PyTorch][Website]
    • PTQ4SAM: "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 (Beihang). [Paper][PyTorch]
    • BA-SAM: "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 (Shanghai Jiao Tong). [Paper]
    • OV-SAM: "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch][Website]
    • SSPrompt: "Learning to Prompt Segment Anything Models", arXiv, 2024 (NTU, Singapore). [Paper]
    • RAP-SAM: "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • PA-SAM: "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 (OPPO). [Paper][PyTorch]
    • Grounded-SAM: "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 (IDEA). [Paper][PyTorch]
    • EfficientViT-SAM: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 (NVIDIA). [Paper][PyTorch]
    • DeiSAM: "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 (TU Darmstadt, Germany). [Paper]
    • CAT-SAM: "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch (in construction)][Website]
    • BLO-SAM: "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 (UCSD). [Paper][PyTorch]
    • P2SAM: "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 (UMich). [Paper]
    • RA: "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 (UIUC). [Paper]
  • Vision-Language:
    • LSeg: "Language-driven Semantic Segmentation", ICLR, 2022 (Cornell). [Paper][PyTorch]
    • ZegFormer: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • CLIPSeg: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (University of Göttingen, Germany). [Paper][PyTorch]
    • DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (Tsinghua University). [Paper][PyTorch][Website]
    • GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
    • MaskCLIP: "Extract Free Dense Labels from CLIP", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • ViewCo: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • LMSeg: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 (Alibaba). [Paper]
    • VL-Fields: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 (University of Edinburgh, UK). [Paper][Website]
    • X-Decoder: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • IFSeg: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 (KAIST). [Paper][PyTorch]
    • SAZS: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
    • CLIP-S4: "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 (Bosch). [Paper]
    • D2Zero: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
    • PADing: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • LD-ZNet: "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 (Amazon). [Paper][PyTorch][Website]
    • MAFT: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 (Picsart). [Paper][PyTorch]
    • PGSeg: "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
    • MESS: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (IBM). [Paper][PyTorch][Website]
    • ZegOT: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 (KAIST). [Paper]
    • SimCon: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 (Amazon). [Paper]
    • DiffusionSeg: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • ASCG: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 (ByteDance). [Paper]
    • ClsCLIP: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 (Eastern Institute for Advanced Study, China). [Paper]
    • CLIPTeacher: "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 (Nagoya University). [Paper]
    • SAM-CLIP: "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 (Apple). [Paper]
    • GEM: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 (University of Bonn, Germany). [Paper][PyTorch]
    • CaR: "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 (Google). [Paper][Code (in construction)][Website]
    • SPT: "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 (Beijing University of Posts and Telecommunications). [Paper][PyTorch (in construction)]
    • FMbSeg: "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 (Toyota). [Paper]
  • Open-World/Vocabulary:
    • ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
    • OVSS: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • OpenSeg: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 (Google). [Paper]
    • Fusioner: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (Shanghai Jiao Tong University). [Paper][Website]
    • OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • ZegCLIP: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
    • TCL: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 (Kakao). [Paper][PyTorch]
    • ODISE: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • Mask-free-OVIS: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 (Salesforce). [Paper][PyTorch (in construction)]
    • FreeSeg: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 (ByteDance). [Paper]
    • SAN: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • OVSegmentor: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 (Fudan University). [Paper][PyTorch][Website]
    • PACL: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 (Meta). [Paper]
    • MaskCLIP: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 (UCSD). [Paper][Website]
    • SegCLIP: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 (JD). [Paper][PyTorch]
    • SWORD: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 (HKU). [Paper]
    • Grounded-Diffusion: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
    • SegPrompt: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
    • CGG: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 (SenseTime). [Paper][PyTorch][Website]
    • OpenSeeD: "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 (IDEA). [Paper][PyTorch]
    • OPSNet: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 (HKU). [Paper]
    • GKC: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 (ByteDance). [Paper]
    • ZeroSeg: "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 (Meta). [Paper]
    • MasQCLIP: "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 (UCSD). [Paper][PyTorch][Website]
    • VLPart: "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 (HKU). [Paper][PyTorch]
    • DeOP: "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 (Meituan). [Paper][PyTorch]
    • MixReorg: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • OV-PARTS: "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • HIPIE: "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
    • ?: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 (Shanghai Jiao Tong). [Paper]
    • FC-CLIP: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
    • WLSegNet: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 (IIT, New Delhi). [Paper]
    • CAT-Seg: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Korea University). [Paper][PyTorch][Website]
    • MVP-SEG: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Xiaohongshu, China). [Paper]
    • TagCLIP: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 (CUHK). [Paper]
    • OVDiff: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 (Oxford). [Paper][Website]
    • UOVN: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 (Monash University). [Paper]
    • CLIP-DIY: "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 (Warsaw University of Technology, Poland). [Paper]
    • Entity: "Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion", arXiv, 2023 (Harbin Engineering University). [Paper][PyTorch]
    • OSM: "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
    • SED: "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Tianjin). [Paper][PyTorch (in construction)]
    • PnP-OVSS: "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 (NTU, Singapore). [Paper]
    • SCLIP: "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 (JHU). [Paper]
    • GranSAM: "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 (UC Riverside). [Paper]
    • Sambor: "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 (Huawei). [Paper][Code (in construction)]
    • SCAN: "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
    • Self-Seg: "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 (UvA). [Paper]
    • OpenSD: "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 (OPPO). [Paper]
    • CLIP-DINOiser: "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 (Warsaw University of Technology, Poland). [Paper][PyTorch]
    • TagAlign: "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 (Ant Group). [Paper][PyTorch][Website]
    • OVFoodSeg: "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 (Singapore Management University (SMU)). [Paper]
    • FreeDA: "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 (University of Modena and Reggio Emilia (UniMoRe), Italy). [Paper][Website]
    • S-Seg: "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 (Oxford). [Paper][Code (in construction)]
    • PosSAM: "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 (Qualcomm). [Paper][Code (in construction)][Website]
  • LLM-based:
    • LISA: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • PixelLM: "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 (ByteDance). [Paper][Code (in construction)][Website]
    • PixelLLM: "Pixel Aligned Language Models", arXiv, 2023 (Google). [Paper][Website]
    • GSVA: "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
    • LISA++: "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 (CUHK). [Paper]
    • GROUNDHOG: "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 (Amazon). [Paper][Website]
    • PSALM: "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • LLaVASeg: "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 (vivo). [Paper]
    • LaSagnA: "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 (Meituan). [Paper][PyTorch]
  • Universal Segmentation:
    • K-Net: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 (NTU, Singapore). [Paper][PyTorch]
    • Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • MP-Former: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 (IDEA). [Paper][Code (in construction)]
    • OneFormer: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 (Oregon). [Paper][PyTorch][Website]
    • UNINEXT: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • ClustSeg: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 (Rochester Institute of Technology). [Paper]
    • DaTaSeg: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 (Google). [Paper]
    • DFormer: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 (Tianjin University). [Paper][Code (in construction)]
    • ?: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 (OMRON SINIC X, Japan). [Paper]
    • Mask2Anomaly: "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 (Politecnico di Torino, Italy). [Paper]
    • SegGen: "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 (Adobe). [Paper][Code (in construction)][Website]
    • PolyMaX: "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 (Google). [Paper][Tensorflow]
    • PEM: "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • OMG-Seg: "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch][Website]
    • Uni-OVSeg: "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 (University of Sydney). [Paper][PyTorch (in construction)]
    • PRO-SCALE: "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 (NEC). [Paper]
  • Multi-Modal:
    • UCTNet: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (Lehigh University, Pennsylvania). [Paper]
    • CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • DeLiVER: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch][Website]
    • DFormer: "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 (Nankai University). [Paper][PyTorch]
    • Sigma: "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 (CMU). [Paper][PyTorch]
  • Panoptic Segmentation:
    • MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
    • SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
    • VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
    • Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
    • Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking). [Paper][PyTorch]
    • CoMFormer: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 (Sorbonne Université, France). [Paper]
    • YOSO: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 (Xiamen University). [Paper][PyTorch]
    • Pix2Seq-D: "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 (DeepMind). [Paper][Tensorflow2]
    • DeepDPS: "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
    • ReMaX: "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 (Google). [Paper][Tensorflow2]
    • PanopticPartFormer++: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 (Peking). [Paper][PyTorch]
    • MaXTron: "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 (ByteDance). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 (NAVER). [Paper][Code (in construction)]
  • Instance Segmentation:
    • ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
    • Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
    • BoundaryFormer: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
    • PPT: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
    • TOIST: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
    • MAL: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
    • FastInst: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 (Alibaba). [Paper][PyTorch]
    • SP: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
    • X-Paste: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 (USTC). [Paper][PyTorch]
    • DynaMITe: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 (RWTH Aachen University, Germany). [Paper][PyTorch][Website]
    • Mask-Frozen-DETR: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 (Microsoft). [Paper]
  • Optical Flow:
    • CRAFT: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (A*STAR, Singapore). [Paper][PyTorch]
    • KPA-Flow: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (Megvii). [Paper][PyTorch (in construction)]
    • GMFlowNet: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (Rutgers). [Paper][PyTorch]
    • FlowFormer: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (CUHK). [Paper][Website]
    • TransFlow: "TransFlow: Transformer as Flow Learner", CVPR, 2023 (Rochester Institute of Technology). [Paper]
    • FlowFormer++: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 (CUHK). [Paper]
    • FlowFormer: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 (CUHK). [Paper]
  • Panoramic Semantic Segmentation:
    • Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • SGAT4PASS: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 (Tencent). [Paper][Code (in construction)]
  • X-Shot:
    • CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
    • CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
    • VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
    • DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
    • AAFormer: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (USTC). [Paper]
    • IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
    • TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
    • MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
    • MuHS: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (Zhejiang University). [Paper]
    • VTM: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (KAIST). [Paper][PyTorch]
    • SegGPT: "SegGPT: Segmenting Everything In Context", ICCV, 2023 (BAAI). [Paper][PyTorch]
    • AMFormer: "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 (USTC). [Paper][Code (in construction)]
    • RefT: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • ?: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • SPINO: "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 (University of Freiburg, Germany). [Paper][Website]
    • ?: "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 (UBC). [Paper]
    • RefLDM-Seg: "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 (NTU, Singapore). [Paper][Code (in construction)][Website]
    • Chameleon: "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 (KAIST). [Paper]
  • X-Supervised:
    • MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)]
    • AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
    • CLIMS: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 (Shenzhen University). [Paper][PyTorch]
    • ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
    • SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
    • ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][Tensorflow]
    • TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • TransCAM: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
    • WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
    • MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
    • eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
    • TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
    • SemFormer: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Shenzhen University). [Paper][PyTorch]
    • CLIP-ES: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • ToCo: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 (JD). [Paper][PyTorch]
    • DPF: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
    • SemiCVT: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper]
    • AttentionShift: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 (CAS). [Paper]
    • MMCST: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 (The University of Western Australia). [Paper]
    • SimSeg: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • SIM: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch (in construction)]
    • Point2Mask: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
    • BoxSnake: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 (Tencent). [Paper]
    • QA-CLIMS: "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 (Shenzhen University). [Paper][Code (in construction)]
    • CoCu: "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
    • APro: "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 (Zhejiang). [Paper][PyTorch][Website]
    • PaintSeg: "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 (Microsoft). [Paper]
    • SmooSeg: "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
    • VLOSS: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • MECPformer: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Tongji University). [Paper][Code (in construction)]
    • WeakTr: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • SAM-WSSS: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 (ANU). [Paper]
    • ?: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University + Nankai University). [Paper]
    • AReAM: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University). [Paper]
    • SEPL: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 (OSU). [Paper][Code (in construction)]
    • MIMIC: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 (UW). [Paper][PyTorch]
    • POLE: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 (ETS Montreal, Canada). [Paper][PyTorch]
    • GD: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 (Meta). [Paper]
    • MCTformer+: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (The University of Western Australia). [Paper][PyTorch]
    • MMC: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 (University of Surrey, UK). [Paper]
    • CRATE: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 (Berkeley). [Paper][PyTorch]
    • ?: "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 (Singapore Management University). [Paper]
    • MCC: "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang Lab, China). [Paper][PyTorch]
    • CRATE: "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
    • SAMS: "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang). [Paper]
    • SemiVL: "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 (Google). [Paper][PyTorch]
    • Self-reinforcement: "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 (Zhejiang Lab). [Paper][PyTorch]
    • FeatUp: "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 (MIT). [Paper]
    • Zip-Your-CLIP: "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 (ShanghaiTech). [Paper][PyTorch]
    • SeCo: "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 (Fudan). [Paper][Code (in construction)]
    • AllSpark: "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 (HKUST). [Paper][PyTorch]
    • CPAL: "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 (Monash University). [Paper][Code (in construction)]
    • DuPL: "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 (Shanghai University). [Paper][PyTorch]
    • CoDe: "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 (NTU). [Paper][Code (in construction)]
    • SemPLeS: "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 (NVIDIA). [Paper]
    • WeakSAM: "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • CoSA: "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 (Lancaster University, UK). [Paper][Code (in construction)]
    • CoBra: "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 (Yonsei). [Paper]
  • Cross-Domain:
    • DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
    • HGFormer: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 (Wuhan University). [Paper][Code (in construction)]
    • UniDAformer: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 (NTU, Singapore). [Paper]
    • MIC: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
    • CDAC: "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 (Boston). [Paper][PyTorch]
    • EDAPS: "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 (ETHZ). [Paper][PyTorch]
    • PTDiffSeg: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • Rein: "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 (USTC). [Paper]
  • Continual Learning:
    • TISS: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (Tencent). [Paper]
    • Incrementer: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 (University of Electronic Science and Technology of China). [Paper]
  • Crack Detection:
    • CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • Camouflaged/Concealed Object:
    • UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
    • COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
    • OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • FSPNet: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 (Sichuan Changhong Electric, China). [Paper][PyTorch][Website]
    • MFG: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 (Tsinghua). [Paper][Code (in construction)]
  • Background Separation:
    • TransBlast: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
  • Scene Understanding:
    • BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
    • Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
  • 3D Segmentation:
    • Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
    • M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
    • 3DSeg: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (The University of Tokyo). [Paper]
    • Analogical-Network: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (CMU). [Paper]
    • VoxFormer: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
    • GrowSP: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • RangeViT: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 (Valeo.ai, France). [Paper][Code (in construction)]
    • MeshFormer: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 (University of Macau). [Paper]
    • MSeg3D: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • SGVF-SVFE: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 (ShanghaiTech). [Paper]
    • SVQNet: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 (Tsinghua). [Paper]
    • MAF-Transformer: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • UniSeg: "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • MIT: "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 (NTU). [Paper]
    • CVSformer: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 (Tianjin University). [Paper]
    • SPT: "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 (Univ Gustave Eiffel, France). [Paper][PyTorch]
    • SATR: "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 (KAUST). [Paper][PyTorch][Website]
    • 3D-OWIS: "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 (MBZUAI). [Paper]
    • SA3D: "Segment Anything in 3D with NeRFs", NeurIPS, 2023 (SJTU). [Paper][PyTorch][Website]
    • Contrastive-Lift: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 (Oxford). [Paper][PyTorch][Website]
    • P3Former: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • UnScene3D: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 (TUM). [Paper][Website]
    • CNS: "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 (HKU). [Paper][Code (in construction)]
    • DCTNet: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 (University of Waterloo, Canada). [Paper]
    • Symphonies: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 (Horizon Robotics). [Paper][PyTorch]
    • TFS3D: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • CIP-WPIS: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 (Australian National University). [Paper]
    • ?: "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 (ShanghaiTech). [Paper]
    • CSF: "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 (NTU, Singapore). [Paper]
    • ?: "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 (Oxford). [Paper]
    • OneFormer3D: "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 (Samsung). [Paper]
    • SAGA: "Segment Any 3D Gaussians", arXiv, 2023 (SJTU). [Paper][Code (in construction)][Website]
    • SANeRF-HQ: "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 (HKUST). [Paper][Code (in construction)][Website]
    • SAM-Graph: "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 (Zhejiang). [Paper][Code (in construction)][Website]
    • SAI3D: "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 (Peking). [Paper]
    • COSeg: "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 (ETHZ). [Paper][Code (in construction)]
    • CSC: "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 (East China Normal University). [Paper][Code (in construction)]
  • Multi-Task:
    • InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
    • MTFormer: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (CUHK). [Paper]
    • MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
    • DeMT: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 (Wuhan University). [Paper][PyTorch]
    • TaskPrompter: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (HKUST). [Paper][PyTorch (in construction)]
    • AiT: "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 (Microsoft). [Paper][PyTorch]
    • InvPT++: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 (HKUST). [Paper]
    • DeMTG: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 (Wuhan University). [Paper][PyTorch]
    • SRT: "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 (UCLA). [Paper]
    • MLoRE: "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 (vivo). [Paper]
    • ODIN: "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 (CMU). [Paper][Code (in construction)][Website]
    • LiFT: "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 (Maryland). [Paper]
  • Forecasting:
    • DiffAttn: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (UIUC). [Paper][Code (in construction)]
  • LiDAR:
    • HelixNet: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (CNRS, France). [Paper][Website][PyTorch]
    • Gaussian-Radar-Transformer: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (University of Bonn, Germany). [Paper]
    • MOST: "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 (CMU). [Paper][PyTorch]
    • 4D-Former: "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 (Waabi, Canada). [Paper][Website]
    • MASK4D: "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 (RWTH Aachen University, Germany). [Paper]
  • Co-Segmentation:
    • ReCo: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 (Oxford). [Paper][PyTorch][Website]
    • DINO-ViT-feature: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
    • LCCo: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 (Beijing Institute of Technology). [Paper]
  • Top-Down Semantic Segmentation:
    • Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
  • Surface Normal:
    • Normal-Transformer: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (University of Technology Sydney). [Paper]
  • Applications:
    • FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]
  • Diffusion:
    • VPD: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 (Tsinghua University). [Paper][PyTorch][Website]
    • Dataset-Diffusion: "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 (VinAI, Vietnam). [Paper][PyTorch][Website]
    • SegRefiner: "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
    • DatasetDM: "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 (Zhejiang). [Paper][PyTorch][Website]
    • DiffSeg: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper]
    • DiffSegmenter: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 (Beihang University). [Paper]
    • ?: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 (Tsinghua). [Paper]
    • LDMSeg: "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 (Segments.ai, Belgium). [Paper][PyTorch]
  • Low-Level Structure Segmentation:
    • EVP: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • EVP: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • EmerDiff: "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 (NVIDIA). [Paper][Website]
  • Zero-Guidance Segmentation:
    • zero-guide-seg: "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 (VISTEC, Thailand). [Paper][Website]
  • Part Segmentation:
    • OPS: "Towards Open-World Segmentation of Parts", CVPR, 2023 (Adobe). [Paper][PyTorch]
    • PartDistillation: "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 (Meta). [Paper]
  • Entity Segmentation:
    • AIMS: "AIMS: All-Inclusive Multi-Level Segmentation", NeurIPS, 2023 (UC Merced). [Paper][PyTorch]
    • SOHES: "SOHES: Self-supervised Open-world Hierarchical Entity Segmentation", ICLR, 2024 (Adobe). [Paper][Website]
  • Evaluation:
    • ?: "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 (UCF). [Paper][PyTorch]
  • Interactive Segmentation:
    • InterFormer: "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 (Xiamen University). [Paper][PyTorch]
    • SimpleClick: "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 (UNC). [Paper][PyTorch]
    • iCMFormer: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 (University of Twente, Netherlands). [Paper][Code (in construction)]
    • MFP: "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 (Korea University). [Paper][Code (in construction)]
    • GraCo: "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 (Peking). [Paper][Website]
  • Amodal Segmentation:
    • AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas, Arkansas). [Paper][PyTorch]
    • C2F-Seg: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 (Fudan). [Paper][Code (in construction)][Website]
    • EoRaS: "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 (Fudan). [Paper][Code (in construction)]
    • MP3D-Amodal: "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 (Oxford). [Paper][Website (in construction)]
  • Anomaly Segmentation:
    • Mask2Anomaly: "Unmasking Anomalies in Road-Scene Segmentation", ICCV, 2023 (Politecnico di Torino, Italy). [Paper][PyTorch]
  • In-Context Segmentation:
    • SEGIC: "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 (Fudan). [Paper][Code (in construction)]

[Back to Overview]

Video (High-level)

Action Recognition

  • RGB mainly:
    • Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
    • ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
    • TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)]
    • MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
    • ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
    • VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
    • TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
    • Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
    • X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
    • SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
    • RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
    • STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
    • GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
    • TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
    • VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
    • UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
    • Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
    • DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
    • MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
    • MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
    • RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
    • SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
    • MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
    • MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
    • TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
    • TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
    • MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
    • SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (Tel Aviv). [Paper][Website]
    • ST-Adapter: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (CUHK). [Paper][Code (in construction)]
    • ATA: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (Microsoft). [Paper]
    • AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
    • VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
    • Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
    • MAM2: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
    • ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
    • STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
    • PatchBlender: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (Mila). [Paper]
    • DualPath: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 (Yonsei University). [Paper][PyTorch (in construction)]
    • S-ViT: "Streaming Video Model", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • TubeViT: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 (Google). [Paper]
    • AdaMAE: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 (JHU). [Paper][PyTorch]
    • ObjectViViT: "How can objects help action recognition?", CVPR, 2023 (Google). [Paper]
    • SMViT: "Simple MViT: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (Meta). [Paper]
    • Hiera: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (Meta). [Paper][PyTorch]
    • Video-FocalNet: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
    • ATM: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 (Baidu). [Paper][Code (in construction)]
    • STA: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 (Huawei). [Paper]
    • Helping-Hands: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 (Oxford). [Paper][PyTorch]
    • SUM-L: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 (University of Delaware, Delaware). [Paper][Code (in construction)]
    • BEAR: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 (UCF). [Paper][GitHub]
    • UniFormerV2: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 (CAS). [Paper][PyTorch]
    • CAST: "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 (Kyung Hee University). [Paper][PyTorch][Website]
    • PPMA: "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 (IBM). [Paper][PyTorch]
    • SVT: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 (Meta). [Paper]
    • PLAR: "Prompt Learning for Action Recognition", arXiv, 2023 (Maryland). [Paper]
    • SFA-ViViT: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 (Google). [Paper]
    • TAdaConv: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 (NUS). [Paper][PyTorch]
    • ZeroI2V: "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 (Nanjing University). [Paper]
    • MV-Former: "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 (Meta). [Paper][PyTorch]
    • GeoDeformer: "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 (HKUST). [Paper]
    • Early-ViT: "Early Action Recognition with Action Prototypes", arXiv, 2023 (Amazon). [Paper]
    • MCA: "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 (Northeastern University). [Paper][PyTorch]
    • StructViT: "Learning Correlation Structures for Vision Transformers", CVPR, 2024 (POSTECH). [Paper]
    • VideoMamba: "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • Video-Mamba-Suite: "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
  • Depth:
    • Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjin University). [Paper]
  • Pose/Skeleton:
    • ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
    • AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
    • GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
    • GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
    • FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
    • STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
    • ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
    • HyperSA: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (University of Mannheim, Germany). [Paper]
    • STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
    • STMT: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 (CMU). [Paper][Code (in construction)]
    • SkeletonMAE: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • MAMP: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 (USTC). [Paper][PyTorch]
    • LAC: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 (INRIA). [Paper][Website]
    • SkeleTR: "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 (Amazon). [Paper]
    • PCM3: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 (Peking). [Paper][Website]
    • PoseAwareVT: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 (Amazon). [Paper][PyTorch]
    • HandFormer: "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 (NUS). [Paper][Code (in construction)][Website]
    • SkateFormer: "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 (KAIST). [Paper][Code (in construction)][Website]
  • Multi-modal:
    • MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
    • MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
    • MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
    • M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
    • VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
    • Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
    • MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
    • MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
    • 3Mformer: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 (ANU). [Paper]
    • UMT: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 (Tsinghua). [Paper]
    • ?: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 (KU Leuven). [Paper]
    • MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 (Peking University). [Paper][PyTorch][Website]
    • TIM: "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 (University of Bristol + Oxford). [Paper][PyTorch][Website]
  • Group Activity:
    • GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (SenseTime). [Paper]
    • ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]
    • GAFL: "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 (Toyota Technological Institute, Japan). [Paper]

[Back to Overview]

Action Detection/Localization

  • OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
  • ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
  • TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
  • Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
  • MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
  • UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
  • DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
  • ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
  • EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
  • TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
  • ?: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch]
  • ActionFormer: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper][PyTorch]
  • CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
  • Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
  • LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
  • HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
  • AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper]
  • CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]
  • HIT: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (NTHU). [Paper][PyTorch]
  • LART: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 (Meta). [Paper][Website]
  • TranS4mer: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 (Comcast). [Paper]
  • TTM: "Token Turing Machines", CVPR, 2023 (Google). [Paper][JAX]
  • ?: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 (NAVER). [Paper]
  • Self-DETR: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 (Sungkyunkwan University, Korea). [Paper]
  • UnLoc: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 (Google). [Paper][JAX]
  • EVAD: "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
  • MS-DETR: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 (NTU, Singapore). [Paper][PyTorch]
  • STAR: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 (Google). [Paper]
  • DiffTAD: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 (University of Surrey, UK). [Paper][PyTorch (in construction)]
  • MNA-ZBD: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 (Renmin University of China). [Paper]
  • PAT: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 (University of Surrey, UK). [Paper]
  • ViT-TAD: "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 (Nanjing University (NJU)). [Paper]
  • Cafe: "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 (POSTECH). [Paper][PyTorch][Website]
  • ?: "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 (Queen Mary, UK). [Paper]
  • SMAST: "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 (University of Virginia). [Paper]
  • OV-STAD: "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 (Nanjing University). [Paper]

[Back to Overview]

Action Prediction/Anticipation

  • AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (CAS). [Paper]
  • HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
  • ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
  • FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
  • VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (Amazon). [Paper]
  • InAViT: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (A*STAR). [Paper]
  • VPTR: "Video Prediction by Efficient Transformers", IVC, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • AFFT: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • GliTr: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (McGill University, Canada). [Paper]
  • RAFTformer: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 (Honda). [Paper]
  • AdamsFormer: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 (Honda). [Paper]
  • TemPr: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 (University of Bristol). [Paper][PyTorch][Website]
  • MAT: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
  • SwinLSTM: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 (Hainan University). [Paper][PyTorch]
  • MVP: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 (Boston). [Paper]
  • DiffAnt: "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 (Karlsruhe Institute of Technology (KIT), Germany). [Paper]
  • LALM: "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 (ETHZ). [Paper]
  • ?: "Learning from One Continuous Video Stream", arXiv, 2023 (DeepMind). [Paper]
  • ObjectPrompt: "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 (Honda). [Paper][Code (in construction)]

[Back to Overview]

Video Object Segmentation

  • GC: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (Tencent). [Paper]
  • SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
  • JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
  • AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (Zhejiang University). [Paper][PyTorch (yoxu515)][Code (in construction)]
  • TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
  • SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
  • HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
  • BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
  • DeAOT: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch]
  • AOST: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • MED-VT: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 (York University). [Paper][Website]
  • ?: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 (Shanghai Jiao Tong University (SJTU)). [Paper]
  • Isomer: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 (Dalian University of Technology). [Paper][PyTorch]
  • SimVOS: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 (CUHK). [Paper]
  • MITS: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 (Zhejiang University). [Paper][PyTorch]
  • VIPMT: "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 (MBZUAI). [Paper][Code (in construction)]
  • MOSE: "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 (NTU, Singapore). [Paper][GitHub][Website]
  • LVOS: "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 (Fudan). [Paper][GitHub][Website]
  • JointFormer: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 (Nanjing University). [Paper]
  • PanoVOS: "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 (Fudan). [Paper][Code (in construction)][Website]
  • Cutie: "Putting the Object Back into Video Object Segmentation", arXiv, 2023 (UIUC). [Paper][PyTorch][Website]
  • M3T: "M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 (UBC). [Paper]
  • ?: "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 (Oxford). [Paper]
  • DATTT: "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 (University of Macau). [Paper][PyTorch][Website]
  • LLE-VOS: "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 (USTC). [Paper]
  • Point-VOS: "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 (RWTH Aachen University, Germany). [Paper][Website]
  • MAVOS: "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
  • STMA: "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 (Harbin Institute of Technology). [Paper]
  • Flow-SAM: "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 (Oxford). [Paper][Website]
  • LVOSv2: "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 (Fudan). [Paper][GitHub][Website]

[Back to Overview]

Video Instance Segmentation

  • VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
  • IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
  • Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
  • TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
  • VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
  • SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch]
  • VITA: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (Yonsei University). [Paper][PyTorch]
  • IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
  • DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
  • InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]
  • MaskFreeVIS: "Mask-Free Video Instance Segmentation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
  • MDQE: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 (Hong Kong Polytechnic University). [Paper][PyTorch]
  • GenVIS: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 (Yonsei). [Paper][PyTorch]
  • CTVIS: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 (Zhejiang University). [Paper][PyTorch]
  • TCOVIS: "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 (Tsinghua). [Paper][Code (in construction)]
  • DVIS: "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 (Wuhan University). [Paper][PyTorch]
  • TMT-VIS: "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 (HKU). [Paper][Code (in construction)]
  • BoxVIS: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • OW-VISFormer: "Video Instance Segmentation in an Open-World", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)]
  • GRAtt-VIS: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 (LMU Munich). [Paper][Code (in construction)]
  • RefineVIS: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 (Microsoft). [Paper]
  • VideoCutLER: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 (Meta). [Paper][PyTorch]
  • NOVIS: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 (TUM). [Paper]
  • VISAGE: "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 (Yonsei). [Paper][Code (in construction)]
  • OW-VISCap: "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 (UIUC). [Paper][Website]
  • PointVIS: "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 (NVIDIA). [Paper]

[Back to Overview]

Other Video Tasks

  • Action Segmentation:
    • ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
    • Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
    • UVAST: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (Bosch). [Paper][PyTorch]
    • ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
    • CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
    • EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
    • SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
    • DXFormer: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 (Northeastern University). [Paper][Website (in construction)]
    • LTContext: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 (University of Bonn). [Paper][PyTorch]
    • DiffAct: "Diffusion Action Segmentation", ICCV, 2023 (The University of Sydney). [Paper][PyTorch]
    • TST: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 (Shanghai Tech). [Paper]
  • Video X Segmentation:
    • STT: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (Shanghai Jiao Tong). [Paper]
    • CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
    • TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
    • Video-K-Net: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 (Peking University). [Paper][PyTorch]
    • MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
    • PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
    • ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CAROQ: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 (UIUC). [Paper][PyTorch][Website]
    • TarViS: "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 (RWTH Aachen University, Germany). [Paper][PyTorch]
    • MEGA: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 (Amazon). [Paper]
    • DEVA: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 (UIUC). [Paper][PyTorch][Website]
    • Tube-Link: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
    • THE-Mask: "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 (ETHZ). [Paper][Code (in construction)]
    • MPVSS: "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 (Monash University, Australia). [Paper][Code (in construction)]
    • Video-kMaX: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 (Google). [Paper]
    • SAM-PT: "Segment Anything Meets Point Tracking", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • TTT-MAE: "Test-Time Training on Video Streams", arXiv, 2023 (Berkeley). [Paper][Website]
    • UniVS: "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 (OPPO). [Paper][PyTorch][Website]
    • DVIS++: "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 (Wuhan University). [Paper][PyTorch]
    • SAM-PD: "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 (Zhejiang). [Paper][PyTorch (in construction)]
    • OneVOS: "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 (Fudan). [Paper]
  • Video Object Detection:
    • TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
    • MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
    • PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
    • ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
    • ClipVID: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 (University of Adelaide, Australia). [Paper][Code (in construction)]
    • OCL: "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 (Amazon). [Paper]
    • CETR: "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 (Korea University). [Paper][Code (in construction)][Website]
  • Dense Video Tasks (Detection + Segmentation):
    • TDViT: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (Queen's University Belfast, UK). [Paper][Code (in construction)]
    • FAQ: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 (UCF). [Paper][PyTorch]
    • Video-OWL-ViT: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 (DeepMind). [Paper]
  • Video Retrieval:
    • SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
  • Video Hashing:
    • BTH: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (Tsinghua). [Paper][PyTorch]
  • Video-Language:
    • ActionCLIP: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
    • X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
    • STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
    • ?: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (Beijing Laboratory of Intelligent Information Technology). [Paper]
    • VLG: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (Nanjing University). [Paper]
    • InternVideo: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • PromptonomyViT: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (Tel Aviv + IBM). [Paper]
    • MUPPET: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (Meta). [Paper][Code (in construction)]
    • MovieCLIP: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (USC). [Paper][Website]
    • TranZAD: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 (UC Riverside). [Paper]
    • Text4Vis: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 (Baidu). [Paper][PyTorch]
    • AIM: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (Amazon). [Paper][PyTorch][Website]
    • ViFi-CLIP: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
    • LaViLa: "Learning Video Representations from Large Language Models", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • TVP: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 (Intel). [Paper]
    • Vita-CLIP: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
    • STAN: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 (Peking University). [Paper][PyTorch]
    • CBP-VLP: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
    • BIKE: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • HierVL: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 (Meta). [Paper][PyTorch]
    • ?: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 (University of Amsterdam). [Paper][PyTorch][Website]
    • Open-VCLIP: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 (Fudan). [Paper][PyTorch]
    • ILA: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 (Fudan). [Paper][PyTorch]
    • OV2Seg: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 (University of Amsterdam). [Paper][PyTorch]
    • DiST: "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • GAP: "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • MAXI: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 (Graz University of Technology, Austria). [Paper][PyTorch]
    • ?: "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 (Unitary, UK). [Paper]
    • MAP: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 (Tencent). [Paper]
    • OTI: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
    • Symbol-LLM: "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 (Shanghai Jiao Tong University (SJTU)). [Paper][Code (in construction)][Website]
    • OAP-AOP: "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 (NUS). [Paper][PyTorch (in construction)][Website]
    • CLIP-FSAR: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • ?: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • VicTR: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 (Google). [Paper]
    • OpenVIS: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 (Fudan). [Paper]
    • ALGO: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 (Oklahoma State University). [Paper]
    • ?: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 (Google). [Paper]
    • MSQNet: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 (University of Surrey, England). [Paper][Code (in construction)]
    • AVION: "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 (UT Austin). [Paper][PyTorch]
    • Open-VCLIP: "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 (Fudan). [Paper][PyTorch]
    • Videoprompter: "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 (UCF). [Paper]
    • MM-VID: "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 (Microsoft). [Paper][Website]
    • Chat-UniVi: "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 (Peking). [Paper]
    • Side4Video: "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
    • ALT: "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 (Huawei). [Paper]
    • MM-Narrator: "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 (Microsoft). [Paper][Website]
    • Spacewalk-18: "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 (Brown). [Paper][Website]
    • OST: "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 (Hunan University (HNU)). [Paper][Code (in construction)][Website]
    • AP-CLIP: "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 (Xi'an Jiaotong). [Paper]
    • EZ-CLIP: "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 (Østfold University College, Norway). [Paper][PyTorch (in construction)]
    • M2-CLIP: "M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 (Zhejiang). [Paper]
    • FROSTER: "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 (Baidu). [Paper][PyTorch][Website]
    • LaIAR: "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 (Xidian University). [Paper][Code (in construction)]
    • BriVIS: "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 (Peking). [Paper][PyTorch (in construction)]
    • ActionHub: "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 (Sun Yat-sen University). [Paper]
    • ZERO: "Zero Shot Open-ended Video Inference", arXiv, 2024 (A*STAR). [Paper]
    • SATA: "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 (Sun Yat-sen University). [Paper][Code (in construction)]
    • CLIP-VIS: "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
  • X-supervised Learning:
    • LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
    • SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
    • BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
    • VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
    • ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
    • VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 (Tencent). [Paper][PyTorch]
    • MAE-ST: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 (Meta). [Paper][PyTorch]
    • ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
    • MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (Stanford). [Paper][Code (in construction)][Website]
    • WeakSVR: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
    • VideoMAE-V2: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SVFormer: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 (Fudan University). [Paper][PyTorch]
    • OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 (Meta). [Paper][PyTorch]
    • MVD: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 (Fudan University). [Paper][PyTorch]
    • MME: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 (South China University of Technology). [Paper][PyTorch]
    • MGMAE: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 (Shanghai AI Lab). [Paper]
    • MGM: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 (Amazon). [Paper]
    • TimeT: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 (UvA). [Paper][PyTorch]
    • LSS: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 (Stony Brook). [Paper]
    • VITO: "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 (DeepMind). [Paper]
    • SiamMAE: "Siamese Masked Autoencoders", NeurIPS, 2023 (Stanford). [Paper][Website]
    • ViC-MAE: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 (Rice University). [Paper]
    • LSTA: "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 (Hangzhou Dianzi University). [Paper]
    • DoRA: "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 (INRIA). [Paper]
    • AMD: "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 (Nanjing University). [Paper]
    • SSL-UVOS: "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 (CUHK). [Paper]
    • NMS: "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 (Adobe). [Paper][Website]
    • VideoMAC: "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 (Nanjing University of Science and Technology). [Paper]
    • GPM: "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 (HKUST). [Paper]
    • MV2MAE: "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 (Amazon). [Paper][PyTorch]
    • V-JEPA: "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 (Meta). [Paper][PyTorch][Website]
  • Transfer Learning/Adaptation:
    • APT: "Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling", FG, 2024 (JHU). [Paper][PyTorch]
  • X-shot:
    • ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
    • ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
    • REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
    • MoLo: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 (Alibaba). [Paper][Code (in construction)]
    • MA-CLIP: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 (Zhejiang). [Paper]
    • SA-CT: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 (Fudan). [Paper]
    • CapFSAR: "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 (Alibaba). [Paper]
  • Multi-Task:
    • EgoPack: "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 (Politecnico di Torino, Italy). [Paper][PyTorch (in construction)][Website]
  • Anomaly Detection:
    • CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
    • ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
    • SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
    • ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
    • CLIP-TSA: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 (University of Arkansas). [Paper]
    • ?: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 (Konica Minolta, Japan). [Paper]
    • TPWNG: "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 (Xidian University). [Paper]
  • Relation Detection:
    • VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
    • VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
    • VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
    • RePro: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
  • Saliency Prediction:
    • STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
    • UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
    • DMT: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 (Northwestern Polytechnical University). [Paper][PyTorch]
    • CASP-Net: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 (Northwestern Polytechnical University). [Paper]
  • Video Inpainting Detection:
    • FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
  • Driver Activity:
    • TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
    • ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
  • Video Alignment:
    • DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
  • Sport-related:
    • Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
  • Action Counting:
    • TransRAC: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (ShanghaiTech). [Paper][PyTorch][Website]
    • PoseRAC: "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 (Peking University). [Paper][PyTorch]
  • Action Quality Assessment:
    • ?: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (Baidu). [Paper]
    • ?: "Action Quality Assessment using Transformers", arXiv, 2022 (USC). [Paper]
  • Human Interaction:
    • IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
  • Cross-Domain:
    • UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
    • AutoLabel: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 (University of Trento). [Paper][PyTorch]
    • DALL-V: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 (University of Trento). [Paper][PyTorch]
  • Multi-Camera Editing:
    • TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]
  • Instructional/Procedural Video:
    • ProcedureVRL: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 (Meta). [Paper]
    • Paprika: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 (Salesforce). [Paper][PyTorch]
    • StepFormer: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 (Samsung). [Paper]
    • E3P: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • VLaMP: "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 (Meta). [Paper]
    • VINA: "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 (Meta). [Paper][Website]
    • PREGO: "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 (Sapienza University of Rome, Italy). [Paper][Code (in construction)]
  • Continual Learning:
    • PIVOT: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 (KAUST). [Paper]
  • 3D:
    • MaST-Pre: "Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos", ICCV, 2023 (CloudWalk, China). [Paper][PyTorch]
    • EPIC-Fields: "EPIC Fields: Marrying 3D Geometry and Video Understanding", NeurIPS, 2023 (Oxford + Bristol). [Paper][Website]
  • Audio-Video:
    • AVGN: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 (KAIST). [Paper]
  • Event Camera:
    • EventTransAct: "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 (UCF). [Paper][PyTorch][Website]
  • Long Video:
    • EgoSchema: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
    • KTS: "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 (Meta). [Paper]
    • TCR: "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 (Google). [Paper]
    • MC-ViT: "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 (DeepMind). [Paper]
    • VideoAgent: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 (Stanford). [Paper]
  • Video Story:
    • YouTube-News-Timeline: "Video Timeline Modeling For News Story Understanding", NeurIPS (Datasets and Benchmarks), 2023 (Google). [Paper][GitHub]
  • Analysis:
    • VTCD: "Understanding Video Transformers via Universal Concept Discovery", arXiv, 2024 (Toyota). [Paper][Website]

[Back to Overview]


References