
ManagerTower

This repo is the official PyTorch implementation of the paper:

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.

ACL 2023 (Oral) | Association for Computational Linguistics

Paper | Arxiv | Model | Slides | Video(EN) | Video(CN) | Blog(CN) | Tweet(EN)

Abstract

Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performances on various downstream VL tasks, especially 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.

Architecture

(Figure: overall ManagerTower architecture)
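As a rough, illustrative reading of the manager mechanism (not the repository's actual code; all module and tensor names below are hypothetical), each manager can be pictured as learning aggregation weights over the layer-wise outputs of a pre-trained uni-modal expert and feeding the weighted combination into its cross-modal layer:

```python
# Toy sketch of a "manager" that adaptively aggregates uni-modal expert layers.
# Illustration only; names and shapes are assumptions, not the ManagerTower code.
import torch
import torch.nn as nn

class AggregationManager(nn.Module):
    def __init__(self, num_expert_layers: int, hidden_size: int):
        super().__init__()
        # One learnable logit per expert layer; softmax turns them into mixing weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_expert_layers))
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, expert_layer_outputs: torch.Tensor) -> torch.Tensor:
        # expert_layer_outputs: [num_layers, batch, seq_len, hidden]
        weights = torch.softmax(self.layer_logits, dim=0)                  # [num_layers]
        aggregated = torch.einsum("l,lbsh->bsh", weights, expert_layer_outputs)
        return self.proj(aggregated)  # passed on to the corresponding cross-modal layer
```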

BridgeTower vs. ManagerTower

(Figure: comparison of BridgeTower and ManagerTower)

Main Results

(Figure: main results)

Visualization

(Figure: visualization)

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments. Please remember to run wandb login and paste your API token before running the scripts.

Dataset Preparation

Checkpoints

We provide checkpoints for reproducing our results; you can download them from here.

Pre-training on Image-Text Datasets

# Pre-train ManagerTower Base Model
bash scripts/pre_train.sh

Fine-tuning on Downstream VL Tasks

  • VQAv2 evaluation requires submitting the JSON file in the logs/ directory to the eval.ai evaluation server to obtain the test-dev and/or test-std scores (see the format sanity-check sketch below).
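For reference, eval.ai expects the standard VQA Challenge result format, i.e. a JSON list of question_id/answer records. The snippet below is only a quick, optional sanity check before uploading; the file name is a placeholder and should be replaced by the actual file produced under logs/.

```python
import json

# Placeholder path: point this at the result file produced under logs/.
with open("logs/vqa_submit_placeholder.json") as f:
    results = json.load(f)

# The standard VQA Challenge format is a JSON list of {"question_id": int, "answer": str}.
assert isinstance(results, list) and results
for record in results[:5]:
    assert {"question_id", "answer"} <= set(record)
    print(record["question_id"], "->", record["answer"])
```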
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-Flickr30K with VLP (following ALBEF, ITC is used to sample hard negatives for ITM)
bash scripts/ftfpt_base_flickr.sh
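The IRTR script follows ALBEF in using image-text contrastive (ITC) similarities to mine in-batch hard negatives for the image-text matching (ITM) head. The sketch below only illustrates that idea; tensor and function names are hypothetical and the actual implementation may differ.

```python
# ALBEF-style hard negative mining for ITM from in-batch ITC similarities (sketch).
import torch

def sample_hard_negatives(sim_i2t: torch.Tensor, sim_t2i: torch.Tensor):
    """Sample one in-batch hard negative per image and per text, with probability
    proportional to the ITC similarity, excluding the matched (positive) pair."""
    weights_i2t = torch.softmax(sim_i2t, dim=1).clone()
    weights_t2i = torch.softmax(sim_t2i, dim=1).clone()
    weights_i2t.fill_diagonal_(0)  # never pick the positive text for an image
    weights_t2i.fill_diagonal_(0)  # never pick the positive image for a text
    hard_text_idx = torch.multinomial(weights_i2t, 1).squeeze(1)   # [batch]
    hard_image_idx = torch.multinomial(weights_t2i, 1).squeeze(1)  # [batch]
    return hard_text_idx, hard_image_idx

# The (image_i, text_{hard_text_idx[i]}) and (image_{hard_image_idx[i]}, text_i) pairs
# are then scored by the ITM head as negatives alongside the matched positives.
```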

Citation

@article{xu2023managertower,
  title={ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning},
  author={Xu, Xiao and Li, Bei and Wu, Chenfei and Tseng, Shao-Yen and Bhiwandiwalla, Anahita and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2306.00103},
  year={2023}
}

Acknowledgement

We are highly grateful for the public code of the following papers; our code is partly based on them: