
ManagerTower

This repo is the official PyTorch implementation of the paper:

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.

ACL 2023 (Oral) | Association for Computational Linguistics

Paper | Arxiv | Model | Slides | Video(EN) | Video(CN) | Blog(CN) | Tweet(EN)

Abstract

Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performances on various downstream VL tasks, especially 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.

Architecture

(Figure: overall ManagerTower architecture)
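As a rough, illustrative reading of the manager mechanism (not the repository's actual code; all module and tensor names below are hypothetical), each manager can be pictured as learning aggregation weights over the layer-wise outputs of a pre-trained uni-modal expert and feeding the weighted combination into its cross-modal layer:

```python
# Toy sketch of a "manager" that adaptively aggregates uni-modal expert layers.
# Illustration only; names and shapes are assumptions, not the ManagerTower code.
import torch
import torch.nn as nn

class AggregationManager(nn.Module):
    def __init__(self, num_expert_layers: int, hidden_size: int):
        super().__init__()
        # One learnable logit per expert layer; softmax turns them into mixing weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_expert_layers))
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, expert_layer_outputs: torch.Tensor) -> torch.Tensor:
        # expert_layer_outputs: [num_layers, batch, seq_len, hidden]
        weights = torch.softmax(self.layer_logits, dim=0)                  # [num_layers]
        aggregated = torch.einsum("l,lbsh->bsh", weights, expert_layer_outputs)
        return self.proj(aggregated)  # passed on to the corresponding cross-modal layer
```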

BridgeTower vs. ManagerTower

(Figure: comparison of BridgeTower and ManagerTower)

Main Results

(Figure: main results)

Visualization

(Figure: visualization)

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments. Please remember to run wandb login and paste your API token before running the scripts.

Dataset Preparation

Checkpoints

We provide checkpoints for reproducing our results; you can download them from here.

Pre-training on Image-Text Datasets

# Pre-train ManagerTower Base Model
bash scripts/pre_train.sh

Fine-tuning on Downstream VL Tasks

  • VQAv2 evaluation requires submitting the JSON file in the logs/ directory to the eval.ai evaluation server to obtain the test-dev and/or test-std scores (see the format sanity-check sketch below).
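For reference, eval.ai expects the standard VQA Challenge result format, i.e. a JSON list of question_id/answer records. The snippet below is only a quick, optional sanity check before uploading; the file name is a placeholder and should be replaced by the actual file produced under logs/.

```python
import json

# Placeholder path: point this at the result file produced under logs/.
with open("logs/vqa_submit_placeholder.json") as f:
    results = json.load(f)

# The standard VQA Challenge format is a JSON list of {"question_id": int, "answer": str}.
assert isinstance(results, list) and results
for record in results[:5]:
    assert {"question_id", "answer"} <= set(record)
    print(record["question_id"], "->", record["answer"])
```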
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-Flickr30K with VLP (following ALBEF, ITC is used to sample hard negatives for ITM)
bash scripts/ftfpt_base_flickr.sh
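The IRTR script follows ALBEF in using image-text contrastive (ITC) similarities to mine in-batch hard negatives for the image-text matching (ITM) head. The sketch below only illustrates that idea; tensor and function names are hypothetical and the actual implementation may differ.

```python
# ALBEF-style hard negative mining for ITM from in-batch ITC similarities (sketch).
import torch

def sample_hard_negatives(sim_i2t: torch.Tensor, sim_t2i: torch.Tensor):
    """Sample one in-batch hard negative per image and per text, with probability
    proportional to the ITC similarity, excluding the matched (positive) pair."""
    weights_i2t = torch.softmax(sim_i2t, dim=1).clone()
    weights_t2i = torch.softmax(sim_t2i, dim=1).clone()
    weights_i2t.fill_diagonal_(0)  # never pick the positive text for an image
    weights_t2i.fill_diagonal_(0)  # never pick the positive image for a text
    hard_text_idx = torch.multinomial(weights_i2t, 1).squeeze(1)   # [batch]
    hard_image_idx = torch.multinomial(weights_t2i, 1).squeeze(1)  # [batch]
    return hard_text_idx, hard_image_idx

# The (image_i, text_{hard_text_idx[i]}) and (image_{hard_image_idx[i]}, text_i) pairs
# are then scored by the ITM head as negatives alongside the matched positives.
```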

Citation

@article{xu2023managertower,
  title={ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning},
  author={Xu, Xiao and Li, Bei and Wu, Chenfei and Tseng, Shao-Yen and Bhiwandiwalla, Anahita and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2306.00103},
  year={2023}
}

Acknowledgement

We are highly grateful for the public code of the following papers; our code is partly based on them: