This repository has been copied from https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil and adapted as follows:
- Updated from Python 2 to Python 3
- Added the HM task
- Added TSV reading
- Slight changes to the shell scripts to allow custom loading & stopping
- Slight changes to the training scripts
Below follows the original README.
For a technical description of the algorithm, please see our paper:
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu*, Jiji Tang*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (* : equal contribution)
Preprint June 2020
ERNIE-ViL is a knowledge-enhanced approach to joint vision-language representation learning, and the first work to introduce structured knowledge into vision-language pre-training. Using structured knowledge obtained from scene graphs, ERNIE-ViL constructs three Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction. ERNIE-ViL can thus learn better joint vision-language representations that characterize the alignments of detailed semantics across vision and language.
Based on the scene graph parsed from the text using the Scene Graph Parser, we construct the Object Prediction, Attribute Prediction and Relationship Prediction tasks (a simplified sketch follows this list):
- Object Prediction: We randomly select a set of the objects in the scene graph, then mask and predict the corresponding words in the sentence.
- Attribute Prediction: For the object-attribute pairs in the scene graph, we randomly select a part of them to mask and predict the words related to the attribute nodes in the sentence.
- Relationship Prediction: For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes to mask and predict them.
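A minimal, illustrative Python sketch of this masking scheme, assuming a simplified scene-graph data structure; the node fields, masking ratio and function name are hypothetical and not the repository's actual implementation:

```python
import random

# Hypothetical, simplified scene graph parsed from the caption "black cat on the mat".
# Each node records the position of its word in the tokenized caption.
scene_graph = {
    "objects":       [{"word": "cat", "pos": 1}, {"word": "mat", "pos": 4}],
    "attributes":    [{"object": "cat", "word": "black", "pos": 0}],
    "relationships": [{"subj": "cat", "word": "on", "obj": "mat", "pos": 2}],
}

def select_mask_positions(graph, ratio=0.3):
    """Sample scene-graph nodes and return the caption token positions to mask
    for the Object, Attribute and Relationship Prediction tasks."""
    positions = set()
    # Object Prediction: mask the words of randomly selected object nodes.
    for node in graph["objects"]:
        if random.random() < ratio:
            positions.add(node["pos"])
    # Attribute Prediction: mask the attribute word of selected object-attribute pairs.
    for node in graph["attributes"]:
        if random.random() < ratio:
            positions.add(node["pos"])
    # Relationship Prediction: mask the relationship word of selected triplets.
    for node in graph["relationships"]:
        if random.random() < ratio:
            positions.add(node["pos"])
    return positions

tokens = ["black", "cat", "on", "the", "mat"]
mask_positions = select_mask_positions(scene_graph)
print([("[MASK]" if i in mask_positions else t) for i, t in enumerate(tokens)])
```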
Figure: Model architecture of ERNIE-ViL.
ERNIE-ViL uses large-scale image-text aligned datasets as pre-training data. We provide ERNIE-ViL models at two scale settings, both pre-trained on Conceptual Captions and SBU Captions.
- ERNIE-ViL base (lowercased | 12-text-stream-layer, 6-visual-stream-layer)
- ERNIE-ViL large (lowercased | 24-text-stream-layer, 6-visual-stream-layer)
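For orientation, a small illustration of the two scale settings in plain Python; this only mirrors the layer counts listed above and is not the format of the actual ernie_vil_config file:

```python
# Illustrative summary of the two released scales (not the real config schema).
ERNIE_VIL_SCALES = {
    "base":  {"text_stream_layers": 12, "visual_stream_layers": 6, "lowercased": True},
    "large": {"text_stream_layers": 24, "visual_stream_layers": 6, "lowercased": True},
}

for name, cfg in ERNIE_VIL_SCALES.items():
    print("ERNIE-ViL {}: {} text-stream / {} visual-stream layers".format(
        name, cfg["text_stream_layers"], cfg["visual_stream_layers"]))
```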
We fine-tune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning (VCR), Visual Question Answering (VQA), Cross-modal Image Retrieval (IR), Cross-modal Text Retrieval (TR) and Region-to-Phrase Grounding (RefCOCO+).
Code and pre-trained models related to the VCR task are public now, and those for more downstream tasks are planned to be released.
- Datasets
  - The training, validation and testing data of the VCR task are provided by the VCR website.
  - The organization of visual features is modified from ViLBERT; we directly use its data, which can be downloaded here.
  - Put all downloaded files under the directory "data/vcr".
- Task pre-training: We perform task pre-training on the VCR task, which is also known as task-specific pre-training. The trained models are as follows:
- Performance: Results on the VCR task for the ERNIE-ViL model, compared with the previous state-of-the-art pre-trained model (VILLA).
| Models | Q->A | QA->R | Q->AR |
| :-- | :-- | :-- | :-- |
| VILLA (task-pretrain) base | 75.54 (76.4) | 78.78 (79.1) | 59.75 (60.6) |
| ERNIE-ViL (task-pretrain) base | 76.37 (77.0) | 79.65 (80.3) | 61.24 (62.1) |
| VILLA (task-pretrain) large | 78.45 (78.9) | 82.57 (82.8) | 65.18 (65.7) |
| ERNIE-ViL (task-pretrain) large | 78.52 (79.2) | 83.37 (83.5) | 65.81 (66.3) |

Numerical results outside and inside parentheses represent the dev and test performance on the VCR task, respectively. Test results are obtained from the VCR leaderboard.
This code has been tested with Paddle Fluid 1.8 and Python 2.7. Other dependencies of ERNIE-ViL are listed in requirements.txt; you can install them with:

```bash
pip install -r requirements.txt
```
Please update LD_LIBRARY_PATH to include the CUDA, cuDNN and NCCL2 libraries before fine-tuning. You can easily run fine-tuning through configuration files. For example, you can fine-tune the ERNIE-ViL model on the VCR task with:
```bash
sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
```
The files needed for fine-tuning can be found in the given download links, including the vocabulary dictionary, configuration file and pre-trained parameters. Note that our fine-tuning experiments on VCR are carried out on 4 NVIDIA V100 (32GB) GPUs. If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., "conf/vcr/model_conf_vcr".
You can use the following commands to run inference with the fine-tuned models. For example, you can run inference on the VCR models for the different sub-tasks as follows:
Task Q->A:

```bash
sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```

Task QA->R:

```bash
sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
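As a reference for the Q->AR column in the results table above: an example counts as correct for Q->AR only when both the Q->A and the QA->R predictions are correct. A minimal sketch of that combination follows; the prediction format shown here is hypothetical and not the layout of the repository's $res_file:

```python
# Hypothetical per-example predictions; each entry is
# (predicted choice index, ground-truth choice index).
qa_preds  = [(0, 0), (2, 1), (3, 3)]   # Q->A: choose the answer
qar_preds = [(1, 1), (1, 1), (0, 2)]   # QA->R: choose the rationale

def accuracy(preds):
    return sum(pred == gold for pred, gold in preds) / len(preds)

# Q->AR counts an example as correct only when both the chosen answer
# and the chosen rationale are correct.
q_ar = [a_pred == a_gold and r_pred == r_gold
        for (a_pred, a_gold), (r_pred, r_gold) in zip(qa_preds, qar_preds)]

print("Q->A :", accuracy(qa_preds))       # 2/3
print("QA->R:", accuracy(qar_preds))      # 2/3
print("Q->AR:", sum(q_ar) / len(q_ar))    # 1/3
```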
You can cite the paper as below:

```
@article{yu2020ernie,
  title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
  author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2006.16934},
  year={2020}
}
```