Commit 5d93306: Code Version 1

Tianlong-Chen committed Dec 11, 2021
1 parent 702cd76 commit 5d93306
Showing 67 changed files with 8,505 additions and 2 deletions.
Binary file added .DS_Store
Binary file added Figs/.DS_Store
Binary file added Figs/architecture.png
2 changes: 1 addition & 1 deletion LICENSE
@@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.
147 changes: 146 additions & 1 deletion README.md
@@ -1 +1,146 @@
# UVC
# Unified Vision Transformer Compression

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

Code for the paper: [Unified Vision Transformer Compression](https://openreview.net/pdf?id=9jsZiUgkCZP).

Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, Zhangyang Wang



## Overall Results

Extensive experiments with several DeiT backbones on ImageNet consistently verify the effectiveness of our proposal. For example, UVC on DeiT-Tiny (with/without distillation tokens) yields around 50% FLOPs reduction with little performance degradation (only 0.3%/0.9% accuracy loss compared to the baseline).

| Method | Top-1 Acc (%) | FLOPs (G) | Compression Ratio (%) |
| :------------- | :------------ | :-------- | :-------------------- |
| DeiT-Small | 79.8 | 4.6 | 100 |
| SCOP | 77.5 (-2.3) | 2.6 | 56.4 |
| PoWER | 78.3 (-1.5) | 2.7 | 58.7 |
| HVT | 78.0 (-1.8) | 2.4 | 52.2 |
| Patch Slimming | 79.4 (-0.4) | 2.6 | 56.5 |
| UVC (Ours) | 79.44 (-0.36) | 2.65 | 57.61 |
| UVC (Ours) | 78.82 (-0.98) | 2.32 | 50.41 |



## Overview of Proposed UVC

We formulate and solve UVC as a unified constrained optimization problem. It simultaneously learns model weights, layer-wise pruning ratios/masks, and skip configurations, under a distillation loss and an overall budget constraint.

![architecture](Figs/architecture.png)
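
Schematically, the problem can be written as below. This notation is our shorthand rather than a verbatim copy of the paper's equations: $w$ are the model weights, $m$ the layer-wise pruning ratios/masks, $s$ the skip configurations, and $\beta$ the resource budget (e.g. `--budget 0.5` in the training command further down).

```latex
\min_{w,\,m,\,s}\ \mathcal{L}_{\mathrm{distill}}\big(f_{w,m,s},\ f_{\mathrm{teacher}}\big)
\qquad \text{s.t.} \qquad
\mathrm{FLOPs}(m, s) \le \beta \cdot \mathrm{FLOPs}_{\mathrm{full}}
```

The constrained problem is then solved with the primal-dual procedure used in the first training stage below.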



## Implementations of UVC

### Set the Environment

```bash
# Create and activate the conda environment
conda create -n vit python=3.6
conda activate vit

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

pip install tqdm scipy timm
pip install ml_collections
pip install tensorboard

# Install NVIDIA apex for mixed-precision training
git clone https://github.com/NVIDIA/apex
cd apex

# Build with C++/CUDA extensions; the second command is the Python-only build,
# which can be used instead if the extension build fails
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install -v --disable-pip-version-check --no-cache-dir ./
```



### Running commands

Training consists of two stages:

* The first stage is **UVC Training**, which optimizes the architecture with a primal-dual algorithm to find the optimal block-wise pruning layout and skip configuration (a schematic sketch of one update step follows this list).
* The second stage is **Post Training**, in which the architecture is fixed and only the weights are updated so that the network can regain accuracy.
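
For intuition, here is a minimal sketch of one primal-dual update. This is not the actual `joint_train.py` logic; the model interface (`model(images, gates)`, `model.flops(gates)`) and the multiplier tensor `z` are assumptions used purely for illustration.

```python
import torch
import torch.nn.functional as F

def primal_dual_step(model, gates, z, images, targets, budget, opt_w, opt_g, lr_z=0.1):
    """One schematic primal-dual update (illustrative only).

    z is a scalar tensor (requires_grad=False) acting as the Lagrange multiplier.
    """
    logits = model(images, gates)                  # forward pass with current gates/masks
    task_loss = F.cross_entropy(logits, targets)   # stands in for the distillation loss
    constraint = model.flops(gates) - budget       # <= 0 once the FLOPs budget is met
    lagrangian = task_loss + z * constraint

    # Primal step: descend on weights and gate/mask parameters.
    opt_w.zero_grad(); opt_g.zero_grad()
    lagrangian.backward()
    opt_w.step(); opt_g.step()

    # Dual step: ascend on the multiplier, keeping it non-negative.
    with torch.no_grad():
        z.add_(lr_z * constraint.detach()).clamp_(min=0.0)
    return task_loss.item(), constraint.item()
```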

#### Stage 1: UVC Training

```bash
python -W ignore -m torch.distributed.launch \
--nproc_per_node=2 \
--master_port 6019 joint_train.py \
--gpu_num '0,1' \
--uvc_train \
--model_type deit_tiny_patch16_224 \
--model_path https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth \
--distillation-type soft \
--distillation-alpha 0.1 \
--train_batch_size 512 \
--num_epochs 30 \
--eval_every 1000 \
--flops_with_mhsa 1 \
--zlr_schedule_list "1,5,9,13,17" \
--learning_rate 1e-4 \
--enable_deit 0 \
--budget 0.5 \
--enable_pruning 1 \
--enable_block_gating 1 \
--enable_patch_gating 1 \
--gating_weight 5e-4 \
--patch_weight 5 \
--patch_l1_weight 0.01 \
--patchloss "l1" \
--use_gumbel 1 \
--glr 0.1 \
--patchlr 0.01 \
--num_workers 64 \
--seed 730 \
--output_dir mc_deit_tiny_patch16_224_with_patch \
--log_interval 1000 \
--eps 0.1 \
--eps_decay 0.92 \
--enable_warmup 1 \
--warmup_epochs 5 \
--warmup_lr 1e-4 \
--z_grad_clip 0.5 \
--gating_interval 50
```

#### Stage 2: Post Training

```bash
python -m torch.distributed.launch \
--nproc_per_node=2 --master_port 6382 post_train.py \
--pretrained 0 \
--model_type "deit_small_patch16_224" \
--model_path https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth \
--checkpoint_dir /home/shixing/deit_small_patch16_224_11.pth.tar \
--distillation-type soft \
--distillation-alpha 0.1 \
--train_batch_size 256 \
--gpu_num '2,3' \
--epochs 120 \
--eval_every 1000 \
--output_dir exp/deit_small_nasprune_0.58 \
--num_workers 64
```



## Citation

```
TBD
```



## Acknowledgement

ViT: https://github.com/jeonsworld/ViT-pytorch

ViT: https://github.com/google-research/vision_transformer

DeiT: https://github.com/facebookresearch/deit

T2T-ViT: https://github.com/yitu-opensource/T2T-ViT
Binary file added UVC/.DS_Store
Binary file added UVC/T2TViT/.DS_Store
12 changes: 12 additions & 0 deletions UVC/T2TViT/LICENSE
@@ -0,0 +1,12 @@
The Clear BSD License

Copyright (c) [2012]-[2021] Shanghai Yitu Technology Co., Ltd.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Shanghai Yitu Technology Co., Ltd. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY SHANGHAI YITU TECHNOLOGY CO., LTD. AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SHANGHAI YITU TECHNOLOGY CO., LTD. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
200 changes: 200 additions & 0 deletions UVC/T2TViT/README.md
@@ -0,0 +1,200 @@
# Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [arxiv](https://arxiv.org/abs/2101.11986)

### Update:
2021/03/11: updated results. Our T2T-ViT-14 with 21.5M parameters reaches 81.5% top-1 accuracy at 224x224 image resolution, and 83.3% top-1 accuracy at 384x384 resolution.

2021/02/21: T2T-ViT can be trained stably on most common GPUs (1080Ti, 2080Ti, TITAN V, V100) with '--amp' (Automatic Mixed Precision). On some specific GPUs, such as the Tesla T4, 'amp' may cause NaN loss when training T2T-ViT. If you get NaN loss during training, you can disable amp by removing '--amp' from the [training scripts](https://github.com/yitu-opensource/T2T-ViT#train).

2021/01/28: released code and uploaded most of the pretrained T2T-ViT models.

<p align="center">
<img src="https://github.com/yitu-opensource/T2T-ViT/blob/main/images/f1.png">
</p>

## Reference
If you find this repo useful, please consider citing:
```
@article{yuan2021tokens,
title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
journal={arXiv preprint arXiv:2101.11986},
year={2021}
}
```

Our code is based on the [official ImageNet example](https://github.com/pytorch/examples/tree/master/imagenet) by [PyTorch](https://pytorch.org/) and on [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) by [Ross Wightman](https://github.com/rwightman).


## 1. Requirements

[timm](https://github.com/rwightman/pytorch-image-models): `pip install timm==0.3.4`

torch>=1.4.0

torchvision>=0.5.0

pyyaml

Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this [script](https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4).

```
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
```

## 2. T2T-ViT Models


| Model | T2T Transformer | Top1 Acc | #params | MACs | Download|
| :--- | :---: | :---: | :---: | :---: | :---: |
| T2T-ViT-14 | Performer | 81.5 | 21.5M | 4.8G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.5_T2T_ViT_14.pth.tar)|
| T2T-ViT-19 | Performer | 81.9 | 39.2M | 8.5G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.9_T2T_ViT_19.pth.tar)|
| T2T-ViT-24 | Performer | 82.3 | 64.1M | 13.8G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/82.3_T2T_ViT_24.pth.tar)|
| T2T-ViT-14, 384 | Performer | 83.3 | 21.7M | | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/83.3_T2T_ViT_14.pth.tar)|
| T2T-ViT-24, Token Labeling | Performer | 84.2 | 65M | | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/84.2_T2T_ViT_24.pth.tar)|
| T2T-ViT_t-14 | Transformer | 81.7 | 21.5M | 6.1G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.7_T2T_ViTt_14.pth.tar) |
| T2T-ViT_t-19 | Transformer | 82.4 | 39.2M | 9.8G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/82.4_T2T_ViTt_19.pth.tar) |
| T2T-ViT_t-24 | Transformer | 82.6 | 64.1M | 15.0G| [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/82.6_T2T_ViTt_24.pth.tar) |

'T2T-ViT-14, 384' means T2T-ViT-14 trained with an image size of 384 x 384.

'T2T-ViT-24, Token Labeling' means T2T-ViT-24 trained with [Token Labeling](https://github.com/zihangJiang/TokenLabeling).

The three lite variants of T2T-ViT (for comparison with MobileNets):
| Model | T2T Transformer | Top1 Acc | #params | MACs | Download|
| :--- | :---: | :---: | :---: | :---: | :---: |
| T2T-ViT-7 | Performer | 71.7 | 4.3M | 1.1G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/71.7_T2T_ViT_7.pth.tar)|
| T2T-ViT-10 | Performer | 75.2 | 5.9M | 1.5G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/75.2_T2T_ViT_10.pth.tar)|
| T2T-ViT-12 | Performer | 76.5 | 6.9M | 1.8G | [here](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/76.5_T2T_ViT_12.pth.tar) |


### Usage
To use our pretrained T2T-ViT:
```
from models.t2t_vit import *
from utils import load_for_transfer_learning

# create the model
model = t2t_vit_14()

# load the pretrained weights; change num_classes to match your dataset.
# Different image sizes also work, since the position embedding is interpolated.
load_for_transfer_learning(model, '/path/to/pretrained/weights', use_ema=True, strict=False, num_classes=1000)
```
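
As a side note, the position-embedding interpolation mentioned in the comment above can be sketched as follows. This is an illustrative implementation, not the exact code inside `load_for_transfer_learning`; the helper name and the tensor layout (class token first) are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_num_patches):
    """Resize a (1, 1 + N, C) position embedding (class token first) to a new patch grid."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_size = int(patch_pos.shape[1] ** 0.5)
    new_size = int(new_num_patches ** 0.5)
    # reshape to a 2D grid, interpolate, then flatten back
    patch_pos = patch_pos.reshape(1, old_size, old_size, -1).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_size, new_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_size * new_size, -1)
    return torch.cat([cls_pos, patch_pos], dim=1)
```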


## 3. Validation

To test T2T-ViT-14 (with Performer in the T2T module), download the [T2T-ViT-14](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.5_T2T_ViT_14.pth.tar) checkpoint, then run:

```
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model t2t_vit_14 -b 100 --eval_checkpoint path/to/checkpoint
```
The results look like:

```
Test: [ 0/499] Time: 2.083 (2.083) Loss: 0.3578 (0.3578) Acc@1: 96.0000 (96.0000) Acc@5: 99.0000 (99.0000)
Test: [ 50/499] Time: 0.166 (0.202) Loss: 0.5823 (0.6404) Acc@1: 85.0000 (86.1569) Acc@5: 99.0000 (97.5098)
...
Test: [ 499/499] Time: 0.272 (0.172) Loss: 1.3983 (0.8261) Acc@1: 62.0000 (81.5000) Acc@5: 93.0000 (95.6660)
Top-1 accuracy of the model is: 81.5%
```

To test the three lite variants T2T-ViT-7, T2T-ViT-10 and T2T-ViT-12 (with Performer in the T2T module), download [T2T-ViT-7](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/71.7_T2T_ViT_7.pth.tar), [T2T-ViT-10](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/75.2_T2T_ViT_10.pth.tar) or [T2T-ViT-12](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/76.5_T2T_ViT_12.pth.tar), then run:

```
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model t2t_vit_7 -b 100 --eval_checkpoint path/to/checkpoint
```

To test T2T-ViT-14, 384 (83.3% top-1 accuracy):
```
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model t2t_vit_14 --img-size 384 -b 100 --eval_checkpoint path/to/T2T-ViT-14-384
```


## 4. Train

To train the three lite variants T2T-ViT-7, T2T-ViT-10 and T2T-ViT-12 (with Performer in the T2T module):

If only 4 GPUs are available:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./distributed_train.sh 4 path/to/data --model t2t_vit_7 -b 128 --lr 1e-3 --weight-decay .03 --amp --img-size 224
```

The top-1 accuracy with 4 GPUs is slightly lower than with 8 GPUs (by around 0.1%-0.3%).

If 8 GPUs are available:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model t2t_vit_7 -b 64 --lr 1e-3 --weight-decay .03 --amp --img-size 224
```


Train the T2T-ViT-14 and T2T-ViT_t-14 (run on 4 or 8 GPUs):

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./distributed_train.sh 4 path/to/data --model t2t_vit_14 -b 128 --lr 1e-3 --weight-decay .05 --amp --img-size 224
```

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model t2t_vit_14 -b 64 --lr 5e-4 --weight-decay .05 --amp --img-size 224
```
If you want to train our T2T-ViT on images with 384x384 resolution, please use '--img-size 384'.


Train the T2T-ViT-19, T2T-ViT-24 or T2T-ViT_t-19, T2T-ViT_t-24:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model t2t_vit_19 -b 64 --lr 5e-4 --weight-decay .065 --amp --img-size 224
```

## 5. Transfer T2T-ViT to CIFAR10/CIFAR100

| Model | ImageNet | CIFAR10 | CIFAR100| #params|
| :--- | :---: | :---: | :---: | :---: |
| T2T-ViT-14 | 81.5 |[98.3](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/cifar10_t2t-vit_14_98.3.pth) | [88.4](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/cirfar100_t2t-vit-14_88.4.pth) | 21.5M |
| T2T-ViT-19 | 81.9 |[98.4](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/cifar10_t2t-vit_19_98.4.pth) | [89.0](https://github.com/yitu-opensource/T2T-ViT/releases/download/main/cifar100_t2t-vit-19_89.0.pth) |39.2M |

We resize CIFAR10/100 images to 224x224 and finetune our pretrained T2T-ViT-14/19 on CIFAR10/100 by running:

```
CUDA_VISIBLE_DEVICES=0,1 python transfer_learning.py --lr 0.05 --b 64 --num-classes 10 --img-size 224 --transfer-learning True --transfer-model /path/to/pretrained/T2T-ViT-19
```

## 6. Visualization

To visualize the image features of ResNet50, open and run [visualization_resnet.ipynb](https://github.com/yitu-opensource/T2T-ViT/blob/main/visualization_resnet.ipynb) in Jupyter Notebook or JupyterLab; some results are shown below:

<p align="center">
<img src="https://github.com/yitu-opensource/T2T-ViT/blob/main/images/resnet_conv1.png" width="600" height="300"/>
</p>

To visualize the image features of ViT, open and run [visualization_vit.ipynb](https://github.com/yitu-opensource/T2T-ViT/blob/main/visualization_vit.ipynb) in Jupyter Notebook or JupyterLab; some results are shown below:

<p align="center">
<img src="https://github.com/yitu-opensource/T2T-ViT/blob/main/images/vit_block1.png" width="600" height="300"/>
</p>

To visualize attention maps, you can refer to this [file](https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb). A simple example visualizing the attention maps of attention blocks 4 and 5 is shown below, followed by a code sketch:


<p align="center">
<img src="https://github.com/yitu-opensource/T2T-ViT/blob/main/images/attention_visualization.png" width="600" height="400"/>
</p>
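
For reference, here is a minimal sketch of capturing attention maps with forward hooks, similar in spirit to the notebook linked above. The module path `model.blocks[i].attn.attn_drop` follows the timm/DeiT-style `Attention` module, where dropout is applied directly to the (B, heads, N, N) attention matrix; this path is an assumption, so adjust it if your implementation differs.

```python
import torch

def collect_attention_maps(model, images, block_ids=(4, 5)):
    """Run one forward pass and return {block_id: attention tensor of shape (B, heads, N, N)}."""
    attn_maps, handles = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            # In timm-style blocks, the output of attn_drop is the post-softmax attention matrix.
            attn_maps[idx] = output.detach().cpu()
        return hook

    for i in block_ids:
        handles.append(model.blocks[i].attn.attn_drop.register_forward_hook(make_hook(i)))
    with torch.no_grad():
        model(images)
    for h in handles:
        h.remove()
    return attn_maps

# Example: attn = collect_attention_maps(model, images)[4]
# attn[0, h, 0, 1:] is the CLS-to-patch attention of head h; reshape it to the patch grid
# (e.g. 14 x 14 for 224 x 224 inputs with 16 x 16 patches) to overlay on the image.
```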


Binary file added UVC/T2TViT/__pycache__/utils.cpython-38.pyc
10 changes: 10 additions & 0 deletions UVC/T2TViT/distributed_train.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# Copyright (c) [2012]-[2021] Shanghai Yitu Technology Co., Ltd.
#
# This source code is licensed under the Clear BSD License
# LICENSE file in the root directory of this file
# All rights reserved.
NUM_PROC=$1
shift
python -m torch.distributed.launch --nproc_per_node=$NUM_PROC main.py "$@"

Binary file added UVC/T2TViT/images/attention_visualization.png
Binary file added UVC/T2TViT/images/dog.png
Binary file added UVC/T2TViT/images/f1.png
Binary file added UVC/T2TViT/images/resnet_conv1.png
Binary file added UVC/T2TViT/images/vit_block1.png