Support CUDA 11 and CUDA 10 with some Clean Up #687

Open: KleinYuan wants to merge 4 commits into base: master

KleinYuan
Contributor

This PR addresses two issues:

  1. The Docker image cannot run on CUDA 11, i.e. on any Ampere-architecture GPU such as the RTX 3070 or 3080.
  2. The documented docker run command is broken: launching the container with /bin/sh makes pip unavailable.

This PR has been fully tested on a machine with an RTX 3070, where training runs successfully:

[01/27 22:31:39 fastreid.utils.checkpoint]: No checkpoint found. Training model from scratch
[01/27 22:31:39 fastreid.engine.train_loop]: Starting training from epoch 0
[01/27 22:32:24 fastreid.utils.events]:  eta: 1:21:55  epoch/iter: 0/199  total_loss: 7.745  loss_cls: 6.461  loss_triplet: 1.292  time: 0.2043  data_time: 0.0013  lr: 6.60e-05  max_mem: 4862M
[01/27 22:32:24 fastreid.utils.events]:  eta: 1:21:55  epoch/iter: 0/201  total_loss: 7.726  loss_cls: 6.445  loss_triplet: 1.26  time: 0.2043  data_time: 0.0010  lr: 6.63e-05  max_mem: 4862M
[01/27 22:33:08 fastreid.utils.events]:  eta: 1:23:00  epoch/iter: 1/399  total_loss: 5.311  loss_cls: 4.884  loss_triplet: 0.4171  time: 0.2082  data_time: 0.0010  lr: 9.75e-05  max_mem: 4862M
[01/27 22:33:09 fastreid.utils.events]:  eta: 1:23:00  epoch/iter: 1/403  total_loss: 5.273  loss_cls: 4.852  loss_triplet: 0.4111  time: 0.2085  data_time: 0.0010  lr: 9.82e-05  max_mem: 4862M
[01/27 22:33:58 fastreid.utils.events]:  eta: 1:23:21  epoch/iter: 2/599  total_loss: 3.677  loss_cls: 3.44  loss_triplet: 0.227  time: 0.2194  data_time: 0.0007  lr: 1.29e-04  max_mem: 4862M
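
Beyond the training log, it can be useful to check inside the container that the installed PyTorch build actually ships kernels for the GPU. The sketch below is only an illustration and not part of this PR: `parse_arch`, `build_supports`, and `report` are hypothetical helper names. It assumes the usual binary-compatibility rule (an `sm_XY` kernel runs on a device with the same major version and an equal or higher minor version) and ignores PTX JIT fallback. An RTX 3070 reports compute capability (8, 6).

```python
import re


def parse_arch(arch):
    """Turn an 'sm_86'-style string into (8, 6); return None for anything else."""
    m = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", arch)
    return (int(m.group(1)), int(m.group(2))) if m else None


def build_supports(capability, arch_list):
    """Return True if any compiled arch is binary compatible with the device.

    Assumption: same major version and compiled minor <= device minor.
    PTX ('compute_XY') entries are skipped, so this check is conservative.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed and parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def report():
    """Print the PyTorch/CUDA versions and whether GPU 0 is covered."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("device capability", cap, "supported:", build_supports(cap, archs))
```

Calling `report()` from a Python shell inside the container prints the result for GPU 0. If it reports that the device is not supported, the installed wheel was probably built for an older CUDA toolkit.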

The PR includes the following changes:

  1. Add a CUDA 11 Dockerfile.
  2. Move the Dockerfile to the root folder.
  3. Update the documentation for the docker commands.
  4. Remove the user management, which is not necessary.

RUN pip install tensorboard cmake # cmake from apt-get is too old
RUN pip install torch==1.10 torchvision==0.11.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
# RUN pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/cu101/torch_stable.html
# RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboard opencv-python cython yacs termcolor scikit-learn tabulate gdown gpustat faiss-gpu ipdb h5py
KleinYuan (Contributor, Author) commented on Jan 27, 2023:

I removed the https://pypi.tuna.tsinghua.edu.cn/simple index because it constantly timed out. Do we still need it? @L1aoXingyu
