Running CAGroup3D on Windows 10

Figure (ScreenCapture_2023-04-08-23-47-01.png): Detection from a "finetuned" model, run entirely on a Windows 10 PC.

Environments

  • (Passed) Testing environment: X299 + i7-7820X + Win10 22H2 + RTX 2080 Ti + 72GB 128GB DDR4 + 480GB SATA SSD
  • (Passed) Another testing environment: C602 + 2x E5-2650V4 + Win10 22H1 + GTX 1080 Ti + 256GB DDR4 + 500GB SATA SSD
  • WSL2 does not work: CUDA crashes hopelessly.

Objective

  • Train with joint dataset (ScanNetV2 + Sun RGB-D) and examine the result against both tasks.
  • Not focused on reproducing the data (a different CUDA version will obviously produce different results).
  • Some live demos with Jupyter Notebook.

Before cloning this repo

# python=3.11 will crash in application!
conda create -n cagroup3d-env -c conda-forge scikit-learn python=3.10
conda activate cagroup3d-env

# Gamble on cu117 (nvidia-smi shows RTX 2080 Ti + CUDA 12.1), as PyTorch also ships a cu117 build
pip install spconv-cu117

# Yea, need torch. Must be 1.13.1.
pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117

# Version conflict between numpy and numba
conda install -c conda-forge numba

# OMP: Error #15. Alternatively, you can set an environment variable from Python:
# os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
conda install -c conda-forge nomkl

# Tensorboard
conda install -c conda-forge tensorboard

# For evaluation
conda install -c conda-forge terminaltables

# For "plan a" of data visualisation
conda install -c conda-forge matplotlib

# For "plan b" of data visualisation
conda install -c conda-forge mayavi
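
If installing `nomkl` is not an option, the environment-variable workaround mentioned in the comment above can be sketched as follows. The assumption here is that the variable has to be set before NumPy or PyTorch is imported, since those imports load the OpenMP runtime; the error message itself describes this flag as an unsafe workaround, so treat it as a last resort.

```python
import os

# Set the flag first: it is assumed to be read when the OpenMP runtime loads,
# which happens on "import numpy" / "import torch".
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

flag = os.environ["KMP_DUPLICATE_LIB_OK"]
print(f"KMP_DUPLICATE_LIB_OK={flag}")

# Heavy imports would follow here, e.g.:
# import torch
```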
  • CPU-only MinkowskiEngine is troublesome. Head to this git issue and download the Windows package. If you're not using Python 3.10, you may need to build the package manually. WSL2 does not work.
pip install ninja open3d
pip install MinkowskiEngine-0.5.4-cp310-cp310-win_amd64.whl
# Should return True
python -c "import torch; print(torch.cuda.is_available())"
# Should return True (GPU ME) or False (CPU ME)
python -c "import MinkowskiEngine as ME; from MinkowskiEngineBackend._C import is_cuda_available; print(is_cuda_available())"
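
The checks above can be folded into a single pre-flight script. This is a minimal sketch: `check_environment` and the module list are assumptions based on the packages installed in this section, and it only tests that each module can be found, not that it works with CUDA.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)  # 3.11 is reported to crash in this setup
REQUIRED_MODULES = ["torch", "spconv", "MinkowskiEngine", "numba", "open3d"]

def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each prerequisite to True (found) or False."""
    report = {"python_ok": sys.version_info[:2] == REQUIRED_PYTHON}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report

if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key:18s} {'OK' if ok else 'CHECK'}")
```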

After cloning this repo

  • Endless CPP debugging:
cd CAGroup3D
python setup.py develop > logs/pcdet.txt
  • Additional CUDA ops look fine:
# rotate iou ops
cd CAGroup3D/pcdet/ops/rotated_iou/cuda_op
python setup.py install > ../../../../logs/cuda_ops_rotated_iou.txt
# knn ops
cd ../../knn
python setup.py develop > ../../../logs/cuda_ops_knn.txt
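
The redirects above fail if the logs directory does not exist yet. A small wrapper can create it and run the three builds in order. This is a sketch under assumptions: `run_logged` and `build_all` are hypothetical helpers, the paths are taken relative to the repository root, and unlike the plain `>` redirect it also captures stderr.

```python
import subprocess
import sys
from pathlib import Path

def run_logged(cmd, log_path, cwd=None):
    """Run cmd, writing stdout and stderr to log_path; return the exit code."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    with log_path.open("w", encoding="utf-8") as log:
        result = subprocess.run(cmd, cwd=cwd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode

# (working directory, setup.py arguments, log file), all relative to the repo root
BUILD_STEPS = [
    (".", ["setup.py", "develop"], "logs/pcdet.txt"),
    ("pcdet/ops/rotated_iou/cuda_op", ["setup.py", "install"], "logs/cuda_ops_rotated_iou.txt"),
    ("pcdet/ops/knn", ["setup.py", "develop"], "logs/cuda_ops_knn.txt"),
]

def build_all(repo_root):
    """Run every build step, stopping at the first failure."""
    repo_root = Path(repo_root).resolve()
    for rel_dir, args, log in BUILD_STEPS:
        code = run_logged([sys.executable, *args], repo_root / log, cwd=repo_root / rel_dir)
        if code != 0:
            raise RuntimeError(f"{rel_dir}: build failed, see {log}")
```

Calling `build_all("CAGroup3D")` from the parent directory would then reproduce the three commands with their logs in one place.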

Dataset

  • Use the prepared dataset. DATA_PATH may be a full (absolute) path. The datasets are currently placed at ../data/scannet_data/ScanNetV2 and ../data/sunrgbd_data/sunrgbd.
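
Since an epoch takes days, it is worth confirming the dataset layout before launching anything. The sketch below assumes the relative paths are resolved from tools/ (where the later commands are run); `missing_datasets` and `EXPECTED_DATASETS` are hypothetical names, not part of the repo.

```python
from pathlib import Path

# Locations quoted in this section, assumed relative to the tools/ directory.
EXPECTED_DATASETS = {
    "scannet": "../data/scannet_data/ScanNetV2",
    "sunrgbd": "../data/sunrgbd_data/sunrgbd",
}

def missing_datasets(base_dir="."):
    """Return the names of datasets whose directory is absent under base_dir."""
    base = Path(base_dir)
    return [name for name, rel in EXPECTED_DATASETS.items()
            if not (base / rel).is_dir()]

if __name__ == "__main__":
    missing = missing_datasets(".")
    print("all datasets found" if not missing else f"missing: {missing}")
```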

Training

  • Note that CUDA_VISIBLE_DEVICES=ALL is omitted and num_gpus=1 in this case. Poor Windows PC.
  • Also, in CAGroup3D.yaml: BATCH_SIZE_PER_GPU: 1
  • Windows doesn't support bash in this case, and no CMD / BAT files are provided this time.
  • Note the actual process arguments. Also switched to torchrun.
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7861 train.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --ckpt_save_interval 1 --extra_tag cagroup3d-win10-scannet-train --fix_random_seed > ../logs/train_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7862 train.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt_save_interval 1 --extra_tag cagroup3d-win10-sunrgbd-train --fix_random_seed > ../logs/train_sunrgbd.txt
  • Tensorboard (cmd output is messy), original code used tensorboardX:
#scannet
tensorboard --logdir output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train/tensorboard
#sunrgbd
tensorboard --logdir output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train/tensorboard
  • Train from a pretrained model (remember to move the checkpoint directory and calculate the epoch; in the commands below, --epochs is the checkpoint epoch plus one):
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7861 train.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --pretrained_model ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --epochs 9 --ckpt_save_interval 1 --extra_tag cagroup3d-win10-scannet-train --fix_random_seed > ../logs/train_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7862 train.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --pretrained_model ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --epochs 14 --ckpt_save_interval 1 --extra_tag cagroup3d-win10-sunrgbd-train --fix_random_seed > ../logs/train_sunrgbd.txt
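
Because no CMD / BAT launchers are provided, the commands above can be assembled from Python instead. The following is a sketch, not part of the repo: `checkpoint_epoch` and `build_train_command` are hypothetical helpers, all flags are copied from the commands above, and it assumes `--epochs` is the total epoch count (checkpoint epoch plus the extra epochs), which is what the 8 → 9 and 13 → 14 examples suggest.

```python
import re

def checkpoint_epoch(ckpt_path):
    """Extract N from a path ending in checkpoint_epoch_N.pth."""
    match = re.search(r"checkpoint_epoch_(\d+)\.pth$", ckpt_path)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {ckpt_path}")
    return int(match.group(1))

def build_train_command(dataset, port, ckpt=None, extra_epochs=1):
    """Assemble the torchrun argument list for 'scannet' or 'sunrgbd'."""
    cmd = [
        "torchrun", "--nproc_per_node=1", f"--rdzv_endpoint=localhost:{port}",
        "train.py", "--launcher", "pytorch",
        "--cfg_file", f"cfgs/{dataset}_models/CAGroup3D.yaml",
    ]
    if ckpt is not None:
        # Assumption: --epochs is the total, so resume epoch + extra epochs.
        total_epochs = checkpoint_epoch(ckpt) + extra_epochs
        cmd += ["--pretrained_model", ckpt, "--ckpt", ckpt,
                "--epochs", str(total_epochs)]
    cmd += ["--ckpt_save_interval", "1",
            "--extra_tag", f"cagroup3d-win10-{dataset}-train",
            "--fix_random_seed"]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, cwd="tools")`, with stdout pointed at a log file to mirror the `>` redirect.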

Hours for training

  • ScanNetV2: around 96 hours for a single epoch (BATCH_SIZE=16).
  • SUNRGBD V1: around 36 hours for a single epoch (BATCH_SIZE=16).
  • BATCH_SIZE has no effect. Keep waiting.

Evaluation

  • Although the CPU usage is not very intense, do not run both evals at the same time. You may crash the OS (kernel) and get the scary GPU error code 43. Disabling and then re-enabling the GPU driver will bring it back.
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7863 test.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train/ckpt/checkpoint_epoch_1.pth --extra_tag cagroup3d-win10-scannet-eval > ../logs/eval_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7864 test.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --extra_tag cagroup3d-win10-sunrgbd-eval > ../logs/eval_sunrgbd.txt
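
One way to make sure the two evaluations never overlap is to launch them from a single script that waits for each process to finish. A minimal sketch, where `run_sequentially` is a hypothetical helper and the command lists are whatever you would otherwise type by hand:

```python
import subprocess

def run_sequentially(commands, cwd=None):
    """Run each command to completion before starting the next one.

    Returns the list of exit codes, in the order the commands were given.
    """
    exit_codes = []
    for cmd in commands:
        # subprocess.run blocks until the process exits, so only one
        # evaluation uses the GPU at any time.
        exit_codes.append(subprocess.run(cmd, cwd=cwd).returncode)
    return exit_codes
```

Passing the ScanNet and SUN RGB-D `torchrun ... test.py` argument lists with `cwd="tools"` would run them back to back.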

Hours for evaluation

  • ScanNetV2: Takes around 2 hours.
  • SUNRGBD V1: Takes around 5 hours.
  • args_workers: No obvious effect. Keep waiting.

Performance / Pretrained model and logs

  • Includes our "epoch1" result, the pretrained model from the original authors (to validate our modified code), and potentially our "finetuned model" (e8+1, e12+1).
| Task | scannet-e1 | sunrgbd-e1 | scannet-e8 | sunrgbd-e12 | scannet-e9 | sunrgbd-e13 |
|---|---|---|---|---|---|---|
| Huggingface | cagroup3d-win10-scannet | cagroup3d-win10-sunrgbd | Main repo | Main repo | cagroup3d-win10-scannet | cagroup3d-win10-sunrgbd |
| mAP_0.25 | 2.6154 | 4.3875 | 74.0403 | 65.9022 | 71.2267 | 65.8974 |
| mAP_0.50 | 0.1057 | 0.7867 | 61.2493 | 47.9277 | 56.7902 | 48.2091 |
| mAR_0.25 | 8.0527 | 7.8397 | 89.6589 | 93.2833 | 89.8917 | 93.6769 |
| mAR_0.50 | 0.7545 | 2.0583 | 76.1650 | 67.8665 | 73.8451 | 68.0411 |
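
To see what the extra epoch did, the pretrained columns (scannet-e8, sunrgbd-e12) can be compared against the finetuned ones (scannet-e9, sunrgbd-e13). The sketch below only copies the numbers from the table above; `metric_deltas` is a hypothetical helper.

```python
# Values copied from the table: (pretrained checkpoint, after one more epoch).
RESULTS = {
    "scannet": {  # e8 -> e9
        "mAP_0.25": (74.0403, 71.2267),
        "mAP_0.50": (61.2493, 56.7902),
        "mAR_0.25": (89.6589, 89.8917),
        "mAR_0.50": (76.1650, 73.8451),
    },
    "sunrgbd": {  # e12 -> e13
        "mAP_0.25": (65.9022, 65.8974),
        "mAP_0.50": (47.9277, 48.2091),
        "mAR_0.25": (93.2833, 93.6769),
        "mAR_0.50": (67.8665, 68.0411),
    },
}

def metric_deltas(results):
    """Return {dataset: {metric: after - before}}, rounded to 4 decimals."""
    return {
        dataset: {metric: round(after - before, 4)
                  for metric, (before, after) in metrics.items()}
        for dataset, metrics in results.items()
    }

for dataset, deltas in metric_deltas(RESULTS).items():
    for metric, delta in deltas.items():
        print(f"{dataset:8s} {metric}: {delta:+.4f}")
```

With these numbers, the extra epoch lowers ScanNet mAP_0.25 by about 2.81 points, while every SUN RGB-D metric moves by less than 0.4 points.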

Visualize data

  • No explanation in the original repo. Visualization is expected to follow the provided demo.py, which was rewritten from test.py.
  • Draw 10 random scenes from the dataset (scannet = 312, sunrgbd = 5050).
  • A detection box is drawn if its prediction score exceeds draw_scores or if it is the best prediction.
  • Color scale: HSL across class labels. The sequence is aligned with CLASS_NAMES.
  • Open3D (plan a) is the expected visualisation backend. Mayavi has issues with view perspective, although it can show prediction scores as 3D labels.
cd tools/
#scannet
python demo.py --cfg_file ../tools/cfgs/scannet_models/CAGroup3D.yaml --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --draw_scores 0.5 --draw_idx 10
#sunrgbd
python demo.py --cfg_file ../tools/cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_12.pth --draw_scores 0.4 --draw_idx 10
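
The drawing rules listed above can be sketched in plain Python. This is an illustration of the described behaviour, not the code in demo.py: `boxes_to_draw`, `class_colors`, and `pick_scenes` are hypothetical names, and the even hue spacing with fixed lightness and saturation is an assumption about what "HSL across class labels" means.

```python
import colorsys
import random

def boxes_to_draw(scores, draw_scores):
    """Indices of boxes scoring above draw_scores, plus the best box."""
    if not scores:
        return []
    best = max(range(len(scores)), key=lambda i: scores[i])
    above = {i for i, s in enumerate(scores) if s > draw_scores}
    return sorted(above | {best})  # the best box is kept even if below threshold

def class_colors(class_names, lightness=0.5, saturation=1.0):
    """Spread hues evenly over the classes, in CLASS_NAMES order (RGB in [0, 1])."""
    n = len(class_names)
    return {name: colorsys.hls_to_rgb(i / n, lightness, saturation)
            for i, name in enumerate(class_names)}

def pick_scenes(dataset_size, count=10, seed=None):
    """Choose `count` distinct random scene indices from the dataset."""
    return random.Random(seed).sample(range(dataset_size), count)
```

For example, `pick_scenes(312)` would select ten ScanNet scenes, and `boxes_to_draw(scores, 0.5)` mirrors `--draw_scores 0.5`.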

Control keys on Open3D

  • Left Click: Rotate from focal point
  • Ctrl + Left Click: Pan
  • Shift: Rotate per axis
  • PrintScreen: Print screen (Open3D Windows) in PNG + Save meta file (JSON)
  • Q: Quit

Gallery

Figure (ScreenCapture_2023-04-10-09-09-50.png): Example from SunRGBD.

Rants

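import MinkowskiEngine as ME
from MinkowskiEngineBackend._C import is_cuda_available
# Force the CPU when ME was built without CUDA; otherwise leave the device unspecified
me_device = None if is_cuda_available() else "cpu"
# c: coordinates, f: features (both prepared by the caller)
x = ME.SparseTensor(coordinates=c, features=f, device=me_device)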
  • Just force everything onto the CPU. BATCH_SIZE_PER_GPU must not be 1.
  • CHECK_CUDA failed, so the checks are skipped. Meanwhile switched to __device__ inline int check_rect_cross. Now there is a memory issue. Make sure ME runs on the CPU and pcdet runs on CUDA.
  • CUDA error: device-side assert triggered. hint1 hint2 hint3 hint4 hint5 eval() on CPU.
  • Indexing error revealed, e.g. Real debug. knn force cuda: Done.
  • long should be int64_t. long in Flutter. stackoverflow
  • sunrgbd's code coverage is larger than scannet's, while the dataset is 2x smaller. Test with this dataset first: it takes 30-60 mins to crash, whereas scannet takes 2 hrs.
  • find_unused_parameters=True is mandatory now. Not sure if we can train with multiple GPUs later on.
  • Train from checkpoint. Maybe have some spare time to train a few more EPs. 1 EP should be feasible since we don't need to change code.
  • Why can't the model be evaluated? Somehow some raw data is in ndarray instead of tensor. The upside is that it is already on the CPU.
  • Visualization / play with estimation. There is a result.pkl written via pickle.dump without any explanation, which is insufficient to visualize. Oh no, demo.py is another rabbit hole. Remade with test.py and it still crashes. There are so many limitations in Open3D.
  • Adding GPU support from a teammate's great work: his fork from this repo. His tryhard mod of Minkowski Engine according to this repo. PR-ed. Will test with CPU first.
  • TODO: Maybe export the detections to TensorBoard also. Open3D for TensorBoard.
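
On the `long` versus `int64_t` rant above: 64-bit Windows uses the LLP64 data model, where a C `long` is 4 bytes, while 64-bit Linux uses LP64, where it is 8 bytes. Code that relies on `long` holding a 64-bit index therefore behaves differently on Windows, which is presumably why the port needed the change. The widths on the current machine can be checked with ctypes:

```python
import ctypes

# sizeof(long) depends on the platform's data model:
# 4 bytes on 64-bit Windows (LLP64), 8 bytes on 64-bit Linux (LP64).
long_bytes = ctypes.sizeof(ctypes.c_long)

# int64_t is 8 bytes everywhere, which is why it is the portable choice.
int64_bytes = ctypes.sizeof(ctypes.c_int64)

print(f"sizeof(long)    = {long_bytes}")
print(f"sizeof(int64_t) = {int64_bytes}")
```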