Detection from a "finetuned" model, run entirely on a Windows 10 PC.
- (Passed) Testing environment: X299 + i7-7820X + Win10 22H2 + RTX 2080 Ti + 128GB DDR4 (previously 72GB) + 480GB SATA SSD
- (Passed) Another testing environment: C602 + 2x E5-2650V4 + Win10 22H1 + GTX 1080 Ti + 256GB DDR4 + 500GB SATA SSD
- WSL2 does not work. CUDA crashes hopelessly.
- Train with joint dataset (ScanNetV2 + Sun RGB-D) and examine the result against both tasks.
- Not focused on reproducing the original results (a different CUDA version will obviously produce different results).
Some live demo with Jupyter notebook
- (Optional) VSCode has a terminal that is not easily interrupted, and Notepad++ helps with non-ASCII display.
- Newest GPU driver. The CUDA version used in this repo is 11.7. Use `nvidia-smi` to check.
- Python 3.10 (previously 3.8+), to switch POSIX-only libraries to native implementations.
- Prepare at least 40GB (dataset) + 20GB (programs) of disk space!
- Install CUDA Toolkit 11.7.
- Install Anaconda. Miniconda would be more flexible.
- Install Microsoft C++ Build Tools.
- Modify `host_config.h`: `_MSC_VER >= 2000` (Ref: CSDN, Stack Overflow).
- Prepare a Python environment (Python 3.10 + PyTorch 1.13.1+cu117 + spconv cu117). Copy the commands manually into cmd:
# python=3.11 will crash in application!
conda create -n cagroup3d-env -c conda-forge scikit-learn python=3.10
conda activate cagroup3d-env
# Gamble on cu117 (nvidia-smi shows RTX 2080 Ti + CUDA 12.1), as PyTorch also ships cu117 builds
pip install spconv-cu117
# Yea, need torch. Must be 1.13.1.
pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117
# Version conflict between numpy and numba
conda install -c conda-forge numba
# OMP: Error #15. Alternatively, you can set an environment variable from Python:
#os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
conda install -c conda-forge nomkl
# Tensorboard
conda install -c conda-forge tensorboard
# For evaluation
conda install -c conda-forge terminaltables
# For "plan a" of data visualisation
conda install -c conda-forge matplotlib
# For "plan b" of data visualisation
conda install -c conda-forge mayavi
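The commented-out `os.environ` line in the commands above is the in-process alternative to installing `nomkl`. A minimal sketch, assuming the flag is set at the very top of the entry script, before `torch`, `numpy` or `MinkowskiEngine` are imported (the helper name `allow_duplicate_openmp` is hypothetical and not part of this repo):

```python
import os


def allow_duplicate_openmp() -> str:
    """Set KMP_DUPLICATE_LIB_OK so that OMP Error #15 does not abort the process.

    Assumption: this runs before any library that loads an OpenMP runtime
    (torch, numpy, MinkowskiEngine) is imported.
    """
    # setdefault keeps a value that was already exported in cmd.
    return os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")


if __name__ == "__main__":
    print("KMP_DUPLICATE_LIB_OK =", allow_duplicate_openmp())
```

The flag only silences the duplicate-runtime check, so removing the duplicate runtime with `nomkl` remains the cleaner option when it works.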
- CPU-only MinkowskiEngine is troublesome. Head to this git issue and download the Windows package. If you are not using Python 3.10, you may need to build the package manually. WSL2 does not work.
pip install ninja open3d
pip install MinkowskiEngine-0.5.4-cp310-cp310-win_amd64.whl
- GPU support: please check my teammate's great work: his fork from this repo, and his tryhard mod of Minkowski Engine according to this repo (PR to this repo). Read his guide to build ME from scratch; it takes some time to do so. It also supports only a single GPU at this moment (I'll try with dual GPUs later), and it hardcodes `cuda:0` because of C++-side issues.
- Now it is good to clone. Final check:
# Should return True
python -c "import torch; print(torch.cuda.is_available())"
# Should also return True (GPU ME) or False (CPU ME)
python -c "import MinkowskiEngine as ME; from MinkowskiEngineBackend._C import is_cuda_available; print(is_cuda_available())"
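The checks listed so far (Python 3.10, 40GB + 20GB of free disk space, importable packages, CUDA visible to PyTorch) can be gathered into one script. This is only a sketch under stated assumptions: `check_environment` is a hypothetical helper, and the 60GB threshold is simply the sum quoted in the prerequisites.

```python
import importlib.util
import shutil
import sys


def check_environment(path=".", min_free_gb=60):
    """Collect the pre-clone checks from this guide into one report (dict of bools)."""
    report = {}
    # The guide pins Python 3.10 (3.11 is reported to crash).
    report["python_3_10"] = sys.version_info[:2] == (3, 10)
    # 40GB (dataset) + 20GB (programs) of free space on the drive holding `path`.
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    report["disk_ok"] = free_gb >= min_free_gb
    # Only test whether the packages can be found; nothing is imported here.
    for name in ("torch", "spconv", "MinkowskiEngine"):
        report[name] = importlib.util.find_spec(name) is not None
    # The CUDA check is attempted only when torch is installed.
    if report["torch"]:
        import torch
        report["torch_cuda"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key:16s} {'OK' if ok else 'MISSING / FAILED'}")
```

It does not replace the two `python -c` commands above; the MinkowskiEngine CUDA check in particular still needs `is_cuda_available()` from the backend.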
Endless CPP debugging:
cd CAGroup3D
python setup.py develop > logs/pcdet.txt
- Additional CUDA ops look fine:
# rotate iou ops
cd CAGroup3D/pcdet/ops/rotated_iou/cuda_op
python setup.py install > ../../../../logs/cuda_ops_rotated_iou.txt
# knn ops
cd ../../knn
python setup.py develop > ../../../logs/cuda_ops_knn.txt
- Use the prepared dataset. `DATA_PATH` can be a full path. Currently placed at `../data/scannet_data/ScanNetV2` and `../data/sunrgbd_data/sunrgbd`.
- Note that `CUDA_VISIBLE_DEVICES=ALL` (omitted) and `num_gpus=1` in this case. Poor Windows PC.
- Also, in `CAGroup3D.yaml`: `BATCH_SIZE_PER_GPU: 1`
- Windows doesn't support bash in this case, and CMD / BAT files are not provided this time.
- Notice the actual process arguments. Also switched to `torchrun`.
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7861 train.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --ckpt_save_interval 1 --extra_tag cagroup3d-win10-scannet-train --fix_random_seed > ../logs/train_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7862 train.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt_save_interval 1 --extra_tag cagroup3d-win10-sunrgbd-train --fix_random_seed > ../logs/train_sunrgbd.txt
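Since no CMD / BAT files are provided, the two commands above can also be assembled from Python. The following is a sketch, not a script shipped with this repo: `build_train_command` and `run_training` are hypothetical names, and only the flags already shown above are used.

```python
import subprocess
from pathlib import Path


def build_train_command(dataset: str, port: int) -> list:
    """Return the torchrun argument list used in this guide for one dataset."""
    return [
        "torchrun", "--nproc_per_node=1",
        f"--rdzv_endpoint=localhost:{port}",
        "train.py", "--launcher", "pytorch",
        "--cfg_file", f"cfgs/{dataset}_models/CAGroup3D.yaml",
        "--ckpt_save_interval", "1",
        "--extra_tag", f"cagroup3d-win10-{dataset}-train",
        "--fix_random_seed",
    ]


def run_training(dataset: str, port: int, log_dir: str = "../logs") -> int:
    """Run from tools/ and redirect stdout to <log_dir>/train_<dataset>.txt."""
    log_path = Path(log_dir) / f"train_{dataset}.txt"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        completed = subprocess.run(build_train_command(dataset, port), stdout=log_file)
    return completed.returncode


if __name__ == "__main__":
    # Print the command only; call run_training("scannet", 7861) to launch it.
    print(" ".join(build_train_command("scannet", 7861)))
```

Passing the arguments as a list avoids shell quoting differences between cmd and bash.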
- TensorBoard (cmd output is messy); the original code used tensorboardX:
#scannet
tensorboard --logdir output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train/tensorboard
#sunrgbd
tensorboard --logdir output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train/tensorboard
- Train from pretrained model (Remember to move the directory and calculate the epoch):
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7861 train.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --pretrained_model ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --epochs 9 --ckpt_save_interval 1 --extra_tag cagroup3d-win10-scannet-train --fix_random_seed > ../logs/train_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7862 train.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --pretrained_model ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --epochs 14 --ckpt_save_interval 1 --extra_tag cagroup3d-win10-sunrgbd-train --fix_random_seed > ../logs/train_sunrgbd.txt
- `ScanNetV2`: Takes around 96 hours for a single epoch. (`BATCH_SIZE=16`)
- `SUNRGBD V1`: Takes around 36 hours for a single epoch. (`BATCH_SIZE=16`)
- `BATCH_SIZE`: Has no effect. Keep waiting.
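"Calculate the epoch" in the pretrained-model step means that `--epochs` is the checkpoint's epoch plus the number of extra epochs to train (8 + 1 = 9 for ScanNetV2, 13 + 1 = 14 for SUN RGB-D in the commands above). A small sketch, assuming checkpoints keep the `checkpoint_epoch_<N>.pth` naming seen in those commands (`epochs_for_resume` is a hypothetical helper):

```python
import re
from pathlib import PurePath


def epochs_for_resume(ckpt_path: str, extra_epochs: int = 1) -> int:
    """Return the value for --epochs when training `extra_epochs` more epochs."""
    name = PurePath(ckpt_path).name
    match = re.fullmatch(r"checkpoint_epoch_(\d+)\.pth", name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    return int(match.group(1)) + extra_epochs


if __name__ == "__main__":
    ckpt = "../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth"
    print("--epochs", epochs_for_resume(ckpt))
```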
- Although the CPU usage is not very intense, do not run both evals at the same time. You may crash the OS (kernel) and get the scary GPU error code 43. Disabling and then re-enabling the GPU driver will bring it back.
cd tools/
#scannet
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7863 test.py --launcher pytorch --cfg_file cfgs/scannet_models/CAGroup3D.yaml --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train/ckpt/checkpoint_epoch_1.pth --extra_tag cagroup3d-win10-scannet-eval > ../logs/eval_scannet.txt
#sunrgbd
torchrun --nproc_per_node=1 --rdzv_endpoint=localhost:7864 test.py --launcher pytorch --cfg_file cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_13.pth --extra_tag cagroup3d-win10-sunrgbd-eval > ../logs/eval_sunrgbd.txt
- `ScanNetV2`: Takes around 2 hours.
- `SUNRGBD V1`: Takes around 5 hours.
- `args_workers`: No obvious effect. Keep waiting.
- Including our "epoch1" result, the pretrained model from the original author (to validate our modified code), and potentially our "finetuned model" (`e8+1`, `e12+1`).
Task | scannet-e1 | sunrgbd-e1 | scannet-e8 | sunrgbd-e12 | scannet-e9 | sunrgbd-e13 |
---|---|---|---|---|---|---|
Huggingface | cagroup3d-win10-scannet | cagroup3d-win10-sunrgbd | Main repo | Main repo | cagroup3d-win10-scannet | cagroup3d-win10-sunrgbd |
mAP_0.25 | 2.6154 | 4.3875 | 74.0403 | 65.9022 | 71.2267 | 65.8974 |
mAP_0.50 | 0.1057 | 0.7867 | 61.2493 | 47.9277 | 56.7902 | 48.2091 |
mAR_0.25 | 8.0527 | 7.8397 | 89.6589 | 93.2833 | 89.8917 | 93.6769 |
mAR_0.50 | 0.7545 | 2.0583 | 76.1650 | 67.8665 | 73.8451 | 68.0411 |
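To read the table as "finetuned minus pretrained", the differences can be computed directly from the numbers above. A short sketch; the values are copied from the table, and the `delta` helper is only illustrative:

```python
METRICS = ("mAP_0.25", "mAP_0.50", "mAR_0.25", "mAR_0.50")

# Values copied from the table: pretrained (e8, e12) vs. one extra epoch (e9, e13).
RESULTS = {
    "scannet-e8": (74.0403, 61.2493, 89.6589, 76.1650),
    "scannet-e9": (71.2267, 56.7902, 89.8917, 73.8451),
    "sunrgbd-e12": (65.9022, 47.9277, 93.2833, 67.8665),
    "sunrgbd-e13": (65.8974, 48.2091, 93.6769, 68.0411),
}


def delta(finetuned: str, baseline: str) -> dict:
    """Per-metric difference (finetuned - baseline), rounded to 4 decimals."""
    pairs = zip(METRICS, RESULTS[finetuned], RESULTS[baseline])
    return {metric: round(f - b, 4) for metric, f, b in pairs}


if __name__ == "__main__":
    print("scannet:", delta("scannet-e9", "scannet-e8"))
    print("sunrgbd:", delta("sunrgbd-e13", "sunrgbd-e12"))
```

A negative value means the extra epoch scored below the original author's checkpoint on that metric.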
- No explanation from the original repo. `demo.py` was rewritten from `test.py` (it was first expected to align with the provided one).
- Draw 10 random scenes from the dataset (`scannet` = 312, `sunrgbd` = 5050).
- A detection box is drawn if its prediction score exceeds `draw_scores` or it is the best prediction.
- Color scale: HSL across class labels. Sequence aligned with `CLASS_NAMES`.
- Open3D (plan a) is expected for visualisation. Mayavi has issues with view perspectives, although it has 3D labels showing prediction scores.
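The box-selection rule and the colour scale described above can be sketched in plain Python. This is an illustration of the rule as stated, not the code in `demo.py`; `select_boxes` and `class_colors` are hypothetical names, and the lightness/saturation values (0.5, 1.0) are assumptions.

```python
import colorsys


def select_boxes(scores, draw_scores):
    """Indices of boxes to draw: score above the threshold, or the top-scoring box."""
    if not scores:
        return []
    best = max(range(len(scores)), key=scores.__getitem__)
    return [i for i, s in enumerate(scores) if s > draw_scores or i == best]


def class_colors(class_names):
    """One RGB colour per class, hue spread evenly in CLASS_NAMES order."""
    n = len(class_names)
    # colorsys uses HLS ordering: (hue, lightness, saturation).
    return {name: colorsys.hls_to_rgb(i / n, 0.5, 1.0) for i, name in enumerate(class_names)}


if __name__ == "__main__":
    print(select_boxes([0.10, 0.30, 0.20], draw_scores=0.5))  # only the best box survives
    print(class_colors(["chair", "table", "bed"]))
```

Keeping the best prediction even below the threshold means every scene shows at least one box.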
cd tools/
#scannet
python demo.py --cfg_file ../tools/cfgs/scannet_models/CAGroup3D.yaml --ckpt ../output/scannet_models/CAGroup3D/cagroup3d-win10-scannet-train-good/ckpt/checkpoint_epoch_8.pth --draw_scores 0.5 --draw_idx 10
#sunrgbd
python demo.py --cfg_file ../tools/cfgs/sunrgbd_models/CAGroup3D.yaml --ckpt ../output/sunrgbd_models/CAGroup3D/cagroup3d-win10-sunrgbd-train-good/ckpt/checkpoint_epoch_12.pth --draw_scores 0.4 --draw_idx 10
- `Left Click`: Rotate from focal point
- `Ctrl` + `Left Click`: Pan
- `Shift`: Rotate per axis
- `PrintScreen`: Print screen (Open3D window) in PNG + save meta file (JSON)
- `Q`: Quit
- See the gallery for details.
- error C2131 on EPS
- error C2131: expression did not evaluate to a constant
- Still C2131:
- 'uint32_t' does not name a type: `#include <cstdint>`, and check `inline int check_rect_cross` in `iou3d_nms_kernel.cu`
- THC/THC.h: No such file or directory. Use ATen instead
- "sys/mman.h": No such file or directory. Installing Cygwin with additional packages (`gcc-core gcc-debuginfo gcc-objc gcc-g++ gdb make`) is not effective, but it can be worked around by using WSL2. Rewrite the code to remove `SharedArray` instead.
- The training backend is switched to GLOO.
- `os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"` is needed.
- ME can't be installed in WSL2.
- `convolution_cpu.cpp:61, assertion (!kernel.is_cuda()) failed. kernel must be CPU`: use `ME.SparseTensor(device="cpu")` globally
from MinkowskiEngineBackend._C import is_cuda_available
me_device = None if is_cuda_available() else "cpu"
x = ME.SparseTensor(coordinates=c, features=f, device=me_device)
Just force everything into CPU. `BATCH_SIZE_PER_GPU` must not be 1.
- CHECK_CUDA failed. Checks skipped. Meanwhile switched to `__device__ inline int check_rect_cross`. Now get memory issue. Make sure ME runs on CPU and pcdet runs on CUDA.
- CUDA error: device-side assert triggered. hint1 hint2 hint3 hint4 hint5. `eval()` on CPU.
- Indexing error revealed. e.g.. Real debug. `knn` force cuda: Done.
- `long` should be `int64_t`. long in Flutter. stackoverflow
- sunrgbd's code coverage is larger than scannet's, while the dataset is 2x smaller. Test with this dataset first. It takes 30-60 mins to crash, but scannet takes 2 hrs.
- `find_unused_parameters=True` is mandatory now. Not sure if we can train with multiple GPUs later on.
- Train from checkpoint. Maybe have some spare time to train a few more EPs. 1 EP should be feasible since we don't need to change code.
- Why can't the model be evaluated? Somehow some raw data is in `ndarray` instead of `tensor`. However, the upside is that it is already on CPU.
- Visualization / play with estimation. There is a `result.pkl` without any explanation via `pickle.dump`, which is insufficient to visualize. Oh no, `demo.py` is another rabbit hole. Remade with `test.py` and it still crashes. There are so many limitations in Open3D.
- Adding GPU support from teammate's great work: his fork from this repo, and his tryhard mod of Minkowski Engine according to this repo. PR-ed. Will test with CPU first.
- TODO Maybe export the detections to TensorBoard also. Open3D for TensorBoard.
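
The GLOO switch noted in the log above follows from NCCL not being available in PyTorch's Windows builds. A minimal sketch of picking the backend by platform; `pick_dist_backend` is a hypothetical helper, not the code in this repo:

```python
import sys


def pick_dist_backend(platform: str = sys.platform) -> str:
    """Use gloo on Windows (no NCCL there); otherwise keep nccl for GPU training."""
    return "gloo" if platform.startswith("win") else "nccl"


if __name__ == "__main__":
    # Typical use (requires torch and the torchrun environment variables):
    #   torch.distributed.init_process_group(backend=pick_dist_backend())
    print(pick_dist_backend())
```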