
Cuda support and acceleration for octomaps. #257

Open · wants to merge 32 commits into base: devel

Conversation
Conversation

@saifullah3396 commented Oct 25, 2019

This pull request adds CUDA support and acceleration for generating an octomap from a point cloud. Development notes:

  • The integration has no conflicts with the normal/OpenMP build and works independently.

  • Tested on an Nvidia Titan X (Pascal) GPU, compute capability / SM 6.1.

  • The original octomap classes used in CUDA device code, such as OcTreeKey or Point3d, were only updated to be made available on the device by adding host/device tags to the required functions.

  • The files used for CUDA have the .cuh extension.

  • The CudaAssertion.cuh file defines a basic CUDA error-checking utility.

  • The TArray.cuh and TArray.cu files define a fixed-size array for use in CUDA code.

  • CudaOctomapUpdater.cuh defines the CudaOctomapUpdater class, which performs the update for an input point cloud and has the following properties:

    • The class pre-allocates the memory for the data used in CUDA, so the scan size must stay the same in each iteration.
    • At the moment, the total number of threads assigned for the update equals the scan size. The block size is kept at 256 and the number of blocks is allocated accordingly.
    • For each thread, a ray (a fixed-size array of OcTreeKeys) is pre-allocated in device memory. This part can take a large amount of memory depending on the array size and resolution. Currently each ray is sized as max_range / resolution.
    • Two arrays (A & B) storing whether a certain key has already been updated are allocated for the free and occupied cases and are indexed by the hashes of the keys.
    • Two more arrays (C & D) containing the keys and hashes themselves are allocated for the free and occupied cases.
    • The rays are computed the same way as in the original octomap computeRayKeys() function; the function is only modified to run on the device.
    • Each thread works on a separate ray and iterates over the elements of the ray to update the arrays A, B, C and D simultaneously.
    • On each ray iteration, the hash of a key is computed as OcTreeKey::KeyHash{} % array size.
    • On each element update, an atomicCAS operation compares and swaps on A[hash] if A[hash] == 0. This ensures that each hash slot is handled only once, and lets us append the hash and its corresponding key to C whenever the operation returns 0 (see the sketch after this list).
    • B and D are updated in the same fashion.
    • The arrays C and D then directly contain the keys for the free and occupied cells, which are updated in the tree as usual.
    • After finishing one update cycle, computeUpdateReset is called to reset all elements of A and C to 0.
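
A minimal sketch of the per-thread dedup step described above, under stated assumptions: Key3 stands in for octomap::OcTreeKey, the hash constants are illustrative, and flags, keys_out, hashes_out and counter are hypothetical names playing the roles of A (or B), C, D and an output cursor. This is an illustration of the atomicCAS pattern, not the PR's exact code.

#include <cuda_runtime.h>
#include <cstdint>

struct Key3 { uint16_t k[3]; };  // stand-in for octomap::OcTreeKey

__device__ unsigned keyHash(const Key3& key) {
  // Illustrative hash in the spirit of OcTreeKey::KeyHash.
  return static_cast<unsigned>(key.k[0])
       + 1447u * key.k[1]
       + 345637u * key.k[2];
}

// One thread per ray: walk the pre-computed ray and record each cell once.
__global__ void markCells(const Key3* rays, const int* ray_lengths,
                          int num_rays, int max_ray_len, int table_size,
                          int* flags,          // role of A/B: 0 = unseen
                          Key3* keys_out,      // role of C/D: unique keys
                          unsigned* hashes_out,
                          int* counter) {      // output cursor
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_rays) return;
  const Key3* ray = rays + tid * max_ray_len;
  for (int i = 0; i < ray_lengths[tid]; ++i) {
    unsigned h = keyHash(ray[i]) % table_size;
    // Compare-and-swap: only the first thread to touch flags[h] sees 0,
    // so each hash bucket is recorded exactly once per update cycle.
    if (atomicCAS(&flags[h], 0, 1) == 0) {
      int slot = atomicAdd(counter, 1);  // reserve a slot in the output
      keys_out[slot] = ray[i];
      hashes_out[slot] = h;
    }
  }
}

// Host-side launch (sketch), matching the 256-thread blocks described above:
//   int blocks = (num_rays + 255) / 256;
//   markCells<<<blocks, 256>>>(rays, ray_lengths, num_rays, max_ray_len,
//                              table_size, flags, keys_out, hashes_out, counter);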
  • Initial tests run on FR-079 corridor and new_college show that updating large batches of data is considerably faster than updating small batches, as seen across these two datasets.

  • The spherical scan dataset also shows a considerable increase in performance.

  • The output for the test run on FR-079 corridor is shown below. It takes a large amount of memory (~500 MB) when each ray is sized at 1000 with 89000 threads/rays; still, the performance is improved quite significantly. However, I believe that in normal cases with real-time updates we would not need that much memory, since ray_size = 1000 corresponds to a 100 m range at 0.1 m resolution.

Reading Graph file
===========================

 Data points in graph: 5903426

Creating tree
===========================
ray_size_:1000
Setting up cuda updater for the scan size:89276
scan_size_:89276
Total number of threads: 89280
Total number of blocks: 348
Threads per block: 256
Memory used by rays:537108480 bytes. 
(1/66) ... (66/66)

Done building tree.

time to insert scans: 1.83582 sec
time to insert 100.000 points took: 0.0310976 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 205546 nodes (153635 leafs). 86709 nodes thresholded, 118837 other
Memory: 6611360 byte (6.30508 MB)
Full grid: 165327480 byte (157.669 MB)
Size: 46.6 x 36.5 x 24.3 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 163058 nodes (116458 leafs). 163058 nodes thresholded, 0 other
Memory: 5591648 byte (5.33261 MB)
Full grid: 165327480 byte (157.669 MB)
Size: 46.6 x 36.5 x 24.3 m^3


Writing tree files
===========================
Full Octree (pruned) written to fr.bt.ot
Full Octree (max.likelihood, pruned) written to fr.bt_ml.ot
Bonsai tree written to fr.bt
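
As a back-of-envelope cross-check of the ray-memory figure in the log above (assuming sizeof(OcTreeKey) = 6 bytes, i.e. three uint16_t; attributing the remainder to per-ray TArray bookkeeping is an assumption, not confirmed by the PR):

#include <cstdio>

int main() {
  const long long rays     = 89280;      // threads/rays, from the log
  const long long ray_len  = 1000;       // ray_size_, from the log
  const long long key_size = 6;          // three uint16_t per OcTreeKey
  const long long reported = 537108480;  // "Memory used by rays", from the log
  long long keys_only = rays * ray_len * key_size;             // 535680000 bytes
  long long per_ray_overhead = (reported - keys_only) / rays;  // 16 bytes
  printf("keys only: %lld bytes, overhead per ray: %lld bytes\n",
         keys_only, per_ray_overhead);
  return 0;
}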
  • The output for the test run on new_college is shown below. It was only run for 5k scan iterations for this output.
Reading Graph file
===========================

 Data points in graph up to scan 5000: 700063

Creating tree
===========================
ray_size_:600
Setting up cuda updater for the scan size:181
scan_size_:181
Total number of threads: 192
Total number of blocks: 1
Threads per block: 192
Memory used by rays:694272 bytes. 

---

Done building tree.

time to insert scans: 9.43903 sec
time to insert 100.000 points took: 1.34831 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 217504 nodes (139357 leafs). 44642 nodes thresholded, 172862 other
Memory: 8481792 byte (8.08887 MB)
Full grid: 39983350627760 byte (3.81311e+07 MB)
Size: 2887.7 x 2402 x 1441.1 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 213232 nodes (135619 leafs). 213232 nodes thresholded, 0 other
Memory: 8379264 byte (7.99109 MB)
Full grid: 39983350627760 byte (3.81311e+07 MB)
Size: 2887.7 x 2402 x 1441.1 m^3


Writing tree files
===========================
Full Octree (pruned) written to new_college_cuda.bt.ot
Full Octree (max.likelihood, pruned) written to new_college_cuda.bt_ml.ot
Bonsai tree written to new_college_cuda.bt
  • The output for the test run on spherical_scan is shown below. It has 10k data points and the ray size is kept at 100 (~10 m range). The time taken is 7 ms, which is faster than .
Reading Graph file
===========================

 Data points in graph: 10201

Creating tree
===========================
ray_size_:100
Setting up cuda updater for the scan size:10201
scan_size_:10201
Total number of threads: 10208
Total number of blocks: 39
Threads per block: 256
Memory used by rays:6288128 bytes. 
(1/1) 

Done building tree.

time to insert scans: 0.007277 sec
time to insert 100.000 points took: 0.0713361 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 2178 nodes (1521 leafs). 1374 nodes thresholded, 804 other
Memory: 77216 byte (0.0736389 MB)
Full grid: 44100 byte (0.042057 MB)
Size: 0.9 x 3.5 x 3.5 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 2178 nodes (1521 leafs). 2178 nodes thresholded, 0 other
Memory: 77216 byte (0.0736389 MB)
Full grid: 44100 byte (0.042057 MB)
Size: 0.9 x 3.5 x 3.5 m^3


Writing tree files
===========================
Full Octree (pruned) written to spherical_scan_cuda.ot
Full Octree (max.likelihood, pruned) written to spherical_scan_cuda_ml.ot
Bonsai tree written to spherical_scan_cuda

…zation is not due to tree but due to CUDA backend initialization that only happens once, getting almost 25 hz with current config.
…nal iteration on device is still faster. Faster than ever before! :)
@saifullah3396 changed the title from "Cuda support and acceleration for octomaps." to "Cuda support and acceleration for octomaps. Fixes #112" on Oct 25, 2019
@saifullah3396 changed the title from "Cuda support and acceleration for octomaps. Fixes #112" to "Cuda support and acceleration for octomaps." on Oct 25, 2019
@LZMSTI commented Jan 13, 2020

Hi @saifullah3396, I am confused by the cuda-dev branch (commit id 14d242d).
First, I couldn't compile the project. My system is Ubuntu 16.04 with an Nvidia GTX 1060 and CUDA 10.0; when I run cmake .. it fails with the message in the following image:
[image: cmakeError]

Then, looking at the details in the code, in the file "CudaOctomapUpdater.cu" I am confused by the function "CudaOctomapUpdater::computeUpdate":
[image: problem]
Here you update the tree nodes to false with occupied_hashes_device_, but you also update the tree nodes to true with occupied_hashes_device_. Why? Maybe it should update the tree nodes to false with free_hashes_device_?
Finally, do you have an example test for the project you committed, and does it really accelerate computeRayKeys? Thank you very much!

@saifullah3396 (Author)

Hi @LZMSTI, it seems you do not have a working compiler for CUDA. Do you have nvcc installed? For the tests, I just used the graph2tree binary generated in bin with the original octomap; you can try running it on any .scan file. As for the question, I would have to take some time to look at it and see if there is something wrong in the code there, because it does seem wrong at first glance, but it's complicated. I'll let you know if there is a problem there. For now, try building the code with GPU support and running the graph2tree binary on any scan file. Thanks!
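
For reference, a typical invocation of the stock octomap graph2tree tool looks like the following (the dataset filename is a placeholder, the stock tool reads .graph files as in the logs above, and the exact flags in this fork may differ; run the binary without arguments to see its usage output):

./bin/graph2tree -i your_dataset.graph -o out.bt -res 0.05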

@saifullah3396 (Author)

@LZMSTI, it seems you're right. I made a mistake in that commit; it should be free_hashes_device_. Thanks for letting me know. I'll update it soon.
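
In sketch form, the fix being described (updateNodes is a hypothetical name; the real call sites are in CudaOctomapUpdater.cu):

// Before (the bug): both update passes consumed the occupied hashes.
//   updateNodes(occupied_hashes_device_, /*occupied=*/false);   // wrong source array
//   updateNodes(occupied_hashes_device_, /*occupied=*/true);
// After: free cells are updated from the free-hash array instead.
//   updateNodes(free_hashes_device_,     /*occupied=*/false);
//   updateNodes(occupied_hashes_device_, /*occupied=*/true);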

@dookei commented May 13, 2020

Hi, just pinging: is this already supported?

@saifullah3396 (Author)

@dookei I just opened this PR so that it could be checked by the maintainers, but since it still hasn't gotten any response, I wouldn't say it is officially supported. The results of a few tests I ran are shown above, though, so if you wish to use it or develop it further, you can clone my fork.

@dookei commented May 13, 2020

@saifullah3396 Thanks for your fast reply. I will give it a try. My current target is an Nvidia Jetson TX2.

@saifullah3396 (Author)

@dookei You're welcome. I would love it if you could share the results on the Jetson TX2 as well. Cheers.

@ahornung (Member)

> Hi, just pinging: is this already supported?

As @saifullah3396 wrote above, it's best to try out the branch directly from here.

More feedback would be great, including further benchmark results. Ideally, there would be a unit test that compares the resulting maps between the CUDA and non-CUDA implementations; they should be exactly equal.
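
A minimal sketch of such a test, assuming the CUDA path can populate a standard octomap::OcTree (the CUDA insert call below is a placeholder for whatever entry point the PR exposes):

#include <octomap/octomap.h>
#include <cassert>

int main() {
  octomap::Pointcloud scan;              // in practice, load a dataset scan
  scan.push_back(1.0f, 0.5f, 0.2f);
  octomap::point3d origin(0.0f, 0.0f, 0.0f);

  octomap::OcTree cpu_tree(0.05);
  cpu_tree.insertPointCloud(scan, origin);   // reference CPU implementation

  octomap::OcTree cuda_tree(0.05);
  // Placeholder: the PR's CUDA-accelerated insert would populate cuda_tree,
  // e.g. through a CudaOctomapUpdater bound to this tree.
  cuda_tree.insertPointCloud(scan, origin);

  assert(cpu_tree == cuda_tree);         // the two maps should match exactly
  return 0;
}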

@dookei commented Oct 20, 2020

I just found time to try this branch. I compiled and installed it without any issues directly from this branch. But... it's not running on CUDA/GPU. What might I be missing? No GPU memory is being used either.
I am currently testing it under ROS Noetic with a GeForce RTX 2060.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@saifullah3396 (Author)

@dookei Hi, did you make sure to build it with the CUDA_SUPPORT flag set to TRUE in cmake? See the latest commit. Thank you.
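
Presumably something like the following, assuming CUDA_SUPPORT is a plain CMake cache option (the flag name is taken from the comment above):

cmake .. -DCUDA_SUPPORT=TRUE
make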

@dookei commented Oct 20, 2020

You are right, I forgot to set it. Now I have compiled and installed... but it's still not using the GPU. Is there anything I need to set under octomap-server? Thanks!
PS: the lib is being copied to the /usr/local/ path, but there is also a lib under /usr/lib which doesn't look like the same version. Could that be it?

@saifullah3396 (Author)

@dookei That could be it. Can you please run the tests on your built binaries first, without installing them into the /usr folders? You should see some output confirming that your build is using CUDA support. Also, if you're using octomap_server, I believe it must be picking the library up from /usr/lib.

@dookei commented Oct 21, 2020

@saifullah3396 How do I do that? Test the binaries first?

Successfully merging this pull request may close the linked issue: Cuda acceleration.
4 participants