
Cuda support and acceleration for octomaps. #257

Open · wants to merge 32 commits into base: devel

Conversation
Conversation

@saifullah3396 commented Oct 25, 2019

This pull request adds CUDA support and acceleration for generating an octomap from a point cloud. Development notes:

  • The integration has no conflicts with the normal/OpenMP build and works independently.

  • Tested on an Nvidia Titan X (Pascal) GPU, compute capability / SM 6.1.

  • The original octomap classes used in CUDA device code, such as OcTreeKey or Point3d, were only updated to be made available on the device by adding host/device tags to the required functions.

  • The files used for CUDA have the .cuh extension.

  • The CudaAssertion.cuh file defines a basic CUDA error-checking utility.

  • The TArray.cuh and TArray.cu files define a fixed-size array for use in CUDA code.

  • CudaOctomapUpdater.cuh defines the CudaOctomapUpdater class, which performs the update for an input point cloud and has the following properties:

    • The class pre-allocates the memory for the data used in CUDA, so the scan size must stay the same in each iteration.
    • At the moment, the total number of threads assigned for the update equals the scan size. The block size is kept at 256 and the number of blocks is allocated accordingly.
    • For each thread, a ray (a fixed-size array of OcTreeKeys) is pre-allocated in device memory. This part can take a large amount of memory depending on the array size and resolution. Currently each ray is sized as max_range / resolution.
    • Two arrays (A & B) storing whether a certain key has already been updated are allocated for the free and occupied cases and are indexed by the hashes of the keys.
    • Two more arrays (C & D) containing the keys and hashes themselves are allocated for the free and occupied cases.
    • The rays are computed the same way as in the original octomap computeRayKeys() function; the function is only modified to run on the device.
    • Each thread works on a separate ray and iterates over the elements of the ray to update the arrays A, B, C and D simultaneously.
    • On each ray iteration, the hash of a key is computed as OcTreeKey::KeyHash{} % array size.
    • On each element update, an atomicCAS operation compares and swaps on A[hash] if A[hash] == 0. This ensures that each hash slot is handled only once, and lets us append the hash and its corresponding key to C whenever the operation returns 0 (see the sketch after this list).
    • B and D are updated in the same fashion.
    • The arrays C and D then directly contain the keys for the free and occupied cells, which are updated in the tree as usual.
    • After finishing one update cycle, computeUpdateReset is called to reset all elements of A and C to 0.
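
A minimal sketch of the per-thread dedup step described above, under stated assumptions: Key3 stands in for octomap::OcTreeKey, the hash constants are illustrative, and flags, keys_out, hashes_out and counter are hypothetical names playing the roles of A (or B), C, D and an output cursor. This is an illustration of the atomicCAS pattern, not the PR's exact code.

#include <cuda_runtime.h>
#include <cstdint>

struct Key3 { uint16_t k[3]; };  // stand-in for octomap::OcTreeKey

__device__ unsigned keyHash(const Key3& key) {
  // Illustrative hash in the spirit of OcTreeKey::KeyHash.
  return static_cast<unsigned>(key.k[0])
       + 1447u * key.k[1]
       + 345637u * key.k[2];
}

// One thread per ray: walk the pre-computed ray and record each cell once.
__global__ void markCells(const Key3* rays, const int* ray_lengths,
                          int num_rays, int max_ray_len, int table_size,
                          int* flags,          // role of A/B: 0 = unseen
                          Key3* keys_out,      // role of C/D: unique keys
                          unsigned* hashes_out,
                          int* counter) {      // output cursor
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_rays) return;
  const Key3* ray = rays + tid * max_ray_len;
  for (int i = 0; i < ray_lengths[tid]; ++i) {
    unsigned h = keyHash(ray[i]) % table_size;
    // Compare-and-swap: only the first thread to touch flags[h] sees 0,
    // so each hash bucket is recorded exactly once per update cycle.
    if (atomicCAS(&flags[h], 0, 1) == 0) {
      int slot = atomicAdd(counter, 1);  // reserve a slot in the output
      keys_out[slot] = ray[i];
      hashes_out[slot] = h;
    }
  }
}

// Host-side launch (sketch), matching the 256-thread blocks described above:
//   int blocks = (num_rays + 255) / 256;
//   markCells<<<blocks, 256>>>(rays, ray_lengths, num_rays, max_ray_len,
//                              table_size, flags, keys_out, hashes_out, counter);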
  • Initial tests run on FR-079 corridor and new_college show that updating large batches of data is considerably faster than updating small batches, as seen across these two datasets.

  • The spherical scan dataset also shows a considerable increase in performance.

  • The output for the test run on FR-079 corridor is shown below. It takes a large amount of memory (~500 MB) when each ray is sized at 1000 with 89000 threads/rays; still, the performance is improved quite significantly. However, I believe that in normal cases with real-time updates we would not need that much memory, since ray_size = 1000 corresponds to a 100 m range at 0.1 m resolution.

Reading Graph file
===========================

 Data points in graph: 5903426

Creating tree
===========================
ray_size_:1000
Setting up cuda updater for the scan size:89276
scan_size_:89276
Total number of threads: 89280
Total number of blocks: 348
Threads per block: 256
Memory used by rays:537108480 bytes. 
(1/66) ... (66/66)

Done building tree.

time to insert scans: 1.83582 sec
time to insert 100.000 points took: 0.0310976 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 205546 nodes (153635 leafs). 86709 nodes thresholded, 118837 other
Memory: 6611360 byte (6.30508 MB)
Full grid: 165327480 byte (157.669 MB)
Size: 46.6 x 36.5 x 24.3 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 163058 nodes (116458 leafs). 163058 nodes thresholded, 0 other
Memory: 5591648 byte (5.33261 MB)
Full grid: 165327480 byte (157.669 MB)
Size: 46.6 x 36.5 x 24.3 m^3


Writing tree files
===========================
Full Octree (pruned) written to fr.bt.ot
Full Octree (max.likelihood, pruned) written to fr.bt_ml.ot
Bonsai tree written to fr.bt
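
As a back-of-envelope cross-check of the ray-memory figure in the log above (assuming sizeof(OcTreeKey) = 6 bytes, i.e. three uint16_t; attributing the remainder to per-ray TArray bookkeeping is an assumption, not confirmed by the PR):

#include <cstdio>

int main() {
  const long long rays     = 89280;      // threads/rays, from the log
  const long long ray_len  = 1000;       // ray_size_, from the log
  const long long key_size = 6;          // three uint16_t per OcTreeKey
  const long long reported = 537108480;  // "Memory used by rays", from the log
  long long keys_only = rays * ray_len * key_size;             // 535680000 bytes
  long long per_ray_overhead = (reported - keys_only) / rays;  // 16 bytes
  printf("keys only: %lld bytes, overhead per ray: %lld bytes\n",
         keys_only, per_ray_overhead);
  return 0;
}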
  • The output for the test run on new_college is shown below. It was only run for 5k scan iterations for this output.
Reading Graph file
===========================

 Data points in graph up to scan 5000: 700063

Creating tree
===========================
ray_size_:600
Setting up cuda updater for the scan size:181
scan_size_:181
Total number of threads: 192
Total number of blocks: 1
Threads per block: 192
Memory used by rays:694272 bytes. 

---

Done building tree.

time to insert scans: 9.43903 sec
time to insert 100.000 points took: 1.34831 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 217504 nodes (139357 leafs). 44642 nodes thresholded, 172862 other
Memory: 8481792 byte (8.08887 MB)
Full grid: 39983350627760 byte (3.81311e+07 MB)
Size: 2887.7 x 2402 x 1441.1 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 213232 nodes (135619 leafs). 213232 nodes thresholded, 0 other
Memory: 8379264 byte (7.99109 MB)
Full grid: 39983350627760 byte (3.81311e+07 MB)
Size: 2887.7 x 2402 x 1441.1 m^3


Writing tree files
===========================
Full Octree (pruned) written to new_college_cuda.bt.ot
Full Octree (max.likelihood, pruned) written to new_college_cuda.bt_ml.ot
Bonsai tree written to new_college_cuda.bt
  • The output for the test run on spherical_scan is shown below. It has 10k data points and the ray size is kept at 100 (~10 m range). The time taken is 7 ms, which is faster than .
Reading Graph file
===========================

 Data points in graph: 10201

Creating tree
===========================
ray_size_:100
Setting up cuda updater for the scan size:10201
scan_size_:10201
Total number of threads: 10208
Total number of blocks: 39
Threads per block: 256
Memory used by rays:6288128 bytes. 
(1/1) 

Done building tree.

time to insert scans: 0.007277 sec
time to insert 100.000 points took: 0.0713361 sec (avg)

Pruned tree (lossless compression)
===========================
Tree size: 2178 nodes (1521 leafs). 1374 nodes thresholded, 804 other
Memory: 77216 byte (0.0736389 MB)
Full grid: 44100 byte (0.042057 MB)
Size: 0.9 x 3.5 x 3.5 m^3

Pruned max-likelihood tree (lossy compression)
===========================
Tree size: 2178 nodes (1521 leafs). 2178 nodes thresholded, 0 other
Memory: 77216 byte (0.0736389 MB)
Full grid: 44100 byte (0.042057 MB)
Size: 0.9 x 3.5 x 3.5 m^3


Writing tree files
===========================
Full Octree (pruned) written to spherical_scan_cuda.ot
Full Octree (max.likelihood, pruned) written to spherical_scan_cuda_ml.ot
Bonsai tree written to spherical_scan_cuda

…zation is not due to tree but due to CUDA backend initialization that only happens once, getting almost 25 hz with current config.
…nal iteration on device is still faster. Faster than ever before! :)
@saifullah3396 changed the title from "Cuda support and acceleration for octomaps." to "Cuda support and acceleration for octomaps. Fixes #112" on Oct 25, 2019
@saifullah3396 changed the title from "Cuda support and acceleration for octomaps. Fixes #112" to "Cuda support and acceleration for octomaps." on Oct 25, 2019
@LZMSTI commented Jan 13, 2020

Hi @saifullah3396, I am confused by the cuda-dev branch (commit id 14d242d).
First, I couldn't compile the project. My system is Ubuntu 16.04 with an Nvidia GTX 1060 and CUDA 10.0; when I run cmake .. it fails with the message in the following image:
[image: cmakeError]

Then, looking at the details in the code, in the file "CudaOctomapUpdater.cu" I am confused by the function "CudaOctomapUpdater::computeUpdate":
[image: problem]
Here you update the tree nodes to false with occupied_hashes_device_, but you also update the tree nodes to true with occupied_hashes_device_. Why? Maybe it should update the tree nodes to false with free_hashes_device_?
Finally, do you have an example test for the project you committed, and does it really accelerate computeRayKeys? Thank you very much!

@saifullah3396 (Author)

Hi @LZMSTI, it seems you do not have a working compiler for CUDA. Do you have nvcc installed? For the tests, I just used the graph2tree binary generated in bin with the original octomap; you can try running it on any .scan file. As for the question, I would have to take some time to look at it and see if there is something wrong in the code there, because it does seem wrong at first glance, but it's complicated. I'll let you know if there is a problem there. For now, try building the code with GPU support and running the graph2tree binary on any scan file. Thanks!
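
For reference, a typical invocation of the stock octomap graph2tree tool looks like the following (the dataset filename is a placeholder, the stock tool reads .graph files as in the logs above, and the exact flags in this fork may differ; run the binary without arguments to see its usage output):

./bin/graph2tree -i your_dataset.graph -o out.bt -res 0.05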

@saifullah3396 (Author)

@LZMSTI, it seems you're right. I made a mistake in that commit; it should be free_hashes_device_. Thanks for letting me know. I'll update it soon.
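
In sketch form, the fix being described (updateNodes is a hypothetical name; the real call sites are in CudaOctomapUpdater.cu):

// Before (the bug): both update passes consumed the occupied hashes.
//   updateNodes(occupied_hashes_device_, /*occupied=*/false);   // wrong source array
//   updateNodes(occupied_hashes_device_, /*occupied=*/true);
// After: free cells are updated from the free-hash array instead.
//   updateNodes(free_hashes_device_,     /*occupied=*/false);
//   updateNodes(occupied_hashes_device_, /*occupied=*/true);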

@dookei commented May 13, 2020

Hi, just pinging: is this already supported?

@saifullah3396 (Author)

@dookei I just opened this PR so that it could be checked by the maintainers, but since it still hasn't gotten any response, I wouldn't say it is officially supported. The results of a few tests I ran are shown above, though, so if you wish to use it or develop it further, you can clone my fork.

@dookei commented May 13, 2020

@saifullah3396 Thanks for your fast reply. I will give it a try. My current target is an Nvidia Jetson TX2.

@saifullah3396 (Author)

@dookei You're welcome. I would love it if you could share the results on the Jetson TX2 as well. Cheers.

@ahornung (Member)

> Hi, just pinging: is this already supported?

As @saifullah3396 wrote above, it's best to try out the branch directly from here.

More feedback would be great, including further benchmark results. Ideally, there would be a unit test that compares the resulting maps between the CUDA and non-CUDA implementations; they should be exactly equal.
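
A minimal sketch of such a test, assuming the CUDA path can populate a standard octomap::OcTree (the CUDA insert call below is a placeholder for whatever entry point the PR exposes):

#include <octomap/octomap.h>
#include <cassert>

int main() {
  octomap::Pointcloud scan;              // in practice, load a dataset scan
  scan.push_back(1.0f, 0.5f, 0.2f);
  octomap::point3d origin(0.0f, 0.0f, 0.0f);

  octomap::OcTree cpu_tree(0.05);
  cpu_tree.insertPointCloud(scan, origin);   // reference CPU implementation

  octomap::OcTree cuda_tree(0.05);
  // Placeholder: the PR's CUDA-accelerated insert would populate cuda_tree,
  // e.g. through a CudaOctomapUpdater bound to this tree.
  cuda_tree.insertPointCloud(scan, origin);

  assert(cpu_tree == cuda_tree);         // the two maps should match exactly
  return 0;
}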

@dookei commented Oct 20, 2020

I just found time to try this branch. I compiled and installed it without any issues directly from this branch. But... it's not running on CUDA/GPU. What might I be missing? No GPU memory is being used either.
I am currently testing it under ROS Noetic with a GeForce RTX 2060.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@saifullah3396 (Author)

@dookei Hi, did you make sure to build it with the CUDA_SUPPORT flag set to TRUE in cmake? See the latest commit. Thank you.
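
Presumably something like the following, assuming CUDA_SUPPORT is a plain CMake cache option (the flag name is taken from the comment above):

cmake .. -DCUDA_SUPPORT=TRUE
make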

@dookei commented Oct 20, 2020

You are right, I forgot to set it. Now I have compiled and installed... but it's still not using the GPU. Is there anything I need to set under octomap-server? Thanks!
PS: the lib is being copied to the /usr/local/ path, but there is also a lib under /usr/lib which doesn't look like the same version. Could that be it?

@saifullah3396 (Author)

@dookei That could be it. Can you please run the tests on your built binaries first, without installing them into the /usr folders? You should see some output confirming that your build is using CUDA support. Also, if you're using octomap_server, I believe it must be picking the library up from /usr/lib.

@dookei commented Oct 21, 2020

@saifullah3396 How do I do that? Test the binaries first?

Successfully merging this pull request may close the linked issue: Cuda acceleration.
4 participants