Cuda support and acceleration for octomaps. #257
base: devel
Conversation
…yle for use in CUDA kernel.
…DEPENDENT_CODE flags are used.
…zation is not due to tree but due to CUDA backend initialization that only happens once, getting almost 25 hz with current config.
…nal iteration on device is still faster. Faster than ever before! :)
@saifullah3396 When I look at the details in the code, in the file "CudaOctomapUpdater.cu", I am confused by the function "CudaOctomapUpdater::computeUpdate":
Hi @LZMSTI, it seems you do not have a working CUDA compiler? Do you have nvcc installed? For the tests, I just used the graph2tree binary generated in bin with the original octomap; you can use that to run it on any .scan file. As for the question, I would have to take some time to look at it and see if there is something wrong in the code there, because it does seem wrong at first glance, but it's complicated. I'll let you know if there is a problem there. For now, try building the code with GPU support and running the graph2tree binary on any scan file. Thanks!
@LZMSTI, it seems you're right. I think I made a mistake in the commit; this should be free_hashes_device_. Thanks for letting me know. I'll update that soon.
Hi, just pinging: is this officially supported now?
@dookei I just opened a PR so that it could be reviewed by the maintainer, but since it still hasn't gotten any response, I wouldn't say that it is officially supported. The results are shown for a few tests that I ran, so if you wish to use it or develop it further, you can clone my fork.
@saifullah3396 Thanks for your fast reply. I will give it a try. My current target is an Nvidia Jetson TX2.
@dookei You're welcome. I would love it if you could share the results on the Jetson TX2 as well. Cheers
As @saifullah3396 wrote above, it is best to try out the branch directly from here. More feedback would be great, including further benchmark results. Ideally, there would be a unit test that compares the results for different maps between the CUDA and non-CUDA implementations; they should be exactly equal.
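For what it's worth, a minimal sketch of such a test using the stock octomap API; how the CUDA-built tree is obtained is left as a placeholder, since that updater API is what this PR introduces:

```cpp
#include <cassert>
#include <octomap/octomap.h>

int main() {
  octomap::Pointcloud scan;           // in a real test, load a .graph/.scan file
  scan.push_back(1.0f, 0.5f, 0.2f);   // minimal stand-in data
  octomap::point3d origin(0.0f, 0.0f, 0.0f);

  octomap::OcTree tree_ref(0.1);      // reference map, CPU insertion path
  tree_ref.insertPointCloud(scan, origin);

  octomap::OcTree tree_cuda(0.1);     // in the real test, build this map
  tree_cuda.insertPointCloud(scan, origin);  // ...via the CUDA updater instead

  // OcTreeBaseImpl::operator== compares tree structure and node data,
  // so a correct CUDA backend must produce an exactly equal map.
  assert(tree_ref == tree_cuda);
  return 0;
}
```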
I just now found time to try this branch.
nvcc --version
@dookei Hi, did you make sure to build it with the CUDA_SUPPORT flag set to TRUE in CMake? See the latest commit. Thank you.
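For reference, the configure step would be something along the lines of `cmake -DCUDA_SUPPORT=TRUE ..` followed by a normal build; the flag name is taken from the comment above, so check the branch's CMakeLists.txt for the exact option.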
You are right, I forgot to set it. Now I have compiled and installed it... but it is still not using the GPU. Is there anything I need to set under octomap_server? Thanks!
@dookei that could be it. Can you please run the tests for your built binaries first, without installing them into the /usr folders? You should see some output confirming that your build is using CUDA support. Also, if you're using octomap_server, I believe it must be picking the library up from /usr/lib.
@saifullah3396 how do I do that? Test the binaries first?
This pull request adds CUDA support and acceleration for generating an octomap from a point cloud. Development notes:
The integration has no conflicts with the normal/OpenMP builds and works independently of them.
Tested on an Nvidia Titan X (Pascal) GPU, compute capability 6.1 (sm_61).
The original octomap classes used in CUDA device code, such as OcTreeKey or Point3d, were updated only to make them available on the device, by adding host/device tags to the required functions.
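For illustration, a minimal sketch of the tagging approach described here, on a made-up Point3d-style struct rather than the PR's actual classes:

```cuda
// Sketch of host/device tagging; the real OcTreeKey/Point3d are more involved.
struct Point3 {
  float x, y, z;
  // __host__ __device__ makes the function callable from CPU code
  // and from inside CUDA kernels alike.
  __host__ __device__ Point3(float x_, float y_, float z_)
      : x(x_), y(y_), z(z_) {}
  __host__ __device__ float dot(const Point3& o) const {
    return x * o.x + y * o.y + z * o.z;
  }
};
```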
The files used for CUDA have the .cuh extension.
The CudaAssertion.cuh file defines a basic CUDA error-checking utility.
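The shape of such a utility is fairly standard; a minimal sketch (not necessarily the exact contents of CudaAssertion.cuh):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures abort with file/line context.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&ptr, bytes));
```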
The TArray.cuh and TArray.cu files define a fixed-size array for use in CUDA code.
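Under the assumption that TArray follows the usual pattern for bounded device-side storage, an illustrative sketch (names here are made up, not the PR's):

```cuda
// Illustrative fixed-capacity array usable inside device code: no dynamic
// allocation per push, just a bounded buffer plus an element count.
template <typename T, int Capacity>
struct FixedArray {
  T data[Capacity];
  int size = 0;

  __host__ __device__ bool push_back(const T& v) {
    if (size >= Capacity) return false;  // full: caller must handle overflow
    data[size++] = v;
    return true;
  }
  __host__ __device__ T& operator[](int i) { return data[i]; }
};
```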
CudaOctomapUpdater.cuh defines the CudaOctomapUpdater class, which performs the update for the input point cloud and has the following properties:
Initial tests run on the FR-079 corridor and new_college datasets show that updating large batches of data is considerably faster than updating smaller batches, as the comparison between the two datasets illustrates.
The spherical scan dataset also shows a considerable increase in performance.
The output for the test run on the FR-079 corridor is shown. It can be seen that it takes a large amount of memory (~500 MB) when keeping each ray size at 1000 with 89,000 threads/rays. Still, the performance is improved quite significantly. However, I believe that in normal cases with real-time updates we would not need that much memory, since ray_size = 1000 corresponds to a range of 100 m at 0.1 m resolution.
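As a sanity check on that figure: if each candidate cell along a ray is stored as an OcTreeKey (three uint16_t values, i.e. 6 bytes), then 89,000 rays × 1,000 cells × 6 B ≈ 534 MB, which is consistent with the ~500 MB observed. This assumes the key buffers dominate the footprint; any padding or extra bookkeeping would shift the number somewhat.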
Fixes Cuda acceleration #112