In my project, I am using PointCNN for a segmentation task. Recently, I ran performance tests with NVIDIA Nsight Systems to identify potential bottlenecks. During these tests, I observed that the KNN kernel consumed approximately 89% of the total inference time, which seems abnormally high.
Below, I have included several screenshots that highlight this performance issue:
- CUDA kernel summary: the KNN kernel took ~89% of the total inference time.
- Single-batch inference analysis: the KNN operation within the `dec3` layer consumed nearly half of the inference time.
- KNN/FPS execution times and input shapes: the tests were conducted with a batch size of 24, where each item consisted of 8192 point samples.
The execution time of the KNN operation grows much faster than linearly as the `k` parameter increases. Below are some example execution times with varying `k` values (same number of input points, but different numbers of neighbors):
| Layer | k  | Execution Time (ms) |
|-------|----|---------------------|
| conv1 | 8  | 13                  |
| dec4  | 32 | 241                 |
| dec3  | 48 | 681                 |
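As a quick sanity check on the growth rate, comparing consecutive rows of the table shows the time increasing much faster than `k` itself:

```python
# Measured (k, time_ms) pairs taken from the table above.
measurements = [(8, 13), (32, 241), (48, 681)]

# Compare the growth in k against the growth in execution time
# between consecutive measurements.
for (k0, t0), (k1, t1) in zip(measurements, measurements[1:]):
    print(f"k x{k1 / k0:.1f} -> time x{t1 / t0:.1f}")
# k x4.0 -> time x18.5
# k x1.5 -> time x2.8
```

This is only a back-of-the-envelope check on the reported numbers, but a 4x increase in `k` costing 18.5x in time suggests more than just the extra comparisons, e.g. register spills to local memory as the per-thread `best_dist`/`best_idx` arrays grow.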
I reviewed the CUDA implementation of KNN and suspect that the main cause of the slowdown is the maintenance of the `best_dist` and `best_idx` arrays.
```cpp
// n_y is the current query point, for which we are going to
// calculate the k nearest neighbors across the n_x points.
// For every input point:
for (int64_t n_x = ptr_x[example_idx]; n_x < ptr_x[example_idx + 1]; n_x++) {
  // ... calculate the distance from n_y to n_x, save it into tmp_dist ...

  // Adjust the best_dist and best_idx arrays on every step;
  // probably the slowest part with increased k values.
  for (int64_t e1 = 0; e1 < k; e1++) {
    if (best_dist[e1] > tmp_dist) {
      for (int64_t e2 = k - 1; e2 > e1; e2--) {
        best_dist[e2] = best_dist[e2 - 1];
        best_idx[e2] = best_idx[e2 - 1];
      }
      best_dist[e1] = tmp_dist;
      best_idx[e1] = n_x;
      break;
    }
  }
}
```
So, I think there are two main issues with the current code:

1. Maintaining the sorted `best_dist`/`best_idx` arrays inside `knn_kernel` takes significant time and is not an efficient pattern for the GPU: every improved candidate triggers an O(k) shift of both arrays.
2. Recomputing distances for each FPS/KNN call seems inefficient; it may make sense to compute them once and reuse them.
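To illustrate the first point, here is a CPU-side Python sketch (not the actual CUDA kernel) contrasting the kernel's insertion-sort-style maintenance with a bounded max-heap, which replaces the O(k) element shifts with O(log k) heap operations; both return the same k smallest distances:

```python
import heapq

def knn_insertion(dists, k):
    """Mimics the kernel's approach: keep a sorted best-k list and
    shift elements down on every improvement (O(k) per insert)."""
    best = [float("inf")] * k
    for d in dists:
        for e1 in range(k):
            if best[e1] > d:
                best[e1 + 1:] = best[e1:-1]  # shift the tail down
                best[e1] = d
                break
    return best

def knn_heap(dists, k):
    """Bounded max-heap of size k: each candidate costs O(log k),
    with no shifting. heapq is a min-heap, so store negated values."""
    heap = []
    for d in dists:
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif -heap[0] > d:  # d beats the current worst of the best-k
            heapq.heapreplace(heap, -d)
    return sorted(-x for x in heap)

dists = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0]
print(knn_insertion(dists, 3))  # [1.0, 2.0, 3.0]
print(knn_heap(dists, 3))       # [1.0, 2.0, 3.0]
```

On a GPU a heap is not automatically faster (warp divergence, dynamically indexed arrays spilling to local memory), so this only shows the algorithmic difference; a cheaper CUDA-side variant of the same idea is to keep the best-k arrays unsorted and only track the current worst entry, replacing it when a better candidate appears.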
Questions
1. Is there a fundamental issue with my implementation or an incorrect usage of the KNN/FPS operations?
2. Would pre-computing the distances between points on the CPU within the data loader be a good option to consider?
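Regarding the second question (precomputing distances in the data loader), one thing worth checking first is the memory footprint: with these input shapes, a full pairwise distance matrix is large. A rough estimate, assuming fp32 and the batch size and point count reported above:

```python
batch_size = 24
num_points = 8192
bytes_per_float = 4  # fp32

# Full pairwise distance matrix per batch item, and per batch.
per_item = num_points * num_points * bytes_per_float
per_batch = batch_size * per_item

print(f"per item:  {per_item / 2**20:.0f} MiB")   # per item:  256 MiB
print(f"per batch: {per_batch / 2**30:.0f} GiB")  # per batch: 6 GiB
```

So precomputing full distance matrices on the CPU would mean roughly 6 GiB of extra host-to-device transfer per batch, which could easily cost more than recomputing distances on the GPU. This is an order-of-magnitude estimate, not a benchmark.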
Implementation Details
Below is the PointCNN implementation used in this project (the model is run through `torch.compile`, excluding the KNN and FPS operations):
XConv implementation:
Environment Details
Thank you for your assistance!