cudaStreams for optimizing GPU performance
Related to issue #553 on using multiple CUDA streams for running a simulation on the GPU.
When a GPU backend is available and the cell type has a GPU implementation, all cells in a simulation are grouped into one cell group, and the simulation is run by performing operations on the whole cell group on a single cudaStream. The key part of such a simulation is fvm_lowered_cell_impl’s integrate function. The function contains a loop that is mostly serial (within an iteration and across iterations), with two exceptions: within an iteration, the mechanism current updates can be executed in parallel, and the mechanism state updates can be executed in parallel. We can extract more parallelism if we split the big cell group into smaller ones and run their integrate loops in parallel.
Within an iteration of the integrate loop the following steps are executed: events are delivered and mechanism current contributions are accumulated; the event list and integration step times are updated; samples are taken; the voltage is integrated by assembling and solving a matrix; the mechanism state is integrated; the ion concentration is updated; spike threshold crossings are tested. Assuming 3 mechanism types in a simulation, that is a total of 28 kernel launches per iteration. Most of the execution time is taken up by the matrix_solve kernel and by the current and state update kernels (which take more time as the number of cells, compartments and/or synapses grows). Under the same assumption of 3 mechanism types, 8 of the 28 kernels dominate the execution time. The rest are short kernels (~2 us).
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
We can use cudaStreams to execute kernels concurrently on the GPU when possible.
The first opportunity for concurrent execution is mechanism updates. We assign a stream for every mechanism to execute the current/state updates asynchronously. There were no performance gains from this approach. This is because the current and state updates are kernels whose execution time only becomes significant when there is a large number of cells/compartments/synapses; and when that condition is satisfied, the kernels are able to occupy the entire GPU for their execution. So, launching them in parallel yields no benefit, since the GPU is already fully occupied.
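The occupancy argument above can be illustrated with a toy model. This is only a sketch: the kernel durations and GPU fractions are made-up values, and `serial_time` and `concurrent_lower_bound` are hypothetical helper names, not part of the simulator or of CUDA. The model assumes that a kernel never runs faster than its stand-alone duration and that the GPU cannot do more than one device's worth of work at a time.

```python
from typing import List, Tuple

# Each kernel is (duration_us, gpu_fraction): how long it runs alone and the
# share of the GPU it occupies while running. All numbers are illustrative.
Kernel = Tuple[float, float]


def serial_time(kernels: List[Kernel]) -> float:
    """Time when the kernels run one after another on a single stream."""
    return sum(duration for duration, _ in kernels)


def concurrent_lower_bound(kernels: List[Kernel]) -> float:
    """Best case when every kernel is launched on its own stream.

    The GPU cannot finish before the longest kernel is done, nor before
    the total work (duration x occupied fraction) fits on the device.
    """
    longest = max(duration for duration, _ in kernels)
    total_work = sum(duration * fraction for duration, fraction in kernels)
    return max(longest, total_work)


# Three mechanism updates that each fill the whole GPU (assumed values).
full = [(100.0, 1.0), (80.0, 1.0), (60.0, 1.0)]
# Hypothetical kernels that each occupy only a quarter of the GPU.
light = [(100.0, 0.25), (80.0, 0.25), (60.0, 0.25)]

for name, kernels in (("full occupancy", full), ("low occupancy", light)):
    print(name, serial_time(kernels), concurrent_lower_bound(kernels))
```

In this model, kernels that already fill the GPU gain nothing from separate streams, because the concurrent lower bound equals the serial time. Only kernels that leave most of the device idle can overlap profitably.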
The second opportunity is to overlap the execution of the current/state update kernels (and others) with the execution of the matrix_solve kernel, which takes a long time to complete but uses very few resources on the GPU. For that, we assigned each CPU thread its own cudaStream and executed a section of the big cell group on each CPU thread.
We tested this implementation by splitting the big cell group into 4 cell groups, executing each simulation on a separate cudaStream. We evaluated the performance using the miniapp with different configurations. This implementation did not show performance improvements in any configuration.
To understand the performance, we used nvvp and nvprof to profile the GPU performance. We ran the miniapp with default parameters using only the default stream (1 big cell group), and using 4 cudaStreams (4 cell groups). We note that in this case the execution time is dominated by only the solve_matrix kernel (~80% of execution time). We don’t show a case where a significant part of the execution is dominated by the current and state updates as the analysis is similar to the provided case.
*miniapp profile using only the default stream*
Using only the default stream we can measure the time for 1 iteration of the integrate loop:
Kernels | Time (us) |
---|---|
solve_matrix (1 kernel) | 878.901 |
All others (27 kernels) | 284.467 |
Total | 1163.154 |
*miniapp profile using 4 cudaStreams - complete view*
Using 4 cudaStreams we can measure the average time per stream for 1 iteration of the integrate loop:
Kernels | Time (us) |
---|---|
solve_matrix (1 kernel) | 888.207 |
All others (27 kernels) | 530.886 |
Total | 1419.093 |
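To compare the two profiles directly, the per-iteration times from both tables can be turned into slowdown factors. The snippet below is only arithmetic on the reported numbers. The dictionary and function names are chosen for illustration.

```python
# Per-iteration times in microseconds, copied from the two tables above.
default_stream = {"solve_matrix": 878.901, "others": 284.467, "total": 1163.154}
four_streams = {"solve_matrix": 888.207, "others": 530.886, "total": 1419.093}


def slowdown(key: str) -> float:
    """Ratio of the 4-stream time to the default-stream time for one row."""
    return four_streams[key] / default_stream[key]


for key in default_stream:
    print(f"{key:>12}: x{slowdown(key):.2f}")
```

The solve_matrix time is essentially unchanged, so the increase in total time per iteration comes from the other 27 kernels.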
We can see that the time to execute the other 27 kernels almost doubles per iteration, on average, for each cudaStream, even though some of the kernels execute in parallel on the GPU. A closer look at the per-stream kernel view shows that the kernels become sparse, with growing gaps between them, just before the start of a long kernel.
*miniapp profile using 4 cudaStreams - per stream view*
Upon closer inspection, we can see that this happens when kernel execution catches up with kernel launching, i.e. the GPU runs out of queued kernels to execute and has to stall waiting for new ones. This is an effect of having many small kernels with short execution times (1-15 us). The problem is exacerbated by the fact that kernel launches take longer on a multithreaded system (~45 us when launching from multiple threads vs. ~7 us when launching from a single thread).
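The stall described above can be reproduced qualitatively with a small queue model of a single stream. This is a sketch under strong assumptions, not a measurement: the host issues one launch every `launch_us` and never waits for the GPU, kernels in the stream run strictly in order, and the short kernels all take an assumed 5 us. `simulate_stream` is a hypothetical helper written for this illustration. The model ignores host synchronization and any interaction between streams.

```python
from typing import List, Tuple


def simulate_stream(durations_us: List[float], launch_us: float) -> Tuple[float, float]:
    """Return (finish_time_us, gpu_idle_us) for one in-order stream.

    A kernel starts once its launch has completed on the host and the
    previous kernel in the stream has finished on the GPU.
    """
    gpu_free = 0.0  # time at which the GPU finishes the previous kernel
    idle = 0.0      # total time the GPU spends waiting for a launch
    for i, duration in enumerate(durations_us):
        launched = (i + 1) * launch_us  # host-side completion of launch i
        start = max(launched, gpu_free)
        idle += start - gpu_free
        gpu_free = start + duration
    return gpu_free, idle


# One iteration: a long solve kernel followed by 27 short kernels.
# 880 us is close to the profiled solve_matrix time; 5 us is an assumed
# value inside the 1-15 us range quoted above.
iteration = [880.0] + [5.0] * 27
kernels = iteration * 3  # three iterations of the integrate loop

fast_finish, fast_idle = simulate_stream(kernels, launch_us=7.0)
slow_finish, slow_idle = simulate_stream(kernels, launch_us=45.0)

print(f"launch  7 us: finish {fast_finish:.0f} us, GPU idle {fast_idle:.0f} us")
print(f"launch 45 us: finish {slow_finish:.0f} us, GPU idle {slow_idle:.0f} us")
```

With the 7 us launch time the host stays ahead of the GPU, so the queue never empties. With the 45 us launch time the launches queued during the long kernel run out partway through the short kernels, and the GPU then waits on each remaining launch until the next long kernel starts.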
*kernel launch to kernel execute delay*
In addition, we notice that even when enough kernels are queued up, the time between finishing one kernel and starting the next is 4-5x longer when using multiple cudaStreams than when using a single stream.
Unfortunately, these cases are not well documented or explained. A post on the NVIDIA developer forums describes similar observations but was not sufficiently answered: "Overlapping kernel computing with stream per (CPU) thread, slow kernel launches".