Drastic performance degradation when switching from StarPU-1.3.11 to StarPU-1.4.4 on a GPU node #33
Comments
Dear StarPU team, I think I figured out the reason for the 10x performance drop of my application. I disabled the kernels and printed the bus stats for the 1.3.11 and 1.4.4 versions of StarPU. StarPU-1.3.11:
StarPU-1.4.4:
For some reason DMDAR and the other DM** schedulers in StarPU-1.4.4 send nearly twice as much data. And if I specifically look at the slowest part, namely the PCI-express connection between the CPU and the GPUs, the new 1.4.4 version sends 65 times more data than the old 1.3.11 version. Could you please advise whether there is a way in StarPU-1.4.4 to bring this data transfer overhead back down to the 1.3.11 level? I believe there is something wrong with the memory manager. P.S. Enabling the CUDA memory map leads to the following error: P.P.S. Using STARPU_REDUX leads to another, similar error. It seems the memory manager is buggy in StarPU-1.4.4.
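For reference, a minimal sketch of how such bus statistics can be collected from a StarPU program, assuming a standard installation with STARPU_SCHED=dmdar and STARPU_BUS_STATS=1 set in the environment; the explicit summary call below comes from the profiling interface and prints the same per-link summary:

```c
/* Hedged sketch: how the bus statistics quoted in this thread can be
 * obtained. Assumes a normal StarPU program run with STARPU_SCHED=dmdar
 * and STARPU_BUS_STATS=1 in the environment. */
#include <starpu.h>

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;

    /* ... register data and submit the application tasks here ... */

    starpu_task_wait_for_all();

    /* Print the amount of data moved on each bus (CPU<->GPU, GPU<->GPU). */
    starpu_profiling_bus_helper_display_summary();

    starpu_shutdown();
    return 0;
}
```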
This is unexpected of course :) Particularly since the 1.3 series introduces heuristics which are precisely meant to improve the overall flow of data. AIUI, the involved matrices can completely fit in even just one GPU? Could you also post results with starpu 1.3.0, to make sure whether it's the 1.2->1.3 development that introduced the first regression, or possibly some backports from the 1.4.x series to the 1.3.x series? Could you also post the output of
Ideally, if you could provide your testcase under an LGPL-2.1+ licence, we could integrate it in our testsuite, and with simulation support we could add non-regression checks.
That's not precise enough for us to be able to act :) |
And another update. This time I tried another application, NNTile. I took a look at data transfers and total execution time (reported by the /usr/bin/time utility) for different versions. Total amount of transferred data, sorted in descending order:
As one can see, 1.3.x indeed saves a lot of data transfers. However, the 1.4.4 version brings all those transfers back. It seems some old code was brought back by the 1.4.x release series. The main problem comes from CPU<->GPU transfers, as the 1.4.4 version transfers around 65 times more data through the slow PCI-e bus than the 1.3.11 version. Files: here are the more detailed transfer reports provided by the STARPU_BUS_STATS=1 env variable. transfers_starpu_1.2.10_dmdar.txt transfers_starpu_1.3.0_dmdar.txt
I mean: please provide the error message. "similar error" doesn't allow us to have any idea what this is about. Also, again:
Otherwise it's really not surprising that dm* etc. get everything wrong. I don't have easy access to an 8-gpu machine, so I tried with simulation, and got results where 1.4.4 actually does better than 1.3.11 and 1.2.10... So I really need details on how things are going on the machine where you can reproduce the issue. Also, providing us with the .starpu/sampling/bus/ and codelet/45/ files corresponding to the machine would allow me to simulate the exact same architecture, rather than simulating some 8-gpu machine I happened to have access to at some point.
Here are the files: P.S. How can I help you simulate my runs? I compiled StarPU without SimGrid support. The FXT traces weigh more than 1 GB. I do not know whether giving you the contents of codelets/45/ or codelets/44 will help. However, here are the contents of the /bus samplings.
Ok, so you have an nvswitch, which wasn't the case for the machine I was simulating; that can explain why I wasn't seeing the problem.
By providing the information I'm asking :)
Simgrid is only needed for the replay part, not for the calibration part.
We don't need traces :)
Yes, please, to be sure to have the same timings as on your machine. |
There's one odd thing here compared to the others: CUDA 0 has very low bandwidth, whatever the peer. Is this reproducible when you force bus re-calibration with
I double checked. It remains the same. CUDA 0 has an 11 GB/s connection to the CPU, the others have 13-15 GB/s. With StarPU-1.3.11 the speeds are around 25 GB/s. StarPU-1.4.4 bandwidth (MB/s) and latency (us)...
StarPU-1.3.11 bandwidth (MB/s) and latency (us)...
Looking at latencies of StarPU-1.4.4:
StarPU thinks that CUDA 0 uses the same memory as NUMA 0... Surprise!
Not only that, but the gpu-gpu connections are also not getting the nvswitch speed; that's really odd.
The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed. |
It might be useful to see the config.log output in the 1.4.4 case. |
and I can easily reproduce that here, good |
(will work on it later next week, though, but at least we have a clear culprit here) |
Thank you! I will be on vacation next week, but after that I will prepare backtraces of the initially described failed assertions for StarPU-1.4.4:
By the way, StarPU-1.3.11 gave me a CUDA out-of-memory error with the STARPU_REDUX access mode. Setting
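As an aside on the STARPU_REDUX path mentioned above, here is a minimal sketch of the kind of setup it requires: an "init" codelet that zeroes a per-worker copy of C and a "redux" codelet that accumulates two copies. The CPU-only kernels, names, and the contiguity assumption below are mine, not the reporter's actual NNTile code:

```c
/* Hedged sketch of STARPU_REDUX setup: illustrative names, CPU-only
 * kernels, and tiles assumed contiguous (ld == nx). */
#include <starpu.h>
#include <string.h>

static void init_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    size_t n = (size_t)STARPU_MATRIX_GET_NX(buffers[0]) * STARPU_MATRIX_GET_NY(buffers[0]);
    memset(C, 0, n * sizeof(*C)); /* neutral element of the reduction */
}

static void redux_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    float *dst = (float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const float *src = (const float *)STARPU_MATRIX_GET_PTR(buffers[1]);
    size_t n = (size_t)STARPU_MATRIX_GET_NX(buffers[0]) * STARPU_MATRIX_GET_NY(buffers[0]);
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i]; /* combine two partial accumulators */
}

static struct starpu_codelet init_cl  = { .cpu_funcs = { init_cpu },  .nbuffers = 1, .modes = { STARPU_W } };
static struct starpu_codelet redux_cl = { .cpu_funcs = { redux_cpu }, .nbuffers = 2, .modes = { STARPU_RW, STARPU_R } };

/* Attach the reduction methods before submitting tasks that access the
 * handle of C with STARPU_REDUX. */
void setup_redux(starpu_data_handle_t C_handle)
{
    starpu_data_set_reduction_methods(C_handle, &redux_cl, &init_cl);
}
```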
Compiling StarPU-1.4.4 with the flag --enable-maxnumanodes=1 solves the issue with latencies and bandwidths, bringing the output of STARPU_machine_display in line with version 1.3.11. However, the performance of actual computations is the same as without the flag. The amount of data transferred is still large, as reported in one of the messages above.
Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow. |
We were previously mixing memory node index and raw memory index. The latter includes all devices, including the disabled ones! This was notably making machines with NUMA nodes and an NVSwitch take very wrong bandwidth values for the first GPUs. See github #33 (cherry picked from commit 5a72a632e186896a599c5c7e51857d0422837546)
Thank you! I tried the new commit. It fixes the output of
It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, which might be emphasized on your box.
Another update. This time the hardware server is different (4x Nvidia V100 SXM2). For some strange reason the CUDA workers require around 500 microseconds for any (even an empty) task. Setting the environment variable
It seems that thread binding got broken in the 1.3 series indeed. I backported some fixes from 1.4, which should fix it (judging by the pci bus numbers, in your v100 case the gpus should be driven from numa0, not 1).
The CUDA cost itself is already that order of magnitude, unfortunately.
They probably have the same binding issue, just with much lower overhead.
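For reference, a hedged sketch of how this per-(empty-)task overhead can be measured with a CUDA-only codelet that does nothing; the names and task count are illustrative, not the reporter's benchmark:

```c
/* Hedged sketch: measure average per-task overhead on CUDA workers with a
 * codelet that does nothing. Illustrative only. */
#include <starpu.h>
#include <stdio.h>

static void empty_cuda(void *buffers[], void *cl_arg)
{
    (void)buffers; (void)cl_arg; /* deliberately empty */
}

static struct starpu_codelet empty_cl =
{
    .cuda_funcs = { empty_cuda }, /* CUDA-only, so tasks land on CUDA workers */
    .nbuffers = 0,
};

int main(void)
{
    const int ntasks = 1000;
    if (starpu_init(NULL) != 0)
        return 1;
    double start = starpu_timing_now(); /* microseconds */
    for (int i = 0; i < ntasks; i++)
        starpu_task_insert(&empty_cl, 0);
    starpu_task_wait_for_all();
    double stop = starpu_timing_now();
    printf("average time per empty task: %.1f us\n", (stop - start) / ntasks);
    starpu_shutdown();
    return 0;
}
```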
Ok, there was a typo in starpu-1.3 which didn't pose a problem there, but ended up posing a problem for 1.4, which is why it went unnoticed. This should now be fixed by
Then I'll check the scheduling part.
I tried the new commit in the starpu-1.3 branch and it got even worse, just like in the starpu-1.4.4 case. Take a look at
Ok, with the update it gained the need for the same fix as in 1.4 (
New output of
Looking at the details of the platform xml file, I see that the nvswitch is not detected; do you have libnvidia-ml detected? That shows up in the
I however also need to add a small piece of code to make it known to the perfmodel. In the meantime, you can try to make
The |
The fix mentioned above can also fix that case, because we use the performance prediction for selecting the source node for transfers in |
Before starpu 1.4, we were just using the observed bandwidth to decide where to place the thread driving the gpu, so it might happen that with (mis-)luck, CUDA0 happens to get just a bit more bandwidth from NUMA1. Starting from starpu 1.4 we use the hwloc information, which is much more stable :) |
Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends) |
Ah, sorry, you meant the GPU bandwidth itself. I was thinking about the NVSwitch:
Do you mean that the total internal bandwidth of the NVSwitch doesn't allow a full 250GB/s for each GPU? Ideally that's the bandwidth I'd like to get access to. Possibly we'll just resort to measuring it.
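One hedged idea for a programmatic probe, assuming libnvidia-ml is available: count the active NVLink links per GPU with NVML and combine that with a per-generation per-link bandwidth table. This is an assumption on my side, not something StarPU currently does:

```c
/* Hedged sketch: count active NVLink links per GPU via NVML. Link against
 * -lnvidia-ml; the per-link bandwidth still has to come from a table of
 * NVLink generations. */
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    unsigned int ndev = 0;
    nvmlDeviceGetCount(&ndev);
    for (unsigned int d = 0; d < ndev; d++)
    {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(d, &dev) != NVML_SUCCESS)
            continue;
        unsigned int active = 0;
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; link++)
        {
            nvmlEnableState_t state;
            if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS
                && state == NVML_FEATURE_ENABLED)
                active++;
        }
        printf("GPU %u: %u active NVLink links\n", d, active);
    }
    nvmlShutdown();
    return 0;
}
```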
Turning off
And during configuration:
I clearly see that the library is present at /usr/lib64, but somehow it is not used.
Could you post the whole config.log? |
Surely! I am using a cluster with SLURM. So, I configure and compile on an access node, which lacks CUDA devices. That is probably the reason why
That explains why recompiling on an access node the same code that was previously compiled on a compute node gave totally different results (in one of the posts above).
It seems I have to compile all the prerequisites (
The other issue is that I have
Recompiling everything from source (except cuBLAS) on a compute node leads to a very strange performance of cublasGemmEx on a server with A100 GPUs.
Starpu-1.3 compiled on a host node:
The same device, but performance of 4096x5120 by 5120x5120 matrix multiplications (hash |
It's hard to comment on this without seeing what is happening around, such as with a paje trace. |
Here it is
I explicitly include |
And, for reference, a paje.trace for the access-node-compiled StarPU-1.4 (libnvidia-ml is disabled): host.paje.trace.tar.gz
One thing I notice in the compute-node-built case is that there are a lot of 4µs "overhead" states here and there in the trace on the lower part of the T3* bars (below the CUDA* bars), which represents the state of the thread driving the gpu. These don't show up in the access-node-compiled case. I guess that could be some cuda operation triggered perhaps by the presence of nvidia-ml which for some reason takes a lot of time. Could you post the config.log obtained in the compute-node-built and access-node-built cases so we can see what compilation difference there is? Also, I notice that you have different-but-quite-close data sizes, and a lot of allocating/freeing states. You probably want to round up allocations to e.g. 10% of your data size, so that starpu can reuse data allocations rather than freeing/allocating all the time; that'll avoid a lot of synchronizations. I have just added a faq about it on https://gitlab.inria.fr/starpu/starpu/-/blob/master/doc/doxygen/chapters/starpu_faq/check_list_performance.doxy#L62
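As an illustration of that rounding advice, a minimal sketch; the helper name and the choice of granularity are assumptions, not StarPU API:

```c
/* Hedged sketch: round allocation sizes up so that tiles of nearby sizes
 * (e.g. 4096x5120 vs 5120x5120) fall into the same allocation size class,
 * letting StarPU reuse freed buffers instead of re-allocating. The helper
 * name and rounding granularity are illustrative. */
#include <stddef.h>

static size_t round_up_alloc(size_t nbytes, size_t granularity)
{
    /* Round nbytes up to the next multiple of `granularity`, e.g. ~10% of
     * the largest tile size, and allocate/register tiles with that size. */
    return ((nbytes + granularity - 1) / granularity) * granularity;
}
```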
I have a pipeline of computations. The pipeline operates on tiles of shapes 1, 4096, 5120, 4096x5120, 5120x5120, and 4096x51200. Sizes never change. Shapes 4096x5120 and 5120x5120 are indeed close, but does data allocation reuse require ALL tiles to be of the same shape? That would be strange.
Host (login node):
Compute node:
Adding such a sync only to a single gemm kernel did not change the picture much:
Without the sync:
Yes, performance doubled, but it is still far from the performance of StarPU-1.3 compiled on an access (login) node:
Actually, as you can see, there are only 3 different hashes of the gemm kernel. My tiles are really mostly 4096x5120 and 5120x5120. |
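For reference, a hedged sketch of what "adding such a sync" to a StarPU CUDA gemm codelet can look like, assuming starpu_cublas_init() was called at startup; cublasSgemm stands in for the actual cublasGemmEx call and the size extraction is simplified:

```c
/* Hedged sketch: StarPU CUDA gemm codelet with an explicit stream sync at
 * the end. Simplified stand-in for the reporter's cublasGemmEx kernel. */
#include <starpu.h>
#include <starpu_cublas_v2.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static void gemm_cuda(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const float *A = (const float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const float *B = (const float *)STARPU_MATRIX_GET_PTR(buffers[1]);
    float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[2]);
    int m = (int)STARPU_MATRIX_GET_NX(buffers[0]);
    int k = (int)STARPU_MATRIX_GET_NY(buffers[0]);
    int n = (int)STARPU_MATRIX_GET_NY(buffers[1]);
    int lda = (int)STARPU_MATRIX_GET_LD(buffers[0]);
    int ldb = (int)STARPU_MATRIX_GET_LD(buffers[1]);
    int ldc = (int)STARPU_MATRIX_GET_LD(buffers[2]);
    const float alpha = 1.0f, beta = 1.0f;

    /* The handle returned by StarPU is already attached to this worker's
     * local CUDA stream. */
    cublasSgemm(starpu_cublas_get_local_handle(),
                CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, lda, B, ldb, &beta, C, ldc);

    /* "The sync": wait for the gemm to complete before returning, which is
     * required when the codelet is not flagged STARPU_CUDA_ASYNC. */
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}
```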
Thanks! Do you have |
I wasn't really aiming for a performance increase, but mostly for more stable measurements. The deviation is really large. The
No, but one cannot directly reuse an allocation for a different tile size, so if the global ratios of the different data shapes vary over the course of the workload, one has to free/allocate to cope with the new ratios. That can explain the amount of reallocation. You may want to try to set
Yes, it is nearly always on.
I tried the latest starpu-1.4 commit and confirm the performance model is now in good shape:
Before, it was:
Now we are back to fighting the scheduler, which tries to transfer more data than in the StarPU-1.3 version.
Preliminary tests last week with these environment variables did not bring us any performance improvement. I will give it another try.
At the beginning of execution the prefetch probably fights with eviction, so that'd lose time, but we'd want to fix that at some point. I'm interested in what happens later in the execution, when there are far fewer ready tasks and thus much less prefetching and no fight; there we could hope for much less last-minute write-back.
That is why I wonder in #35 if there is a way to tell StarPU that a given handle can be assumed "dirty" from now on without reallocating resources, as
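For what it's worth, the closest existing call I know of is data invalidation, which discards all cached replicas of a handle so nothing stale is written back; whether this matches the semantics asked for in #35 is my assumption:

```c
/* Hedged sketch: discard the (now meaningless) contents of a handle so
 * StarPU does not write them back or keep transferring them around. */
#include <starpu.h>

void drop_contents(starpu_data_handle_t handle)
{
    /* Asynchronous variant: applies in submission order, after previously
     * submitted tasks that use `handle` have executed. */
    starpu_data_invalidate_submit(handle);
}
```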
Setting these parameters triggered a data race. The trace is attached:
Besides sometimes triggering the watchdog, this change did not help increase performance with StarPU-1.4. As of now, the performance of my app with a single data-parallel track on a single GPU reaches 100 Tflops/s. When I switch to 4 independent data-parallel tracks on 4 GPUs, performance goes up to 360 Tflops/s with StarPU-1.3 but remains at 100 Tflops/s with StarPU-1.4. For some strange reason StarPU-1.4 communicates much more data through the slow CPU-GPU PCI-express bus instead of the fast SXM4 bus. I cannot believe this is only due to the scheduling technique. Maybe there is a double prefetching of the same buffer? Since the issue with NUMA indexing is solved and the performance of StarPU-1.4 is still much lower than that of StarPU-1.3, I would like to continue the search. I could send traces, but they weigh around 1.8 GB each. StarPU-1.4 (commit 159175aee64b7fa89f70b2ad6045d657fff1dc1a of gitlab):
Starpu-1.3 (commit 11699e22f3125723fb475e33797a6dcdaaecb7d7 of gitlab):
The Issue
On a GPU node, when switching from StarPU version 1.3.11 to the 1.4 versions, we experience a strange performance drop. For our new software NNTile it results in a 10x performance drop. Yes, it goes from 100% to only 10%.
Attempting to switch to the master branch (commit 50cf74508 at the Inria gitlab repository) leads to different errors related to data transfers between the CPU and the GPU. We tried some other commits from the master branch and realized that they only work on CPU, and that something strange happens with the memory manager when it goes to GPU nodes. The DARTS scheduler always fails, while the DM and DMDA schedulers fail for some commits (e.g., 50cf74508) and work correctly for others (e.g., 2b8a91fe). I cannot present the output of the master branch experiments right now, as this issue is about the performance degradation of the 1.4 series of StarPU releases.
Although the 10x performance drop happens in our new software, I prepared a simple example that shows the performance for versions 1.2.10, 1.3.11 and 1.4.4. For the simple example, most of the performance drop happened when switching from version 1.2.10 to 1.3.11.
Steps to reproduce
I have implemented a simple test https://github.com/Muxas/starpu_gemm_redux to reproduce the issue. The repo simply implements several chains of matrix multiplications, for i in range from 0 to D-1, which can be simply described with the following C code (the first order of task submissions):
or with the following C code (the other order of task submissions):
Matrices A are of size M-by-K, matrices B are of size K-by-N, and matrices C are of size M-by-N. No transpositions in matrix multiplications. Our results are produced on an HGX node with 8 (eight) Nvidia A100 80GB SXM GPUs. We compiled the code and ran two experimental setups:
M=N=K=1024, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
M=256, N=K=1532, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C (a hedged sketch of such a task submission is shown right below).
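A hedged sketch of the two submission orders and the access-mode choice for C described above; the codelet name gemm_cl, the handle layout and the loop bounds are illustrative assumptions, and the real code lives in the starpu_gemm_redux repository:

```c
/* Hedged sketch of the two task submission orders; illustrative names. */
#include <starpu.h>

extern struct starpu_codelet gemm_cl; /* performs C[d] += A[d][i] * B[d][i] (assumed) */

/* First order: run each chain d to completion before starting the next. */
void submit_first_order(int D, int NB,
                        starpu_data_handle_t A[D][NB],
                        starpu_data_handle_t B[D][NB],
                        starpu_data_handle_t C[D])
{
    for (int d = 0; d < D; d++)
        for (int i = 0; i < NB; i++)
            starpu_task_insert(&gemm_cl,
                               STARPU_R, A[d][i],
                               STARPU_R, B[d][i],
                               /* or STARPU_REDUX, C[d] for the REDUX setup */
                               STARPU_RW | STARPU_COMMUTE, C[d],
                               0);
}

/* Other order: interleave the chains, one multiplication per chain at a time. */
void submit_other_order(int D, int NB,
                        starpu_data_handle_t A[D][NB],
                        starpu_data_handle_t B[D][NB],
                        starpu_data_handle_t C[D])
{
    for (int i = 0; i < NB; i++)
        for (int d = 0; d < D; d++)
            starpu_task_insert(&gemm_cl,
                               STARPU_R, A[d][i],
                               STARPU_R, B[d][i],
                               STARPU_RW | STARPU_COMMUTE, C[d],
                               0);
}
```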
StarPU-1.4.4 behavior
This section presents plots for the StarPU-1.4.4 version. The first plot shows the warmup time (done by the first order of task submission), the time for the first order of task submission, and the time for the other order of task submission with the STARPU_RW|STARPU_COMMUTE access mode for the matrices C and M=N=K=1024:
The second plot shows the same timings but for the STARPU_REDUX access mode for the matrices C:
The third plot shows timings for M=256 and N=K=1532 with the STARPU_RW|STARPU_COMMUTE access mode:
And the last plot in this section (for the STARPU_REDUX access mode):
We see that the dumbest scheduling algorithm, namely eager, outperforms the smarter ones.
StarPU-1.3.11 behavior
This section presents plots for StarPU of version 1.3.11 in the same order as above.
We see that the dumbest scheduling algorithm, namely eager, outperforms the smarter ones.
StarPU-1.2.10 behavior
This section presents plots for StarPU of version 1.2.10 in the same order as above.
Here we see that in the case of the STARPU_RW|STARPU_COMMUTE access mode the smart schedulers DMDA and DMDAR perform nearly perfectly, just like EAGER. The problem with DMDA and DMDAR appears when switching to StarPU version 1.3.11 or 1.4.4.
Configuration
The configure line we used is within the config.log files in the section below.
Configuration result
This is a config file for StarPU-1.2.10:
config-1.2.10.log
This is a config file for StarPU-1.3.11:
config-1.3.11.log
This is a config file for StarPU-1.4.4:
config-1.4.4.log
Distribution
Inria Gitlab repository
Version of StarPU
We used the starpu-1.3.11 and starpu-1.4.4 tags of the Inria GitLab repository
Version of GPU drivers
We use CUDA 12.3, hwloc 2.9.3