Incorrect results with asynchronous partitioning on CUDA devices and StarPU 1.4 #37
So I could write an MWE performing a GEMM on CPU or GPU: https://gist.github.com/grisuthedragon/0fa99935086a5945171ef63f185bbcee With
it works. With
it sometimes gives random errors. And with
permanently fails. Also turning off the CPUs lets the code fail:
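For reference, the worker mix used in these runs can be controlled with StarPU's standard environment variables; a minimal sketch (the MWE binary name below is a placeholder, not from the gist):

```shell
# Sketch: selecting StarPU workers via the documented environment
# variables STARPU_NCPU / STARPU_NCUDA. ./mwe_gemm stands in for the
# MWE binary from the gist.
export STARPU_NCPU=0     # turn off all CPU workers
export STARPU_NCUDA=2    # restrict StarPU to two CUDA devices
echo "STARPU_NCPU=$STARPU_NCPU STARPU_NCUDA=$STARPU_NCUDA"
# ./mwe_gemm             # run the MWE with this worker configuration
```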
If it fails, the execution mostly became slow beforehand. The same holds true for GCC 12.
Hello, I tried your MWE, but I'm getting
and indeed the codelet says
Sorry, that was a copy-and-paste error. But changing the call in the Btw., I did not get your error message.
If you configure with
I do get correct results on a 3-GPU machine, with various schedulers.
I installed StarPU via Spack and disabled Here is my environment file for Spack:
The code runs on
I did some additional tests with varying BLAS implementations and got the following results. Intel oneMKL + CUDA 12.2 + GCC 12 + StarPU 1.4.4:
OpenBLAS 0.3.26 + CUDA 12.2 + GCC 12 + StarPU 1.4.4:
@sthibaul Can you give some more details about your environment?
I used this source: @ with this spec:
compiled with
ran with
On CentOS 7.6.1810, with two GPUs, without any error. I tried to add Note: the MKL/OpenBLAS library probably doesn't matter, since you said it was when adding GPUs that you had issues. You can even try with
I looked further into what happens, and I upgraded to CUDA 12.4 to match your environment. I also organized an older system with two P100 cards instead of the two to four A100 cards, and there no error appears. Back on the A100 system I get...
but running with
I get dozens of errors like
The reason seems to be that on the A100 cards, CUTLASS is used in the GEMM operations, while on the P100 it is not.
I updated my installation to StarPU 1.4.6 and CUDA 12.5 and ran the "dgemm" example from Now the following errors appear:
or
I have fixed very related cases yesterday with 3b258cb620de7610f0b6fadaae959f1e173f0e34 ("Fix asynchronous partitioning with data without home node"); could you check against that version?
I tried the dgemm example again on my hardware and still get errors with more than 2 A100 cards, but with a lower probability, it seems. If an error occurs, it looks like:
or
Especially the case of 3 GPUs still seems to be affected by this problem. Regarding my own GEMM code, posted above, it still fails, but in addition to the above error, the following appears as well:
This reported error is very puzzling... Could you try through
I've been running dgemm in a loop with 4 A100s on CUDA 12.0 here; no error for half an hour.
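A sketch of such a soak loop, for anyone wanting to reproduce this kind of test (the example binary path is an assumption; here it defaults to the no-op `true` so the snippet is self-contained):

```shell
# Sketch: re-running a StarPU example repeatedly until it fails.
# CMD is a placeholder for the real dgemm example binary.
CMD=${CMD:-true}
for i in $(seq 1 5); do
    STARPU_NCUDA=4 $CMD || { echo "failed at iteration $i"; break; }
done
echo "done after $i iterations"
```

In a real run, the iteration count would be much larger (or the loop left unbounded) to catch the intermittent failures described above.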
Did you try without On tests as simple as
With the current master, the example dgemm works, but my async partitioned one ends with
using only one GPU. Using 2, 3, or 4 GPUs, the error stays the same.
Any news on this? I did some tests with CUDA 12.6.2 on P100/A100/H100 cards and the problem is still there. |
It seems that during the updates introduced between 1.3 and 1.4, the asynchronous partitioning was broken. Basically, we have code like
We leave the partition submit/unsubmit to the StarPU runtime. The kernels required for computing the tasks are available as CPU and CUDA implementations. We observed the following cases:
StarPU 1.3.11 / CUDA 11.8 / GCC 12
StarPU 1.3.11 / CUDA 12.2 / GCC 12
StarPU 1.4.4 / CUDA 12.2 / GCC 12
The tasks are only GEMM operations from cuBLAS or MKL.
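A minimal sketch of the partition/compute/unpartition pattern described above, assuming StarPU >= 1.3; the codelet name `gemm_cl`, the block filter, and the tile count are illustrative, not taken from the original code (which is not public):

```c
/* Sketch: asynchronous partitioning with submit/unsubmit left to the
 * StarPU runtime, as in the issue description. Requires StarPU and a
 * codelet providing CPU and CUDA GEMM implementations. */
#include <starpu.h>

extern struct starpu_codelet gemm_cl;  /* CPU + CUDA gemm kernels */

void compute(starpu_data_handle_t A, unsigned nparts)
{
    struct starpu_data_filter f = {
        .filter_func = starpu_matrix_filter_block,  /* illustrative filter */
        .nchildren = nparts,
    };
    starpu_data_handle_t sub[nparts];

    /* Plan the partitioning once, then submit it asynchronously;
     * StarPU inserts the partition tasks into the task graph. */
    starpu_data_partition_plan(A, &f, sub);
    starpu_data_partition_submit(A, nparts, sub);

    /* Submit one task per tile; scheduling on CPU or CUDA workers
     * is left to the runtime. */
    for (unsigned i = 0; i < nparts; i++)
        starpu_task_insert(&gemm_cl, STARPU_RW, sub[i], 0);

    /* Asynchronously gather the tiles back into the parent handle. */
    starpu_data_unpartition_submit(A, nparts, sub, STARPU_MAIN_RAM);
    starpu_task_wait_for_all();
    starpu_data_partition_clean(A, nparts, sub);
}
```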
Due to ongoing research, I cannot share the code and have not had time to build an MWE until now. But in general it seems to have something in common with https://gitlab.inria.fr/starpu/starpu/-/issues/43.