Framework: Switch CUDA AT2 build to be non-UVM and enable tests #13439

sebrowne · 2024-09-10T02:49:05Z

@trilinos/framework

Motivation

Want to align the CUDA AT2 build with the old AutoTester one.

Related Issues

https://sems-atlassian-son.sandia.gov/jira/browse/TRILFRAME-673

This change will also cause the build to start running all of the appropriate tests. If I remember correctly, we had this disabled because the containers were running out of disk space, but we want this enabled for the "real" PR configuration.

Signed-off-by: Samuel E. Browne <[email protected]>

sebrowne · 2024-09-12T17:26:32Z

The CUDA tests look good, with four exceptions, detailed here: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=211376

@trilinos/intrepid2 I show that failing test was set to RUN SERIAL for CUDA builds, I can do that here as well if that's still what we want to do.
@trilinos/panzer that test is on for all other configs, no obvious framework-side issues to me.
@trilinos/rol that test is on for all other configs, no obvious framework-side issues to me.
@trilinos/stratimikos we had that test disabled for our non-CUDA container as well, but again, nothing really obvious from our side.

If any developers from the tagged teams can provide any insight for the four failing tests (and they do fail reliably), it would be much appreciated! I can turn them off, but I wanted to at least do SOME due diligence and see what the community thinks.

CamelliaDPG · 2024-09-12T17:37:34Z

> @trilinos/intrepid2 I show that failing test was set to RUN SERIAL for CUDA builds, I can do that here as well if that's still what we want to do.

Yes, please. The MonolithicExecutable test is one that has a lot of test cases, and some of them are intensive, so that sharing compute resources with other tests can lead to timeouts. We use RUN SERIAL to mitigate.

rppawlo · 2024-09-12T18:18:07Z

@cgcgcg - would you mind taking a look at the panzer/mini-em failure here? Looks to be a linear solver issue similar to what you have fixed in the past.

cgcgcg · 2024-09-12T19:23:46Z

I see this message in the output of the failing Stratimikos and Panzer tests:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x42b363c80
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

For some reason, a couple of tests fail when RDMA support is initialized. I debugged it to the point of disabling the smcuda BTL in OpenMPI. My guess is that either something is wrong with our container build of OpenMPI, or something is different hardware-wise about our new Ampere80 machines (I checked the PCI bus addresses, because a brief Google investigation pointed at them, but they didn't look any worse than on the Volta70 machines).

Signed-off-by: Samuel E. Browne <[email protected]>

masterleinad · 2024-09-17T14:51:28Z

> I see this message in the output of the failing Stratimikos and Panzer tests:

That looks like issues with cudaMallocAsync, see https://kokkos.org/kokkos-core-wiki/known-issues.html?highlight=known+issues#cuda, and https://kokkosteam.slack.com/archives/C5BGU5NDQ/p1726216998539829.

cgcgcg · 2024-09-17T14:59:55Z

@sebrowne Do we set Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF for our Cuda builds?

sebrowne · 2024-09-17T19:24:41Z

We do not. I did do some debuggery, and that particular error went away when I disabled the smcuda BTL in OpenMPI. I can try that option as well (the test I was using to debug still failed with my workaround, but with a NaN in Belos instead of the CUDA traceback).
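For readers who want to reproduce the two workarounds discussed here, the following is a minimal sketch, not the configuration used in this PR. It assumes the smcuda BTL is excluded through Open MPI's `OMPI_MCA_btl` environment variable (the thread does not say which mechanism was actually used) and that the Kokkos option is passed as a CMake cache argument. The helper names `mpi_env_without_smcuda` and `kokkos_configure_args` are hypothetical.

```python
import os


def mpi_env_without_smcuda(base_env=None):
    """Return a copy of the environment with Open MPI's smcuda BTL excluded.

    Assumption: the exclusion is done via an MCA environment variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI reads MCA parameters from OMPI_MCA_<name>; a leading '^'
    # means "use every available component except the ones listed".
    env["OMPI_MCA_btl"] = "^smcuda"
    return env


def kokkos_configure_args(extra_args=()):
    """Return CMake arguments with the Kokkos option from this thread set to OFF."""
    return ["-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF", *extra_args]


if __name__ == "__main__":
    print(mpi_env_without_smcuda({})["OMPI_MCA_btl"])
    print(" ".join(kokkos_configure_args()))
```

The returned environment could be passed to `subprocess.run(..., env=...)` when launching a single failing test, which keeps the workaround local to that run rather than changing the container.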

sebrowne · 2024-09-18T13:39:22Z

New results with the Kokkos option, my disable of the smcuda BTL, and running the Intrepid2 test serially: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=217579

Seeing the same tests fail (except for the Intrepid2 one), but perhaps in a more straightforward way? I see NaN errors from Belos.

cgcgcg · 2024-09-18T13:50:02Z

@sebrowne Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong.

jhux2 · 2024-09-18T15:34:55Z

> Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong.

FYI, the option -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF resolved the same warning I was seeing on the CEE LAN and, more importantly, it led to a 2x speedup.

masterleinad · 2024-09-18T16:03:16Z

We decided to make -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF the default again in Kokkos (but that will likely only be visible in Trilinos with the next release).

cgcgcg · 2024-09-19T16:17:49Z

I just went down the rabbit hole. The issue in this test
https://sems-cdash-son.sandia.gov/cdash/test/8036828
is caused by a bug in OpenBLAS before 0.3.27 (see OpenMathLib/OpenBLAS@7e9b1c0). Essentially, a double variable was wrongly interpreted as a complex double, and its imaginary part was then used in a division; if the imaginary part happened to be zero, this resulted in a NaN.
Updating the container to a more recent version of OpenBLAS should fix that test.
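A container build could guard against the affected releases with a small version check. The sketch below is illustrative only: it assumes the OpenBLAS version is available as a plain dotted string such as "0.3.26", and the names `parse_version`, `FIRST_FIXED`, and `openblas_has_fix` are hypothetical.

```python
# First release that, per the comment above, no longer has the bug.
FIRST_FIXED = (0, 3, 27)


def parse_version(text):
    """Turn a dotted numeric version string like '0.3.26' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def openblas_has_fix(version_text):
    """Return True if the given OpenBLAS version is 0.3.27 or newer."""
    # Tuples compare element by element, so 0.3.9 sorts before 0.3.27,
    # which a plain string comparison would get wrong.
    return parse_version(version_text) >= FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("0.3.9", "0.3.26", "0.3.27"):
        print(candidate, openblas_has_fix(candidate))
```

Version strings with suffixes (for example a development tag) would need extra handling that this sketch does not attempt.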

sebrowne · 2024-09-19T18:07:01Z

Perfect, thank you so much! I'll get on fixing it ASAP.

sebrowne added the labels AT2-SpecialApprove ("(Beta) Special approval label for AT2.") and AT: WIP ("Causes the PR autotester to not test the PR. (Remove to allow testing to occur.)") on Sep 10, 2024

sebrowne force-pushed the cuda-at2 branch from fab58d5 to b0b267c on September 10, 2024 02:50

sebrowne added and removed the AT2-SpecialApprove label on Sep 10, 2024

Enable tests for CUDA AT2 config

86d144c

Signed-off-by: Samuel E. Browne <[email protected]>

sebrowne requested a review from a team as a code owner September 10, 2024 03:01

sebrowne added and removed the AT2-SpecialApprove label on Sep 10, 2024

sebrowne added 2 commits September 11, 2024 07:32

Reduce build/test parallelism to align with resources

443ddc0

Signed-off-by: Samuel E. Browne <[email protected]>

Disable X11 for container config

757aaf3

We disable X11 everywhere else, so be consistent here. In the future, we probably want to enable this, since we DO have X11 in the containers, but getting that hooked up and working is for another day.

Signed-off-by: Samuel E. Browne <[email protected]>

sebrowne force-pushed the cuda-at2 branch from e788f1d to 757aaf3 on September 11, 2024 13:33

sebrowne added and removed the AT2-SpecialApprove label on Sep 11, 2024

sebrowne added 2 commits September 16, 2024 15:09

Add run-serial-tests to CUDA container configs

4a2d2a2

Signed-off-by: Samuel E. Browne <[email protected]>

sebrowne added and removed the AT2-SpecialApprove label on Sep 16, 2024

Merge branch 'develop' into cuda-at2

7ff26dd

sebrowne added and removed the AT2-SpecialApprove label on Sep 17, 2024

Try recommended Kokkos option

ce78412

Signed-off-by: Samuel E. Browne <[email protected]>

sebrowne added and removed the AT2-SpecialApprove label on Sep 18, 2024
