Framework: Switch CUDA AT2 build to be non-UVM and enable tests #13439

Open
wants to merge 8 commits into
base: develop

Conversation

sebrowne
Contributor

@sebrowne sebrowne commented Sep 10, 2024

@trilinos/framework

Motivation

Want to align the CUDA AT2 build with the old AutoTester one.

Related Issues

https://sems-atlassian-son.sandia.gov/jira/browse/TRILFRAME-673

@sebrowne added the AT2-SpecialApprove (Beta) and AT: WIP labels on Sep 10, 2024
Which will also cause it to start running all of the appropriate tests.
If I remember correctly, we had this disabled because the containers
were running out of disk space, but we want this enabled for the "real"
PR configuration.

Signed-off-by: Samuel E. Browne <[email protected]>
@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 10, 2024
Signed-off-by: Samuel E. Browne <[email protected]>
@sebrowne sebrowne requested a review from a team as a code owner September 10, 2024 03:01
@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 10, 2024
We disable X11 everywhere else, so be consistent here.  In the future,
we probably want to enable this, since we DO have X11 in the containers,
but getting that hooked up and working is for another day.

Signed-off-by: Samuel E. Browne <[email protected]>
@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 11, 2024
@sebrowne
Contributor Author

The CUDA tests look good, with four exceptions, detailed here: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=211376

@trilinos/intrepid2 I see that the failing test was set to RUN_SERIAL for CUDA builds. I can do that here as well if that's still what we want to do.
@trilinos/panzer that test is on for all other configs, no obvious framework-side issues to me.
@trilinos/rol that test is on for all other configs, no obvious framework-side issues to me.
@trilinos/stratimikos we had that test disabled for our non-CUDA container as well, but again, nothing really obvious from our side.

If any developers from the tagged teams can provide any insight for the four failing tests (and they do fail reliably), it would be much appreciated! I can turn them off, but I wanted to at least do SOME due diligence and see what the community thinks.

@CamelliaDPG
Contributor

> @trilinos/intrepid2 I see that the failing test was set to RUN_SERIAL for CUDA builds. I can do that here as well if that's still what we want to do.

Yes, please. The MonolithicExecutable test has a lot of test cases, and some of them are intensive, so sharing compute resources with other tests can lead to timeouts. We use RUN_SERIAL to mitigate that.

@rppawlo
Contributor

rppawlo commented Sep 12, 2024

@cgcgcg - would you mind taking a look at the panzer/mini-em failure here? Looks to be a linear solver issue similar to what you have fixed in the past.

@cgcgcg
Contributor

cgcgcg commented Sep 12, 2024

I see this message in the output of the failing Stratimikos and Panzer tests:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x42b363c80
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

For some reason, there are a couple of tests that are failing when RDMA
support is initialized.  I debugged it to the point of disabling the
smcuda BTL in OpenMPI. My guess is that something is wrong with our
container build of OpenMPI, OR there is something different
hardware-wise about our new Ampere80 machines (I checked the PCI bus
addresses because that was something that a brief Google investigation
indicated, but they didn't look any worse than the Volta70 machines).

Signed-off-by: Samuel E. Browne <[email protected]>
@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 16, 2024
@masterleinad
Contributor

> I see this message in the output of the failing Stratimikos and Panzer tests:

That looks like an issue with cudaMallocAsync; see https://kokkos.org/kokkos-core-wiki/known-issues.html?highlight=known+issues#cuda and https://kokkosteam.slack.com/archives/C5BGU5NDQ/p1726216998539829.

@cgcgcg
Contributor

cgcgcg commented Sep 17, 2024

@sebrowne Do we set Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF for our Cuda builds?

@sebrowne
Contributor Author

We do not. I did some debugging, and that particular error went away when I disabled the smcuda BTL in OpenMPI. I can try that option as well (the test I was using to debug still failed after my change, but with a NaN in Belos instead of the CUDA traceback).
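For readers who want to reproduce the smcuda workaround, here is a minimal Python sketch of two common ways to exclude a BTL component in Open MPI: the `--mca btl ^smcuda` command-line form and the `OMPI_MCA_btl` environment variable. The thread does not say which mechanism the container actually uses, so treat this as an assumption; the helper names and the `./test.exe` path are hypothetical.

```python
import os


def mpirun_command(executable, num_ranks, exclude_smcuda=True):
    """Build an mpirun argument list (hypothetical helper, not Trilinos code)."""
    cmd = ["mpirun", "-np", str(num_ranks)]
    if exclude_smcuda:
        # A leading '^' tells Open MPI to exclude the listed BTL components.
        cmd += ["--mca", "btl", "^smcuda"]
    cmd.append(executable)
    return cmd


def env_without_smcuda(base_env=None):
    """Return a copy of the environment with the smcuda BTL excluded."""
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
    env["OMPI_MCA_btl"] = "^smcuda"
    return env


print(" ".join(mpirun_command("./test.exe", 4)))
```

Either form could be passed to `subprocess.run` when launching a single failing test by hand; only one of the two is needed.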

@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 17, 2024
Signed-off-by: Samuel E. Browne <[email protected]>
@sebrowne added and removed the AT2-SpecialApprove (Beta) label on Sep 18, 2024
@sebrowne
Contributor Author

New results with the Kokkos option, my disabling of the smcuda BTL, and the Intrepid2 test running serially: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=217579

Seeing the same tests fail (except for the Intrepid2 one), but in a perhaps more straightforward way? I see NaN errors from Belos.

@cgcgcg
Contributor

cgcgcg commented Sep 18, 2024

@sebrowne Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong.

@jhux2
Member

jhux2 commented Sep 18, 2024

> Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong.

FYI, the option -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF resolved the same warning I was seeing on the CEE LAN, and more importantly it led to a 2x speedup.

@masterleinad
Contributor

We decided to make -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF the default again in Kokkos (but that will likely only be visible in Trilinos with the next release).
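Until that default reaches Trilinos, the option has to be passed explicitly at configure time. A minimal Python sketch of how a configure wrapper might force the setting exactly once is shown here; the `with_cache_option` helper and the example argument list are hypothetical and are not taken from the Trilinos framework scripts.

```python
def with_cache_option(configure_args, name, value):
    """Return a new argument list in which -D<name>=<value> appears exactly once."""
    prefix = f"-D{name}"
    # Drop any earlier setting of the same cache variable (typed or untyped).
    kept = [
        arg for arg in configure_args
        if not (arg.startswith(prefix + "=") or arg.startswith(prefix + ":"))
    ]
    return kept + [f"{prefix}={value}"]


# Placeholder arguments for illustration only.
args = with_cache_option(
    ["-DCMAKE_BUILD_TYPE=Release", "-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON"],
    "Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC",
    "OFF",
)
print(args)
```

Removing earlier occurrences first keeps the resulting command line unambiguous about which value is in effect.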

@cgcgcg
Contributor

cgcgcg commented Sep 19, 2024

I just went down the rabbit hole. The issue in this test
https://sems-cdash-son.sandia.gov/cdash/test/8036828
is caused by a bug in OpenBLAS versions before 0.3.27; see OpenMathLib/OpenBLAS@7e9b1c0.
(Essentially, a double variable was wrongly interpreted as a complex double, and then the imaginary part was used in a division. If the imaginary part happened to be zero, this resulted in a NaN.)
Updating the container to a more recent version of OpenBLAS should fix that test.
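To check whether a given container image is still on an affected release, something like the following Python sketch could be used. It assumes the OpenBLAS version is available as a plain dotted numeric string such as "0.3.26" (how that string is obtained from the container is left out), and the function names are hypothetical.

```python
FIRST_FIXED = (0, 3, 27)  # per the comment above: the bug affects versions before 0.3.27


def parse_version(text):
    """Turn a dotted numeric version such as '0.3.26' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def openblas_is_affected(version):
    """True if the given OpenBLAS version predates the fix."""
    return parse_version(version) < FIRST_FIXED


for candidate in ("0.3.9", "0.3.26", "0.3.27"):
    print(candidate, openblas_is_affected(candidate))
```

Comparing integer tuples rather than raw strings matters here, since a string comparison would rank "0.3.9" above "0.3.27".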

@sebrowne
Contributor Author

Perfect, thank you so much! I'll get on fixing it ASAP.

Labels
AT: WIP Causes the PR autotester to not test the PR. (Remove to allow testing to occur.) AT2-SpecialApprove (Beta) Special approval label for AT2.
6 participants