Update toolchains on tioga, lassen, ruby and poodle #1712

adrienbernede · 2024-08-06T14:25:59Z

This PR addresses most of #1683.

- Remove Intel 19
- Update to Intel 2023: failing -> allowed to fail
- ~~Update corona to ROCm 6.0.2~~ -> same error as with tioga, needs ROCm 6.1.x; reverted to ROCm 5.7.1
- Update tioga to ROCm 6.1.2
- Add OpenMP target job
- Remove CUDA 10 job (blueos system default is now 11.2.0)
- Update to cce 18: failing -> allowed to fail; cce 17.0.1 restored for now

❗ TODO before merging ❗:

- Merge Config updates radiuss-spack-configs#108 and update submodule accordingly.
- Fix errors below -> Allowed to fail: jobs still running, fixes expected soon-ish.

Errors to investigate:

Intel 2023 on ruby and poodle:

The same single test fails on both machines.

278: [----------] 1 test from OpenMP/LaunchParamExptReduceSumBasicTest/2, where TypeParam = camp::list<long, double, camp::resources::v1::Host, camp::list<RAJA::LaunchPolicy<RAJA::policy::omp::omp_launch_t>, RAJA::LoopPolicy<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > >, RAJA::policy::omp::omp_reduce>
278: [ RUN      ] OpenMP/LaunchParamExptReduceSumBasicTest/2.ReduceSumBasicForall
278: /g/g20/rajasa/.jacamar-ci/builds/_zn1e2ze/000/gitlab/radiuss/RAJA/test/functional/launch/reduce-params/tests/test-launch-basic-param-expt-ReduceSum.hpp:71: Failure
278: Expected equality of these values:
278:   static_cast<DATA_TYPE>(sum)
278:     Which is: 24087
278:   ref_sum
278:     Which is: 101051
278: 
278: [  FAILED  ] OpenMP/LaunchParamExptReduceSumBasicTest/2.ReduceSumBasicForall, where TypeParam = camp::list<long, double, camp::resources::v1::Host, camp::list<RAJA::LaunchPolicy<RAJA::policy::omp::omp_launch_t>, RAJA::LoopPolicy<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > >, RAJA::policy::omp::omp_reduce> (27 ms)
278: [----------] 1 test from OpenMP/LaunchParamExptReduceSumBasicTest/2 (27 ms total)

cce 18 on tioga:

Only one test fails, with several instances of the following:

557: [----------] 1 test from BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10, where TypeParam = std::tuple<int, RAJA::policy::omp::omp_atomic>
557: [ RUN      ] BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10.BasicMinMaxs
557: /g/g20/rajasa/.jacamar-ci/builds/PxDL3V6B/001/gitlab/radiuss/RAJA/test/unit/atomic/test-atomic-ref-minmax.cpp:44: Failure
557: Expected equality of these values:
557:   result
557:     Which is: 87
557:   (T)91
557:     Which is: 91
557: 
557: [  FAILED  ] BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10.BasicMinMaxs, where TypeParam = std::tuple<int, RAJA::policy::omp::omp_atomic> (0 ms)
557: [----------] 1 test from BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10 (0 ms total)

Notes

Regarding clang 16.0.6 + gcc 11.2.1
- When building with clang 16.0.6 + gcc 11.2.1, loading the cuda 11.8.0 module does not appear to be enough. There is a clang 16.0.6 + cuda 11.8.0 + gcc 11.2.1 wrapper on lassen, and this wrapper adds CUDA to LIBRARY_PATH, the link flags and PATH:
```
  export LIBRARY_PATH=${LIBRARY_PATH}:${CUDA_LIB}
  ADD_FOR_LINK="-L${CUDA_LIB} -L${CUDA_LIBDEVICE} $ADD_FOR_LINK"
  export PATH=$CUDA_BIN:$PATH
```
Normally, loading cuda 11.8.0 the way we do should cause the wrapper to use it. Instead, we get this error (in the camp build):
```
     32    -- Creating BLT CUDA targets...
>> 33    CMake Error at /usr/tce/packages/cmake/cmake-3.23.1/share/cmake/Modules/CMakeDetermineCompilerId.cmake:743 (message):
   34      Compiling the CUDA compiler identification source file
   35      "CMakeCUDACompilerId.cu" failed.
   36
   37      Compiler: /usr/tce/packages/cuda/cuda-11.8.0/bin/nvcc
```
- We associate clang with the xl Fortran compilers. However, xl + gcc 11.2.1 is only available in combination with cuda 11.8.0.
  -> In conclusion, we use the LC-defined clang 16.0.6 + cuda 11.8.0 + gcc 11.2.1, associated with xlf 16.1.1.14 + cuda 11.8.0 + gcc 11.2.1. It looks like we are increasingly bound to use LC wrappers (after already having to use them to enforce the gcc toolchain in the Spack context).
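As an illustration of the point above, here is a minimal sketch of a check that the environment contains what the wrapper exports. The helper names `path_has_entry` and `check_cuda_env` are hypothetical, the default CUDA root is taken from the `nvcc` path in the error log, and the `lib64` subdirectory is an assumption (the actual value of `CUDA_LIB` in the wrapper is not shown here).

```shell
#!/bin/sh
# Hypothetical helper: succeed if a colon-separated list ($1) contains entry ($2).
path_has_entry() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical check: report which of the wrapper's exports are missing.
# $1 is the CUDA install root; the default is the prefix seen in the error log.
check_cuda_env() {
  cuda_root="${1:-/usr/tce/packages/cuda/cuda-11.8.0}"
  status=0
  if ! path_has_entry "${PATH:-}" "${cuda_root}/bin"; then
    echo "PATH is missing ${cuda_root}/bin"
    status=1
  fi
  # lib64 is an assumed layout, not read from the LC wrapper
  if ! path_has_entry "${LIBRARY_PATH:-}" "${cuda_root}/lib64"; then
    echo "LIBRARY_PATH is missing ${cuda_root}/lib64"
    status=1
  fi
  return "$status"
}
```

Running `check_cuda_env` after loading only the cuda module would show whether the module alone sets both variables. It says nothing about the `ADD_FOR_LINK` flags, which live inside the wrapper.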

rhornung67 · 2024-08-09T15:31:04Z

@adrienbernede I will get back to you for a recommendation about testing OpenMP target.

adrienbernede · 2024-08-15T18:58:56Z

@rhornung67 I think at least some of the above failures should be addressed by the RAJA teams. For the others, we can decide to allow the jobs to fail.

rhornung67 · 2024-08-15T19:02:52Z

@adrienbernede the test failure on intel2023 (poodle and ruby) is a known issue. @artv3 is looking into it, I think. I haven't seen the cce18 failure before. Is that a new version of Cray compiler in our CI?

adrienbernede · 2024-08-15T20:36:49Z

> Is that a new version of Cray compiler in our CI?

@rhornung67 Yes, it’s even the new default.

rhornung67 · 2024-08-15T21:23:55Z

Yikes! We'll look into the cce18 failure

rhornung67 · 2024-08-28T22:12:37Z

@adrienbernede we identified the cce18 failure as a compiler issue (we can reproduce outside of RAJA). A ticket has been submitted and it is being tracked by one of our HPE POCs. The Intel failures may also be compiler issues. The errors go away if we build with -O0 or -O1. We reported to LC and are waiting on their recommendation to address.

So for now, I think we go with allowing failures for cce18 and intel. Also, we should probably add cce17 back in until the cce18 issue is resolved.
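For reference, a minimal sketch of how the optimization-level observation could be checked by hand. It only prints the commands for each level. The function name `opt_sweep_cmds`, the build directory names, the default test pattern and the CMake cache settings are all assumptions for illustration, not values taken from the RAJA CI.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the configure/build/test commands
# for one optimization level. $1 = level (O0..O3), $2 = ctest -R pattern.
opt_sweep_cmds() {
  level="$1"
  test_regex="${2:-ReduceSum}"   # assumed pattern selecting the failing test
  build_dir="build-${level}"     # one build tree per optimization level
  echo "cmake -S . -B ${build_dir} -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE=-${level} -DENABLE_OPENMP=ON"
  echo "cmake --build ${build_dir} --parallel"
  echo "ctest --test-dir ${build_dir} -R ${test_regex} --output-on-failure"
}

# Emit the commands for every level so the results can be compared side by side.
for level in O0 O1 O2 O3; do
  opt_sweep_cmds "$level"
done
```

If the test passes at `-O0` and `-O1` but fails at higher levels, that is consistent with the compiler-issue hypothesis, though it does not prove it.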

adrienbernede · 2024-09-02T14:34:05Z

I just allowed intel and cce 18 jobs to fail, and added a cce 17 job just for RAJA (still using cce 18 for other jobs). Is that OK?

Also, could you confirm that, on ruby and poodle, you want:

- ~shared +openmp +vectorization +tests applied to the intel shared job
- ~shared +openmp +omptask +tests applied to the clang and gcc shared jobs

Also, do we still need to default to blt@develop in CI? If so, why?

rhornung67 · 2024-09-03T15:54:31Z

@adrienbernede the changes you described make sense. We can return to not allowing failures after we have the issues resolved.

The specs for ruby and poodle you mention are good.

I don't know why we are defaulting to BLT@develop. I think you set that up a while ago. I think it makes sense to point to the BLT 0.6.2 release, which is what we are using in the RAJA submodule.

adrienbernede · 2024-09-10T07:55:11Z

@rhornung67 this is ready. Your approval being a month old, I’d like a quick second look from you.

adrienbernede added 8 commits August 6, 2024 16:23

Update rocm and cce versions for both corona and tioga, updates of lassen specs

6954818

From RSC: Fix: add missing compilers and corresponding external packages

c7d4d4c

From RSC: Deactivate rocm 5.7 job on tioga

0b5e471

From RSC: Fix: need to point at compiler wrapper with cuda 11.8 defined, module is not enough

c1170a1

From RSC: Fix: use wrapper with cuda 11.8 consistently + change in naming convention to match LC’s

b6e3645

Do not allow intel jobs to fail on ruby and poodle

919ea3b

Merge branch 'develop' into woptim/rsc-update

bb35a59

From RSC: Add cuda to xl spec relying on LC wrapper with cuda

b1386d9

adrienbernede changed the title ~~Update rocm and cce versions for both corona and tioga, updates of la…~~ Update toolchains on tioga, lassen, ruby and poodle Aug 7, 2024

adrienbernede added 5 commits August 7, 2024 16:53

From RSC: Fix

949414a

From RSC: Clean drop of rocm 5.7.0 in favor on 5.7.1 on corona

b6357a5

Merge branch 'develop' into woptim/rsc-update

b3b7821

From RSC: Update cray-mpich and add rocm 6.2.0: only apply cray-mpich first

0260d5d

Update rocm in tioga CI

81cc328

rhornung67 approved these changes Aug 9, 2024

adrienbernede changed the title ~~Update toolchains on tioga, lassen, ruby and poodle~~ [WIP] Update toolchains on tioga, lassen, ruby and poodle Aug 9, 2024

adrienbernede added 2 commits August 9, 2024 23:46

From RSC: Enforce coherency between rocm software stack and compiler + fix

d9fa65e

From RSC: Fix typo: rocm compiler is rocmcc

9865c3d

Merge branch 'develop' into woptim/rsc-update

f698da0

Merge branch 'develop' into woptim/rsc-update

9472f24

rhornung67 mentioned this pull request Aug 30, 2024

Update GitLab CI content to match Adrien's RAJA PR LLNL/RAJAPerf#477

Closed

adrienbernede and others added 2 commits September 2, 2024 16:40

Allow failure for intel jobs on ruby and poodle and cce 18 jobs on tioga, add cce 17 job on tioga

9161b2b

Merge branch 'develop' into woptim/rsc-update

6b30e1c

adrienbernede added 2 commits September 4, 2024 10:29

From RSC: Remove XL jobs from shared CI jobs

0c62ada

Remove XL jobs defined locally too

2a3ce5c

adrienbernede mentioned this pull request Sep 4, 2024

Config updates LLNL/radiuss-spack-configs#108

Merged

15 tasks

adrienbernede added 2 commits September 5, 2024 10:11

Point at main branch in RSC

ad48eb6

Do not enforce blt@develop anymore

ac40ebd

adrienbernede changed the title ~~[WIP] Update toolchains on tioga, lassen, ruby and poodle~~ Update toolchains on tioga, lassen, ruby and poodle Sep 6, 2024

Merge branch 'develop' into woptim/rsc-update

392072d

adrienbernede merged commit 3ec77a7 into develop Sep 13, 2024
26 checks passed

adrienbernede deleted the woptim/rsc-update branch September 13, 2024 09:18
