Update toolchains on tioga, lassen, ruby and poodle #1712

Merged: 24 commits, Sep 13, 2024

Member

@adrienbernede adrienbernede commented Aug 6, 2024

This PR addresses most of #1683.

  • Remove Intel 19
  • Update to Intel 2023: Failing -> allowed to fail
  • Update corona to ROCm 6.0.2 -> same error as with tioga, needs ROCm 6.1.x; reverted to ROCm 5.7.1
  • Update tioga to ROCm 6.1.2
  • Add OpenMP target job
  • Remove CUDA 10 job (blueos system default is now 11.2.0)
  • Update to cce 18: Failing -> allowed to fail + restore cce 17.0.1 for now.

❗ TODO before merging ❗:

Errors to investigate:

The same single test fails on both machines.

278: [----------] 1 test from OpenMP/LaunchParamExptReduceSumBasicTest/2, where TypeParam = camp::list<long, double, camp::resources::v1::Host, camp::list<RAJA::LaunchPolicy<RAJA::policy::omp::omp_launch_t>, RAJA::LoopPolicy<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > >, RAJA::policy::omp::omp_reduce>
278: [ RUN      ] OpenMP/LaunchParamExptReduceSumBasicTest/2.ReduceSumBasicForall
278: /g/g20/rajasa/.jacamar-ci/builds/_zn1e2ze/000/gitlab/radiuss/RAJA/test/functional/launch/reduce-params/tests/test-launch-basic-param-expt-ReduceSum.hpp:71: Failure
278: Expected equality of these values:
278:   static_cast<DATA_TYPE>(sum)
278:     Which is: 24087
278:   ref_sum
278:     Which is: 101051
278: 
278: [  FAILED  ] OpenMP/LaunchParamExptReduceSumBasicTest/2.ReduceSumBasicForall, where TypeParam = camp::list<long, double, camp::resources::v1::Host, camp::list<RAJA::LaunchPolicy<RAJA::policy::omp::omp_launch_t>, RAJA::LoopPolicy<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > >, RAJA::policy::omp::omp_reduce> (27 ms)
278: [----------] 1 test from OpenMP/LaunchParamExptReduceSumBasicTest/2 (27 ms total)
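For context, the failing check above compares a parallel sum against a reference value. A minimal standalone sketch of that kind of check, written with plain OpenMP instead of RAJA, could look like the following. The names `reference_sum`, `omp_reduce_sum`, and `run_reduce_sum_check` are hypothetical, and the data pattern is an assumption; none of this is taken from the RAJA test source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sequential reference sum (plays the role of ref_sum).
double reference_sum(const std::vector<double>& data) {
  double sum = 0.0;
  for (double v : data) {
    sum += v;
  }
  return sum;
}

// Hypothetical helper: the same sum computed with an OpenMP reduction.
// If OpenMP is not enabled at compile time, the pragma is ignored and
// the loop simply runs serially.
double omp_reduce_sum(const std::vector<double>& data) {
  double sum = 0.0;
  const long n = static_cast<long>(data.size());
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}

// Assumed data pattern: small integer values, so the double sum is exact
// regardless of the order in which threads combine their partial sums.
void run_reduce_sum_check() {
  std::vector<double> data(1000);
  for (long i = 0; i < 1000; ++i) {
    data[i] = static_cast<double>(i % 7);
  }
  assert(omp_reduce_sum(data) == reference_sum(data));
}
```

Building a reproducer like this at different optimization levels is one way to check whether a mismatch depends on compiler optimization.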

Only one test fails, with several failures like this one:

557: [----------] 1 test from BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10, where TypeParam = std::tuple<int, RAJA::policy::omp::omp_atomic>
557: [ RUN      ] BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10.BasicMinMaxs
557: /g/g20/rajasa/.jacamar-ci/builds/PxDL3V6B/001/gitlab/radiuss/RAJA/test/unit/atomic/test-atomic-ref-minmax.cpp:44: Failure
557: Expected equality of these values:
557:   result
557:     Which is: 87
557:   (T)91
557:     Which is: 91
557: 
557: [  FAILED  ] BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10.BasicMinMaxs, where TypeParam = std::tuple<int, RAJA::policy::omp::omp_atomic> (0 ms)
557: [----------] 1 test from BasicMinMaxUnitTest/AtomicRefBasicMinMaxUnitTest/10 (0 ms total)
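The failure above concerns an atomic min/max update returning an unexpected value. As a rough illustration of what an atomic max has to guarantee, here is a minimal sketch based on a `std::atomic` compare-and-swap loop. This is an assumption-laden stand-in: it is not how RAJA's `omp_atomic` policy is implemented, and `atomic_fetch_max` and `parallel_max` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not RAJA's implementation): raise `target` to
// `value` if `value` is larger, and return the value observed just
// before any update took place.
int atomic_fetch_max(std::atomic<int>& target, int value) {
  int observed = target.load();
  // On failure, compare_exchange_weak stores the current value into
  // `observed`, so the loop re-checks against fresh data.
  while (observed < value &&
         !target.compare_exchange_weak(observed, value)) {
  }
  return observed;
}

// Hypothetical driver: every iteration competes to publish its index.
// For n > 0 the result must be n - 1, whatever the thread schedule.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
int parallel_max(int n) {
  std::atomic<int> result{0};
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    atomic_fetch_max(result, i);
  }
  return result.load();
}
```

A check in the spirit of the failing test would assert that the final value equals the largest value ever offered.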

Notes

  • Regarding clang 16.0.6 + gcc 11.2.1
    • When building with clang 16.0.6 + gcc 11.2.1, loading the cuda 11.8.0 module does not appear to be enough. A clang 16.0.6 + cuda 11.8.0 + gcc 11.2.1 wrapper exists on lassen, and it ends up adding cuda to LIBRARY_PATH, the link flags, and PATH:
      export LIBRARY_PATH=${LIBRARY_PATH}:${CUDA_LIB}
      ADD_FOR_LINK="-L${CUDA_LIB} -L${CUDA_LIBDEVICE} $ADD_FOR_LINK"
      export PATH=$CUDA_BIN:$PATH
    
    Normally, loading cuda 11.8.0 like we do should cause the wrapper to use it. The error we get is (in camp build):
         32    -- Creating BLT CUDA targets...
    >> 33    CMake Error at /usr/tce/packages/cmake/cmake-3.23.1/share/cmake/Modules/CMakeDetermineCompilerId.cmake:743 (message):
       34      Compiling the CUDA compiler identification source file
       35      "CMakeCUDACompilerId.cu" failed.
       36
       37      Compiler: /usr/tce/packages/cuda/cuda-11.8.0/bin/nvcc
    
    • We associate clang with the xl Fortran compilers. However, xl + gcc 11.2.1 is only available in combination with cuda 11.8.0.
      -> In conclusion, we use the LC-defined clang 16.0.6 + cuda 11.8.0 + gcc 11.2.1, associated with xlf 16.1.1.14 + cuda 11.8.0 + gcc 11.2.1. It looks like we are increasingly bound to use LC wrappers (after already having to use them to enforce the gcc toolchain in the spack context).

@adrienbernede adrienbernede changed the title Update rocm and cce versions for both corona and tioga, updates of la… Update toolchains on tioga, lassen, ruby and poodle Aug 7, 2024
@rhornung67
Member

@adrienbernede I will get back to you for a recommendation about testing OpenMP target.

@adrienbernede adrienbernede changed the title Update toolchains on tioga, lassen, ruby and poodle [WIP] Update toolchains on tioga, lassen, ruby and poodle Aug 9, 2024
@adrienbernede
Member Author

@rhornung67 I think at least some of the above failures should be addressed by the RAJA team. For the others, we can decide to allow the jobs to fail.

@rhornung67
Member

@adrienbernede the test failure on intel2023 (poodle and ruby) is a known issue. @artv3 is looking into it, I think. I haven't seen the cce18 failure before. Is that a new version of Cray compiler in our CI?

@adrienbernede
Member Author

adrienbernede commented Aug 15, 2024

Is that a new version of Cray compiler in our CI?

@rhornung67 Yes, it’s even the new default.

@rhornung67
Member

Yikes! We'll look into the cce18 failure

@rhornung67
Member

rhornung67 commented Aug 28, 2024

@adrienbernede we identified the cce18 failure as a compiler issue (we can reproduce it outside of RAJA). A ticket has been submitted and is being tracked by one of our HPE POCs. The Intel failures may also be compiler issues: the errors go away if we build with -O0 or -O1. We reported this to LC and are waiting on their recommendation for how to address it.

So for now, I think we go with allowing failures for cce18 and intel. Also, we should probably add cce17 back in until the cce18 issue is resolved.

@adrienbernede
Member Author

I just allowed the intel and cce 18 jobs to fail, and added a cce 17 job just for RAJA (still using cce 18 for other jobs). Is that OK?

Also, could you confirm that, on ruby and poodle, you want:

  • ~shared +openmp +vectorization +tests applied to the intel shared job
  • ~shared +openmp +omptask +tests applied to the clang and gcc shared jobs.

Also, do we still need to default to blt@develop in CI? If so, why?

@rhornung67
Member

@adrienbernede the changes you described make sense. We can return to not allowing failures after we have the issues resolved.

The specs for ruby and poodle you mention are good.

I don't know why we are defaulting to BLT@develop. I think you set that up a while ago. I think it makes sense to point to the BLT 0.6.2 release, which is what we are using in the RAJA submodule.

@adrienbernede adrienbernede changed the title [WIP] Update toolchains on tioga, lassen, ruby and poodle Update toolchains on tioga, lassen, ruby and poodle Sep 6, 2024
@adrienbernede
Member Author

@rhornung67 this is ready. Since your approval is a month old, I’d like a quick second look from you.

@adrienbernede adrienbernede merged commit 3ec77a7 into develop Sep 13, 2024
26 checks passed
@adrienbernede adrienbernede deleted the woptim/rsc-update branch September 13, 2024 09:18