Ye Luo edited this page Sep 17, 2021 · 42 revisions

To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.

-DENABLE_OFFLOAD=1

Nvidia GPU

To use the CUDA math libraries in conjunction with OpenMP offload, add the following cmake flag.

-DENABLE_CUDA=1 # This is not the QMC_CUDA flag for the CUDA kernels.
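
As a minimal sketch, the two flags above can be passed in a single configure step. The source directory location (`..`) and the variable name are assumptions made for illustration; the command is printed rather than executed so the snippet is safe to paste anywhere.

```shell
# Flags taken from this page; the source directory (..) is an assumption.
NVIDIA_OFFLOAD_FLAGS="-DENABLE_OFFLOAD=1 -DENABLE_CUDA=1"

# Print the configure command; drop the 'echo' to actually run it.
echo cmake $NVIDIA_OFFLOAD_FLAGS ..
```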

NVHPC

NVHPC 21.7 has the following issues.

  1. test_particle fails due to a target nowait bug.
  2. On CPU, the quadrature check in Numerics/Quadrature.h fails due to bad vectorization.
  3. std::min is broken inside an offload region. Use #define MIN(a,b) ((a) <= (b) ? (a) : (b)) instead.
  4. deterministic-diamondC_2x1x1_pp-vmcbatch-dmcbatch-mwalkers_sdbatch_sdj-1-4 and 1-16 do not pass.

XL

The Linux version of XL 16.1.1 is not fully C++14 compliant, but it is sufficient for current QMCPACK needs. Use -qxflag=disable__cplusplusOverride to override the __cplusplus macro and enable C++14 features. Use the following cmake line on Summit (P9+V100).

cmake -DCMAKE_C_COMPILER=mpixlc -DCMAKE_CXX_COMPILER=mpixlC \
      -DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
      -DCMAKE_CXX_FLAGS="-qxflag=disable__cplusplusOverride -isystem /sw/summit/gcc/6.4.0/include/c++/6.4.0/powerpc64le-none-linux-gnu -qgcc_cpp_stdinc=/sw/summit/gcc/6.4.0/include/c++/6.4.0" \
      -DCMAKE_CXX_STANDARD_LIBRARIES=/sw/summit/gcc/6.4.0/lib64/libstdc++.a \
      ..

To report register and shared memory (smem) usage:

  1. Add -Xptxas -v to CMAKE_CXX_FLAGS to print the report for each .cpp file at compile time.
  2. Add -Xnvcc --nvlink-options=-v to CMAKE_EXE_LINKER_FLAGS to print the report at link time.
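
The two reporting options above might be wired into an XL configure step roughly as follows. This is a sketch only: the variable names are hypothetical, the Summit-specific include flags from the cmake line above are left out for brevity and would need to stay in CMAKE_CXX_FLAGS, and the command is printed rather than run.

```shell
# Compile-time report: ptxas prints resource usage for each .cpp file.
# The Summit-specific -isystem/-qgcc_cpp_stdinc flags are omitted here (assumption:
# they are appended to this same variable in a real build).
XL_CXX_FLAGS="-qxflag=disable__cplusplusOverride -Xptxas -v"

# Link-time report from nvlink.
XL_LINKER_FLAGS="-Xnvcc --nvlink-options=-v"

# Print the configure command; drop the 'echo' to actually run it.
echo cmake -DCMAKE_CXX_FLAGS="'$XL_CXX_FLAGS'" \
           -DCMAKE_EXE_LINKER_FLAGS="'$XL_LINKER_FLAGS'" ..
```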

Clang

Although the LLVM Clang compiler supports OpenMP offload, a few outstanding bugs prevent it from compiling and running QMCPACK. Known issues:

  1. Only CUDA 10.0 and below are supported. https://bugs.llvm.org/show_bug.cgi?id=44587 libomptarget needs to be built with Clang 10.
  2. A cmath/math.h header file conflict affects x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799 Fixes to be released in Clang 11.
  3. Static linking of fat binaries is still broken and causes a runtime error. https://bugs.llvm.org/show_bug.cgi?id=42395 and https://bugs.llvm.org/show_bug.cgi?id=38703. As a workaround, add -DUSE_OBJECT_TARGET=ON in cmake.
  4. The offload library is single threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
  5. (Only checked with Clang 8, not recently due to issues 1, 2 and 3.) When OpenMP offload and CUDA are both enabled with the Clang compiler, some CUDA execution fails on x86_64. Fix to be released in Clang 11.
  6. Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257 Fix to be released in Clang 11.
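
For the static linking workaround in issue 3, a configure step might look like the sketch below. It assumes that clang and clang++ are on the PATH, that the source tree is at `..`, and that the option is spelled USE_OBJECT_TARGET; none of these are confirmed by this page beyond the workaround flag itself. The command is printed rather than run.

```shell
# Compiler names are assumptions; adjust to the local Clang installation.
CLANG_COMPILERS="-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++"

# Offload flag from this page plus the object-target workaround from issue 3
# (option name assumed to be USE_OBJECT_TARGET).
CLANG_OFFLOAD_FLAGS="-DENABLE_OFFLOAD=1 -DUSE_OBJECT_TARGET=ON"

# Print the configure command; drop the 'echo' to actually run it.
echo cmake $CLANG_COMPILERS $CLANG_OFFLOAD_FLAGS ..
```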

To report register and shared memory (smem) usage:

  1. Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print the report for each .cpp file at compile time.
  2. Add -v to CMAKE_EXE_LINKER_FLAGS to print the report at link time.

For debugging or profiling:

  1. Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS.
  2. Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS.
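
One way to keep the Clang reporting, profiling and debugging flags above apart is a small mode switch, sketched below. The QMC_FLAG_MODE variable and the mode names are hypothetical, and mapping line info to profiling and the no-optimization device build to debugging is this sketch's reading of the list, not something the page states. The command is printed rather than run.

```shell
# QMC_FLAG_MODE is a hypothetical switch for this sketch: report, profile or debug.
QMC_FLAG_MODE="${QMC_FLAG_MODE:-report}"

case "$QMC_FLAG_MODE" in
  report)   # register and shared memory usage at compile and link time
    CLANG_CXX_FLAGS="-Xcuda-ptxas -v"
    CLANG_LINKER_FLAGS="-v"
    ;;
  profile)  # keep source line information in device code
    CLANG_CXX_FLAGS="-Xcuda-ptxas --generate-line-info"
    CLANG_LINKER_FLAGS=""
    ;;
  debug)    # unoptimized device code with debug information
    CLANG_CXX_FLAGS="--cuda-noopt-device-debug"
    CLANG_LINKER_FLAGS=""
    ;;
  *)
    echo "unknown QMC_FLAG_MODE: $QMC_FLAG_MODE" >&2
    exit 1
    ;;
esac

# Print the configure command; drop the 'echo' to actually run it.
echo cmake -DCMAKE_CXX_FLAGS="'$CLANG_CXX_FLAGS'" \
           -DCMAKE_EXE_LINKER_FLAGS="'$CLANG_LINKER_FLAGS'" ..
```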

Cray

The Clang-derived Cray compilers (version 9.0) can compile but cannot link QMCPACK.

cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
      -DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
      -DQMC_MIXED_PRECISION=1 -DCUDA_ARCH=sm_70 \
      -DCUDA_HOST_COMPILER=`which gcc` -DENABLE_TIMERS=1 ..

Known issues:

  1. Clang issue 2 affects Cray 9.1 and 10.
  2. Fat binary linker error for modf, sincos and sincosf with Cray 9.0:
@E@nvlink error   : Undefined reference to 'modf' in '/tmp/cooltmp-fed625/tmp_cce_omp_offload_linkerlibqmcwfs.a__SplineC2ROMP.cpp.o__sec.cubin'

  3. Only the default stream is used in the Cray 9.0 OpenMP runtime library.

AMD GPU

AOMP

Use the AOMP compiler. Verified with the 0.7-6 release on a Radeon VII.

cmake -D CMAKE_C_COMPILER=/usr/lib/aomp/bin/clang  -D CMAKE_CXX_COMPILER=/usr/lib/aomp/bin/clang++ \
      -D ENABLE_OFFLOAD=1 \
      -D OFFLOAD_TARGET=amdgcn-amd-amdhsa \
      -D OFFLOAD_ARCH=gfx906 \
      -D QMC_MPI=0 ..

Known issues:

  1. Due to Clang issues 4 and 5, libomptarget is only safe to use with 1 thread. AOMP supports multiple GPU queues, and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
  2. Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24
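
Given the single-thread limitation in issue 1, a run with AOMP might pin the host to one OpenMP thread, as in this sketch. OMP_NUM_THREADS is the standard OpenMP environment variable; the binary path and input file name are placeholders, not taken from this page, and the launch line is printed rather than executed.

```shell
# Restrict the host side to a single OpenMP thread.
export OMP_NUM_THREADS=1

# Hypothetical launch line: binary path and input file are placeholders.
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS ./bin/qmcpack input.xml"
```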