OpenMP offload

To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.

-DENABLE_OFFLOAD=1

Nvidia GPU

To use the CUDA math libraries together with OpenMP offload, also add the following cmake flag.

-DENABLE_CUDA=1 # This is not the QMC_CUDA flag for the CUDA kernels.

NVHPC

NVHPC 21.7 has the following known issues.

test_particle fails due to a target nowait bug.
On CPU, the quadrature check in Numerics/Quadrature.h fails due to bad vectorization.
std::min is bad in offload regions. Use #define MIN(a,b) ((a) <= (b) ? (a) : (b)) instead.
deterministic-diamondC_2x1x1_pp-vmcbatch-dmcbatch-mwalkers_sdbatch_sdj-1-4 and 1-16 do not pass.

XL

The Linux version of XL 16.1.1 is not fully C++14 compliant, but it is sufficient for current QMCPACK needs. Use -qxflag=disable__cplusplusOverride to override the C++ macro and enable C++14 features. Use the following cmake line on Summit P9+V100:

cmake -DCMAKE_C_COMPILER=mpixlc -DCMAKE_CXX_COMPILER=mpixlC \
      -DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
      -DCMAKE_CXX_FLAGS="-qxflag=disable__cplusplusOverride -isystem /sw/summit/gcc/6.4.0/include/c++/6.4.0/powerpc64le-none-linux-gnu -qgcc_cpp_stdinc=/sw/summit/gcc/6.4.0/include/c++/6.4.0" \
      -DCMAKE_CXX_STANDARD_LIBRARIES=/sw/summit/gcc/6.4.0/lib64/libstdc++.a \
      ..

To get register and shared memory (smem) usage:

Add -Xptxas -v to CMAKE_CXX_FLAGS to print usage per .cpp file
Add -Xnvcc --nvlink-options=-v to CMAKE_EXE_LINKER_FLAGS to print usage at link time

Clang

Although the LLVM Clang compiler supports OpenMP offload, a few outstanding bugs prevent it from compiling and running QMCPACK. Known issues:

1. ~~Only supports CUDA 10.0 and below. https://bugs.llvm.org/show_bug.cgi?id=44587~~ Need to build libomptarget with Clang 10.
2. ~~cmath/math.h header file conflict affecting x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799~~ Fix to be released in Clang 11.
3. Static linking of fat binaries is still broken and causes runtime errors. https://bugs.llvm.org/show_bug.cgi?id=42395 and https://bugs.llvm.org/show_bug.cgi?id=38703. As a workaround, add -DUSE_OBJECT_TARGET=ON in cmake.
4. ~~The offload library is single threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html~~ Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
5. ~~(Only checked with Clang 8, not recently, due to issues 1, 2 and 3.) When OpenMP offload and CUDA are both enabled with the Clang compiler, there is some CUDA execution failure on X86_64.~~ Fix to be released in Clang 11.
6. ~~Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257~~ Fix to be released in Clang 11.

To get register and shared memory (smem) usage:

Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print usage per .cpp file
Add -v to CMAKE_EXE_LINKER_FLAGS to print usage at link time

For debugging or profiling:

Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS
Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS

Cray

The Clang-derived Cray compiler 9.0 can compile but cannot link QMCPACK.

cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
      -DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
      -DQMC_MIXED_PRECISION=1 -DCUDA_ARCH=sm_70 \
      -DCUDA_HOST_COMPILER=`which gcc` -DENABLE_TIMERS=1 ..

Known issues:

Clang issue #2 affects Cray 9.1 and 10.
Fat binary linker errors for modf, sincos and sincosf with Cray 9.0:

@E@nvlink error   : Undefined reference to 'modf' in '/tmp/cooltmp-fed625/tmp_cce_omp_offload_linkerlibqmcwfs.a__SplineC2ROMP.cpp.o__sec.cubin'

Only the default stream is used in the Cray 9.0 OpenMP runtime library.

AMD GPU

AOMP

Use the AOMP compiler. Verified with the 0.7-6 release on a Radeon VII.

cmake -D CMAKE_C_COMPILER=/usr/lib/aomp/bin/clang  -D CMAKE_CXX_COMPILER=/usr/lib/aomp/bin/clang++ \
      -D ENABLE_OFFLOAD=1 \
      -D OFFLOAD_TARGET=amdgcn-amd-amdhsa \
      -D OFFLOAD_ARCH=gfx906 \
      -D QMC_MPI=0 ..

Due to Clang issues 4 and 5, libomptarget is only safe to use with 1 thread. AOMP supports multiple GPU queues, and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24
