OpenMP offload
To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.
-DENABLE_OFFLOAD=1
To use offload in conjunction with CUDA math libraries, also add the following cmake flag.
-DENABLE_CUDA=1 # This is not the QMC_CUDA flag for the CUDA kernels.
Version 21.7 has the following issues.
- failing test_particle due to target nowait bug.
- CPU. Numerics/Quadrature.h quadrature check failing due to bad vectorization
- std::min in an offload region is bad. Use
#define MIN(a,b) ((a) <= (b) ? (a) : (b))
instead.
- deterministic-diamondC_2x1x1_pp-vmcbatch-dmcbatch-mwalkers_sdbatch_sdj-1-4 and 1-16 not passing
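The std::min workaround listed above can be sketched as follows. This is only an illustration under stated assumptions, not QMCPACK code: the function name clamp_to_cap and its arguments are hypothetical, and the pragma is a generic OpenMP target loop. When the code is built without offload support, the loop simply runs on the host.

```cpp
// Workaround macro from the list above, used in place of std::min
// inside target regions. Note that each argument may be evaluated twice,
// so avoid arguments with side effects.
#define MIN(a, b) ((a) <= (b) ? (a) : (b))

// Hypothetical helper: limits every element of `data` to at most `cap`.
void clamp_to_cap(double* data, int n, double cap)
{
  // Offload the loop; `data` is copied to the device and back.
#pragma omp target teams distribute parallel for map(tofrom : data[0:n])
  for (int i = 0; i < n; ++i)
    data[i] = MIN(data[i], cap); // instead of std::min(data[i], cap)
}
```

For example, calling clamp_to_cap on the array {0.5, 2.0, 1.5, 3.0} with cap = 1.5 leaves {0.5, 1.5, 1.5, 1.5}.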
The Linux version of XL 16.1.1 is not fully C++14 compliant, but it is sufficient for current QMCPACK needs. Use -qxflag=disable__cplusplusOverride to override the C++ macro and use C++14 features.
Use the following cmake line on Summit P9+V100
cmake -DCMAKE_C_COMPILER=mpixlc -DCMAKE_CXX_COMPILER=mpixlC \
-DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
-DCMAKE_CXX_FLAGS="-qxflag=disable__cplusplusOverride -isystem /sw/summit/gcc/6.4.0/include/c++/6.4.0/powerpc64le-none-linux-gnu -qgcc_cpp_stdinc=/sw/summit/gcc/6.4.0/include/c++/6.4.0" \
-DCMAKE_CXX_STANDARD_LIBRARIES=/sw/summit/gcc/6.4.0/lib64/libstdc++.a \
..
To get register and shared memory (smem) usage:
- Add -Xptxas -v to CMAKE_CXX_FLAGS to print per .cpp file
- Add -Xnvcc --nvlink-options=-v to CMAKE_EXE_LINKER_FLAGS to print at linking
Although the LLVM Clang compiler supports OpenMP offload, a few outstanding bugs prevent it from compiling and running QMCPACK. Known issues:
1. Only CUDA 10.0 and below are supported. https://bugs.llvm.org/show_bug.cgi?id=44587 Need to build libomptarget with Clang 10.
2. cmath/math.h header file conflict affecting x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799 Fix to be released in Clang 11.
3. Static linking of fat binaries is still broken and causes runtime errors. https://bugs.llvm.org/show_bug.cgi?id=42395 and https://bugs.llvm.org/show_bug.cgi?id=38703. As a workaround, add -DUSE_OBJECT_TARGET=ON in cmake.
4. The offload library is single threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
5. (Only checked with Clang 8, not recently due to issues 1, 2 and 3.) When OpenMP offload and CUDA are both enabled with the Clang compiler, there is a CUDA execution failure on x86_64. Fix to be released in Clang 11.
6. Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257 Fix to be released in Clang 11.
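The last issue in the list above concerns the pattern in which several host OpenMP threads each launch their own target region. A minimal sketch of that pattern is shown here; it is an assumption about the general shape of such code, not an excerpt from QMCPACK, and the name scale_slices and the slice layout are hypothetical.

```cpp
// Hypothetical example: each host thread offloads its own slice of `data`.
// `data` holds n_slices contiguous slices of slice_len elements each.
void scale_slices(double* data, int n_slices, int slice_len, double factor)
{
  // Host-side parallelism: slices are distributed over host threads.
#pragma omp parallel for
  for (int s = 0; s < n_slices; ++s)
  {
    double* slice = data + s * slice_len;
    // Device-side parallelism: every host thread issues its own target
    // region, so several offload requests can be in flight at once.
#pragma omp target teams distribute parallel for map(tofrom : slice[0:slice_len])
    for (int i = 0; i < slice_len; ++i)
      slice[i] *= factor;
  }
}
```

With a runtime affected by the data race, code of this shape is only safe when run with a single host thread.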
To get register and shared memory (smem) usage:
- Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print per .cpp file
- Add -v to CMAKE_EXE_LINKER_FLAGS to print at linking
For debugging or profiling:
- Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS
- Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS
The Clang-derived Cray compiler 9.0 can compile but cannot link QMCPACK.
cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
-DENABLE_OFFLOAD=1 -DENABLE_CUDA=1 \
-DQMC_MIXED_PRECISION=1 -DCUDA_ARCH=sm_70 \
-DCUDA_HOST_COMPILER=`which gcc` -DENABLE_TIMERS=1 ..
Known issues:
- Clang issue #2 affects Cray 9.1 and 10.
- Fat binary linker error for modf, sincos and sincosf with Cray 9.0:
@E@nvlink error : Undefined reference to 'modf' in '/tmp/cooltmp-fed625/tmp_cce_omp_offload_linkerlibqmcwfs.a__SplineC2ROMP.cpp.o__sec.cubin'
- Only default stream is used in Cray 9.0 OpenMP runtime library.
Use the AOMP compiler. Verified with the 0.7-6 release on a Radeon VII.
cmake -D CMAKE_C_COMPILER=/usr/lib/aomp/bin/clang -D CMAKE_CXX_COMPILER=/usr/lib/aomp/bin/clang++ \
-D ENABLE_OFFLOAD=1 \
-D OFFLOAD_TARGET=amdgcn-amd-amdhsa \
-D OFFLOAD_ARCH=gfx906 \
-D QMC_MPI=0 ..
- Due to Clang issue 45, libomptarget is only safe to work with 1 thread. AOMP supports multiple GPU queues and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
- Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24