CUDA support issues on Power9+V100 systems #10

Open
nphtan opened this issue May 5, 2021 · 2 comments · May be fixed by #12

nphtan commented May 5, 2021

I'm running into issues building with CUDA support on Power9. The platform is a dual-socket Power9 node with 32 cores and 2 V100 GPUs per node. I've seen two issues so far. The first is a simple mistake at ResCudaSpace.hpp(273) that generates a series of syntax errors.
...
[ 25%] Building CXX object CMakeFiles/resilience.dir/src/resilience/cuda/ResCuda.cpp.o
nvcc_wrapper has been given GNU extension standard flag -std=gnu++14 - reverting flag to -std=c++14
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/cuda/ResCudaSpace.hpp(273): error: enable_if is not a template

The fix is to add the std:: qualifier to both the enable_if and is_same templates on line 273.
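To show the pattern in isolation, here is a minimal standalone sketch. It is not the actual ResCudaSpace.hpp code: `DemoSpace`, `OtherDemoSpace`, and `CanAccess` are hypothetical stand-ins. It only shows a partial specialization guarded by `std::enable_if` and `std::is_same`, both written with the `std::` qualifier.

```cpp
#include <type_traits>

// Hypothetical stand-ins for memory space tags; not the real Kokkos types.
struct DemoSpace {};
struct OtherDemoSpace {};

// Primary template: access is allowed by default.
template <class ExecSpace, class MemSpace, class Enable = void>
struct CanAccess {
  enum { value = true };
};

// Partial specialization selected only when MemSpace differs from DemoSpace.
// Both traits carry the std:: qualifier; without it the names resolve only
// if an enclosing namespace happens to declare templates with those names.
template <class MemSpace>
struct CanAccess<
    DemoSpace, MemSpace,
    typename std::enable_if<!std::is_same<DemoSpace, MemSpace>::value>::type> {
  enum { value = false };
};

// Compile-time checks of both branches.
static_assert(CanAccess<DemoSpace, DemoSpace>::value,
              "same space is accessible");
static_assert(!CanAccess<DemoSpace, OtherDemoSpace>::value,
              "a different space is rejected");
```

The checks run at compile time, so the sketch can be verified by compiling it as a header or translation unit with any C++11 compiler.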

The second error comes further along when building the tests.

[ 50%] Building CXX object tests/CMakeFiles/resilience_tests.dir/TestResilience.cpp.o
/home/ntan1/KokkosResilience/kokkos/build/install/include/impl/Kokkos_Profiling_Interface.hpp(79): error: incomplete type is not allowed
detected during:
instantiation of "uint32_t Kokkos::Profiling::Experimental::device_id(const ExecutionSpace &) [with ExecutionSpace=KokkosResilience::ResCuda]"
/home/ntan1/KokkosResilience/kokkos/build/install/include/Kokkos_Parallel.hpp(171): here
instantiation of "void Kokkos::parallel_for(const ExecPolicy &, const FunctorType &, const std::cxx11::string &, std::enable_if<Kokkos::is_execution_policy::value, void>::type *) [with ExecPolicy=Kokkos::RangePolicyKokkosResilience::ResCuda, FunctorType=lambda ->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(93): here
instantiation of "void TestResilientRange<ExecSpace, ScheduleType, DataType>::test_for() [with ExecSpace=Kokkos::Serial, ScheduleType=Kokkos::ScheduleKokkos::Static, DataType=int]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(117): here
instantiation of "void TestResilience_range_Test<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestResilience_range_Test<gtest_TypeParam_>::~TestResilience_range_Test() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestResilience_range_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestResilience_range_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestResilience_range_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestResilience_range_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "__nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char *, const testing::internal::CodeLocation &, const char *, const char *, int, const std::vector<std::_cxx11::string, std::allocatorstd::__cxx11::string> &) [with Fixture=TestResilience, TestSel=testing::internal::TemplateSel<TestResilience_range_Test>, Types=gtest_type_params_TestResilience]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(110): here

1 error detected in the compilation of "/tmp/tmpxft_00002401_00000000-6_TestResilience.cpp1.ii".
make[2]: *** [tests/CMakeFiles/resilience_tests.dir/TestResilience.cpp.o] Error 1
make[1]: *** [tests/CMakeFiles/resilience_tests.dir/all] Error 2
make: *** [all] Error 2

I'm not sure how to fix this.

nmm0 commented May 11, 2021

TestResilience.cpp should currently be disabled, since it relies on code that is not implemented. Are there additional tests giving problems?

nphtan commented May 13, 2021

There's a syntax bug in ResCudaSpace.hpp:

diff --git a/src/resilience/cuda/ResCudaSpace.hpp b/src/resilience/cuda/ResCudaSpace.hpp
index 970151e..8fc3209 100644
--- a/src/resilience/cuda/ResCudaSpace.hpp
+++ b/src/resilience/cuda/ResCudaSpace.hpp
@@ -270,7 +270,7 @@ struct VerifyExecutionCanAccessMemorySpace< KokkosResilience::ResCudaSpace , Kok
 /** Running in CudaSpace attempting to access an unknown space: error */
 template< class OtherSpace >
 struct VerifyExecutionCanAccessMemorySpace<
-  typename enable_if< ! is_same<KokkosResilience::ResCudaSpace,OtherSpace>::value , KokkosResilience::ResCudaSpace >
+  typename std::enable_if< ! std::is_same<KokkosResilience::ResCudaSpace,OtherSpace>::value , KokkosResilience::ResC
   OtherSpace >
 {
   enum { value = false };

With TestResilience.cpp removed and the syntax error fixed, the build fails while compiling TestVelocMemoryBackend.cpp with the following errors:

/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/util/Trace.hpp(288): error: expression must have class type
detected during:
instantiation of "auto KokkosResilience::Util::begin_trace<TraceType,Context,Args...>(Context &, Args &&...) [with TraceType=KokkosResilience::Util::TimingTracestd::__cxx11::string, Context=const char [9], Args=<>]"
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/AutomaticCheckpoint.hpp(132): here
instantiation of "void KokkosResilience::checkpoint(Context &, const std::cxx11::string &, int, F &&) [with Context=KokkosResilience::MPIContextKokkosResilience::VeloCMemoryBackend, F=lambda ->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(55): here
instantiation of "void TestVelocMemoryBackend::test_layout<Layout,Context>(Context &, std::size_t, std::size_t) [with ExecSpace=Kokkos::Serial, Layout=Kokkos::LayoutRight, Context=KokkosResilience::MPIContextKokkosResilience::VeloCMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(112): here
instantiation of "void TestVelocMemoryBackend_veloc_mem_Test<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestVelocMemoryBackend_veloc_mem_Test<gtest_TypeParam_>::~TestVelocMemoryBackend_veloc_mem_Test() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "__nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char *, const testing::internal::CodeLocation &, const char *, const char *, int, const std::vector<std::_cxx11::string, std::allocatorstd::__cxx11::string> &) [with Fixture=TestVelocMemoryBackend, TestSel=testing::internal::TemplateSel<TestVelocMemoryBackend_veloc_mem_Test>, Types=gtest_type_params_TestVelocMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(97): here

/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/util/Trace.hpp(290): error: expression must have class type
detected during:
instantiation of "auto KokkosResilience::Util::begin_trace<TraceType,Context,Args...>(Context &, Args &&...) [with TraceType=KokkosResilience::Util::TimingTracestd::__cxx11::string, Context=const char [9], Args=<>]"
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/AutomaticCheckpoint.hpp(132): here
instantiation of "void KokkosResilience::checkpoint(Context &, const std::cxx11::string &, int, F &&) [with Context=KokkosResilience::MPIContextKokkosResilience::VeloCMemoryBackend, F=lambda ->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(55): here
instantiation of "void TestVelocMemoryBackend::test_layout<Layout,Context>(Context &, std::size_t, std::size_t) [with ExecSpace=Kokkos::Serial, Layout=Kokkos::LayoutRight, Context=KokkosResilience::MPIContextKokkosResilience::VeloCMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(112): here
instantiation of "void TestVelocMemoryBackend_veloc_mem_Test<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestVelocMemoryBackend_veloc_mem_Test<gtest_TypeParam_>::~TestVelocMemoryBackend_veloc_mem_Test() [with gtest_TypeParam_=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_TestKokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "__nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char *, const testing::internal::CodeLocation &, const char *, const char *, int, const std::vector<std::_cxx11::string, std::allocatorstd::__cxx11::string> &) [with Fixture=TestVelocMemoryBackend, TestSel=testing::internal::TemplateSel<TestVelocMemoryBackend_veloc_mem_Test>, Types=gtest_type_params_TestVelocMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(97): here

2 errors detected in the compilation of "/tmp/tmpxft_00010e8b_00000000-6_TestVelocMemoryBackend.cpp1.ii".
make[2]: *** [tests/CMakeFiles/resilience_tests.dir/TestVelocMemoryBackend.cpp.o] Error 1
make[1]: *** [tests/CMakeFiles/resilience_tests.dir/all] Error 2
make: *** [all] Error 2
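
Both Trace.hpp errors report begin_trace being instantiated with Context=const char [9], which suggests a character array (most likely a string literal) is being deduced as the context argument. The sketch below only illustrates that class of error and does not identify the actual cause in Trace.hpp. `DemoContext`, `context_is_class`, and `demo_check` are hypothetical names, and the 8-character literal is arbitrary.

```cpp
#include <string>
#include <type_traits>

// Hypothetical context type with a member function, standing in for
// whatever a tracing helper expects of its Context argument.
struct DemoContext {
  std::string label() const { return "demo"; }
};

// Deduces Context from an lvalue argument, mirroring the
// (Context &, Args &&...) shape shown in the log above.
template <class Context>
constexpr bool context_is_class(Context &) {
  return std::is_class<Context>::value;
}

// An 8-character string literal has type const char[9], which is not a
// class type, so an expression such as `ctx.member()` on it cannot compile.
static_assert(!std::is_class<const char[9]>::value,
              "a char array is not a class type");

inline bool demo_check() {
  DemoContext ctx;
  // True for the class-type context, false for the string literal.
  return context_is_class(ctx) && !context_is_class("abcdefgh");
}
```

If this is what happens here, checking which overload nvcc selects at AutomaticCheckpoint.hpp(132) would be a reasonable next step.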
