Handle GPUs that lack full NVML Support #16

Open
ax3l opened this issue Oct 11, 2018 · 5 comments
@ax3l
Member

ax3l commented Oct 11, 2018

Nvidia NVML does not support non-Tesla products very well. Problems are known with mobile cards and even Quadro cards. (Reported as an RFE to Nvidia as Bug ID 2417658.)

Anyway, this can lead to cuda_memtest throwing an [NVML] Error: Not supported (in nvmlDeviceGetSerial) exception which we should catch.
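
One way to tolerate the missing query could look like the sketch below. It is only an illustration under stated assumptions: `Status`, `SerialResult`, and `serialOrFallback` are hypothetical names that do not exist in cuda_memtest, and `Status` stands in for NVML's `nvmlReturn_t` (with `NotSupported` playing the role of `NVML_ERROR_NOT_SUPPORTED`) so that the example compiles without NVML installed. The idea is to treat "not supported" as a non-fatal result with a placeholder serial, while still throwing on any other failure.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for nvmlReturn_t, so the sketch builds without NVML.
enum class Status { Success, NotSupported, Unknown };

// Hypothetical result of a serial-number query (status plus value).
struct SerialResult
{
    Status status;
    std::string serial;
};

// Runs the query and returns the serial number. A "not supported" status is
// tolerated and mapped to the fallback string; any other failure still throws.
std::string serialOrFallback(const std::function<SerialResult()>& query,
                             const std::string& fallback = "unknown")
{
    assert(query && "query callable must be set");

    const SerialResult result = query();
    if(result.status == Status::Success)
        return result.serial;
    if(result.status == Status::NotSupported)
        return fallback; // e.g. mobile or Quadro cards without full NVML support

    throw std::runtime_error("[NVML] Error: serial query failed");
}
```

With real NVML, the callable would wrap `nvmlDeviceGetSerial` and translate its return code; only the "not supported" case is swallowed, so genuine errors are not hidden.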

@ax3l ax3l added the bug label Oct 11, 2018
@ax3l ax3l self-assigned this Oct 11, 2018
@ax3l ax3l changed the title NVML Support Handle GPUs that lack full NVML Support Oct 11, 2018
@berceanu

berceanu commented Oct 26, 2018

Testing on a GTX 950M, I get this while running PIConGPU:

</home/berceanu/src/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-7.3.0/picongpu-0.4.0-lqbxwsudtgms2do4ksm57uovvv4ypx4e/thirdParty/cuda_memtest/misc.cpp>:35

It seems to be just a warning, as the simulation completes after that.

Disabling the memtest makes it go away:

pic-build -b "cuda:50" -c "-DCUDAMEMTEST_ENABLE=OFF"

Should we add a known issue to the docs for non-Tesla cards?

@ax3l
Member Author

ax3l commented Oct 26, 2018

Thx for the report! Can you please post the warning? Is there a line missing?

@ax3l ax3l reopened this Oct 26, 2018
@berceanu

Nope, there is only that single line.

@ax3l
Member Author

ax3l commented Nov 2, 2018

Ah ok, but it does not abort, yes!

OK, we have to clean up that macro; it should not randomly start writing to cerr:

cuda_memtest/cuda_memtest.h

Lines 146 to 150 in 7a585d5

#define NVML_CHECK(cmd) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"<"<<__FILE__<<">:"<<__LINE__<<std::endl; throw std::runtime_error(std::string("[NVML] Error: ") + std::string(nvmlErrorString(returnVal)));}}
#define NVML_CHECK_MSG(cmd,msg) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"<"<<__FILE__<<">:"<<__LINE__<<msg<<std::endl; throw std::runtime_error(std::string("[NVML] Error: ") + std::string(nvmlErrorString(returnVal)));}}
#define NVML_CHECK_NO_EXCEP(cmd) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"[NVML] Error: <"<<__FILE__<<">:"<<__LINE__<<std::endl;}}
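
As a rough illustration of what "not writing to cerr" could mean for a throwing check, the sketch below moves the file and line into the exception message instead of printing them separately. This is an assumption about one possible cleanup, not the change that was actually proposed. `DemoReturn`, `demoErrorString`, `demoCheck`, and `DEMO_CHECK` are hypothetical stand-ins for `nvmlReturn_t`, `nvmlErrorString`, and `NVML_CHECK`, chosen so the example compiles without NVML.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for nvmlReturn_t and its values.
enum DemoReturn { DEMO_SUCCESS = 0, DEMO_ERROR_NOT_SUPPORTED = 3 };

// Hypothetical stand-in for nvmlErrorString.
inline const char* demoErrorString(DemoReturn r)
{
    return r == DEMO_ERROR_NOT_SUPPORTED ? "Not supported" : "Unknown error";
}

// Throws on failure and carries the source location inside the exception
// message, so nothing is printed unless the caller decides to report it.
inline void demoCheck(DemoReturn r, const char* file, int line)
{
    assert(file != nullptr);
    if(r != DEMO_SUCCESS)
    {
        throw std::runtime_error(std::string("[NVML] Error: ") + demoErrorString(r)
                                 + " <" + file + ">:" + std::to_string(line));
    }
}

// do/while(false) makes the macro behave like a single statement.
#define DEMO_CHECK(cmd) do { demoCheck((cmd), __FILE__, __LINE__); } while(false)
```

A caller that catches the exception and chooses to ignore it then produces no output at all, while a caller that reports `what()` still sees where the failure happened.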

@ax3l
Member Author

ax3l commented Nov 2, 2018

I proposed a fix in #18 that should remove that noisy line from your output.
It can (rightfully) be ignored.
