Handle GPUs that lack full NVML Support #16

ax3l · 2018-10-11T10:32:14Z

Nvidia NVML does not support non-Tesla products very well. Problems are known with mobile cards and even Quadro cards. (Reported as an RFE to Nvidia as Bug ID 2417658.)

Anyway, this can lead to cuda_memtest throwing an "[NVML] Error: Not supported" exception (raised in nvmlDeviceGetSerial), which we should catch.
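
One way to handle this could look like the following minimal sketch. It is not the cuda_memtest code: QueryStatus, SerialQuery and serialOrPlaceholder are hypothetical stand-ins so that the example compiles without the NVML headers. With real NVML, the status would be the nvmlReturn_t returned by nvmlDeviceGetSerial, and the interesting value is NVML_ERROR_NOT_SUPPORTED.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for nvmlReturn_t (real values: NVML_SUCCESS,
// NVML_ERROR_NOT_SUPPORTED, ...), so the sketch builds without NVML.
enum class QueryStatus { Success, NotSupported, OtherError };

// Hypothetical wrapper type: fills `serial` and returns a status, the way a
// thin wrapper around nvmlDeviceGetSerial might.
using SerialQuery = std::function<QueryStatus(std::string&)>;

// Return the serial number, or a placeholder if the GPU does not support
// the query. Any other failure is still reported as an exception.
std::string serialOrPlaceholder(const SerialQuery& query)
{
    std::string serial;
    switch (query(serial))
    {
    case QueryStatus::Success:
        return serial;
    case QueryStatus::NotSupported:
        // The serial is optional information: do not abort the run.
        return "unknown";
    default:
        throw std::runtime_error("[NVML] Error: serial query failed");
    }
}
```

The design choice here is to treat only the "not supported" status as benign, so that genuine NVML failures still surface as exceptions.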


berceanu · 2018-10-26T11:11:35Z

Testing on a GTX 950M, I get this while running PIConGPU:

</home/berceanu/src/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-7.3.0/picongpu-0.4.0-lqbxwsudtgms2do4ksm57uovvv4ypx4e/thirdParty/cuda_memtest/misc.cpp>:35

It seems to be just a warning, as the simulation completes after that.

It seems that disabling the memtest fixes it:

pic-build -b "cuda:50" -c "-DCUDAMEMTEST_ENABLE=OFF"

Should we add a known issue in the docs for non-Tesla cards?

ax3l · 2018-10-26T12:52:53Z

Thx for the report! Can you please post the warning? Is there a line missing?

berceanu · 2018-10-26T13:59:07Z

Nope, there is only that single line.

ax3l · 2018-11-02T13:02:54Z

Ah ok, but it does not abort, yes!

Ok, we have to clean up that macro; it should not randomly start writing to cerr:

cuda_memtest/cuda_memtest.h

Lines 146 to 150 in 7a585d5

    
    #define NVML_CHECK(cmd) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"<"<<__FILE__<<">:"<<__LINE__<<std::endl; throw std::runtime_error(std::string("[NVML] Error: ") + std::string(nvmlErrorString(returnVal)));}}

    #define NVML_CHECK_MSG(cmd,msg) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"<"<<__FILE__<<">:"<<__LINE__<<msg<<std::endl; throw std::runtime_error(std::string("[NVML] Error: ") + std::string(nvmlErrorString(returnVal)));}}

    #define NVML_CHECK_NO_EXCEP(cmd) {nvmlReturn_t returnVal = cmd; if(returnVal!=NVML_SUCCESS){std::cerr<<"[NVML] Error: <"<<__FILE__<<">:"<<__LINE__<<std::endl;}}
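
For illustration, a throwing check that stays silent on cerr could carry the file and line inside the exception message instead. This is only a sketch under assumptions and not necessarily the change that was made in the repository: SketchReturn, sketchErrorString and SKETCH_CHECK are hypothetical stand-ins for nvmlReturn_t, nvmlErrorString and NVML_CHECK, so that the example compiles without the NVML headers.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so the sketch compiles without NVML headers.
enum SketchReturn { SKETCH_SUCCESS = 0, SKETCH_ERROR_NOT_SUPPORTED = 3 };

inline const char* sketchErrorString(SketchReturn r)
{
    return r == SKETCH_ERROR_NOT_SUPPORTED ? "Not supported" : "Unknown error";
}

// Throwing variant: file and line travel inside the exception message
// instead of being printed to std::cerr as a side effect. The caller
// decides whether and where the message is shown.
#define SKETCH_CHECK(cmd)                                            \
    do {                                                             \
        SketchReturn returnVal = (cmd);                              \
        if (returnVal != SKETCH_SUCCESS) {                           \
            std::ostringstream what;                                 \
            what << "[NVML] Error: " << sketchErrorString(returnVal) \
                 << " <" << __FILE__ << ">:" << __LINE__;            \
            throw std::runtime_error(what.str());                    \
        }                                                            \
    } while (0)
```

A caller that catches the std::runtime_error then sees both the error text and the location in what(), and nothing is printed when the exception is handled deliberately.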

ax3l · 2018-11-02T13:11:28Z

I proposed a fix in #18 that should remove that noisy line from your output.
It can (rightfully) be ignored.

ax3l added the bug label Oct 11, 2018

ax3l mentioned this issue Oct 11, 2018

picongpu run on workstation ComputationalRadiationPhysics/picongpu#2704

Closed

ax3l self-assigned this Oct 11, 2018

ax3l mentioned this issue Oct 11, 2018

NVML: Handle Unsupported GPU Serial #17

Merged

psychocoderHPC closed this as completed in #17 Oct 11, 2018

ax3l changed the title ~~NVML Support~~ Handle GPUs that lack full NVML Support Oct 11, 2018

ax3l reopened this Oct 26, 2018

ax3l mentioned this issue Nov 2, 2018

NVML Checks: stderr Writing #18

Merged

Cherishty mentioned this issue Nov 29, 2018

Failed to collect metrics: nvml: Not Supported BugRoger/nvidia-exporter#3

Open
