
"out of memory" on V100 #15

Open
ax3l opened this issue Mar 12, 2018 · 5 comments

@ax3l
Member

ax3l commented Mar 12, 2018

cuda_memtest seems to abort with "out of memory" (line 148 in cuda_memtest.cu) when run in a container (nvidia-docker1 and 2) on V100 GPUs.

The problem might be a general one or only triggered by PIConGPU; needs investigation. Maybe the GPU is just assigned multiple times from mpiInfo...

Occurred with a 4 & 8 GPU PIConGPU lwfa example on a DGX-1.

@RenaKunisaki

I have the same problem and I'm not using Docker:

~> ocl_memtest 
hostname is guilmon
CL_PLATFORM_NAME: 	NVIDIA CUDA
CL_PLATFORM_VERSION: 	OpenCL 1.2 CUDA 10.2.120
                  	Device 0 is CL_DEVICE_TYPE_GPU, "GeForce GTX 950"
allocated 340 Mbytes from device 0
[05/17/2019 15:33:40][guilmon][0]:Test0 [Walking 1 bit]
[05/17/2019 15:33:40][guilmon][0]:Test0: global walk test
ERROR: opencl call failed with rc(-5), line 39, file ocl_tests.cpp
Error: Out of resources

(Does that just mean the test failed?)

@psychocoderHPC
Member

@RenaKunisaki We never tested the OpenCL version of cuda_memtest.
Depending on the driver version, OpenCL is not able to allocate 100% of the main GPU memory.
Could you rerun your test with cuda_memtest?
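
(A minimal sketch, assuming the rc(-5) above is CL_OUT_OF_RESOURCES and that the limit being hit is the per-buffer allocation cap; the host code below is illustrative, not ocl_memtest itself.)

// Minimal sketch, not ocl_memtest: compare total device memory with the
// largest single buffer OpenCL reports it can allocate. In CL/cl.h,
// CL_OUT_OF_RESOURCES is -5, i.e. the rc(-5) shown above.
#include <cstdio>
#include <CL/cl.h>

int main() {
    cl_platform_id platform = nullptr;
    cl_device_id device = nullptr;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_ulong global_mem = 0, max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, nullptr);

    std::printf("global mem: %llu MiB, max single allocation: %llu MiB\n",
                (unsigned long long)(global_mem >> 20),
                (unsigned long long)(max_alloc >> 20));
    // On many drivers the max single allocation is only a fraction of the
    // total (the spec only requires >= 1/4 of global memory), so a test that
    // tries to grab nearly everything in one buffer can fail.
    return 0;
}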

@ax3l
Member Author

ax3l commented May 17, 2019 via email

@RenaKunisaki

I installed it from the Arch package (AUR) and I don't seem to have a cuda_memtest binary. I will try without X running, though.

@ax3l
Member Author

ax3l commented May 20, 2019

Oh, if you are using the AUR package (here?), it will pull in the legacy SourceForge version. We haven't seen much activity on that one in years, which is why we update and fix our own forked CUDA version here.

If you find updates to the OpenCL version we will gladly review and merge pull requests.
