Torch CUDA package does not work on gpu-1-5 and gpu-2-7 #293
Comments
Hm.
But gpu-2-7 is actually in use right now:
Ah, you are in fact running on |
On |
@jchodera Oh, I was testing on |
Sorry, I do not quite understand. You mean |
Yes! The GPUs are renumbered in reverse order if It may be simpler just to get rid of |
Thank you. But on ... Let me try reversing the order on gpu-2-7 or gpu-1-5 first.
Export |
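Assuming the environment variable under discussion is CUDA_VISIBLE_DEVICES (the one echoed as 3,2,1,0 in the original report further down), the two alternatives being weighed would look roughly like this:

    # Drop the variable entirely and let CUDA enumerate the devices itself
    unset CUDA_VISIBLE_DEVICES
    # ...or export it in ascending order instead of the reversed 3,2,1,0
    export CUDA_VISIBLE_DEVICES=0,1,2,3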
What is the full command you use to run docker? And where is the docker image?
I am back around and monitoring but don't have much to add. I show several qlua processes on gpu-1-4 running as @jchodera via his docker. Note that gpu-1-5 is GTX 680 based and gpu-2-7 is TITAN based.
I run the docker image kaixhin/cuda-torch. Other details are in my first comment. The docker image should be on those nodes we mentioned.
The command is for 4 GPUs, as mentioned here:
This actually worked fine for me on
Am I to understand this works on the other GPU nodes? The reason I ask is we've seen some cases where the GPUs require an rmmod/modprobe reset under some conditions to clear their brain, as it were. If you wanted, I could issue a reservation on the GPUs on one of these nodes and do so... just to see... although as I finish typing, @jchodera has cast some doubt that that is the problem ;)
This is what I saw a while ago. Very strange.
Did you unset |
Yes, I did.
No, I meant |
@tatarsky
No, it does not work either.
OK, I see the same thing as you now:
Weird.
OK. So my suggestion is we reserve the GPUs on gpu-1-5 and rmmod/modprobe the driver.
The reason is that the GPUs are now in thread-exclusive mode:
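For reference, a standard way to confirm the compute mode on one of these nodes is to query it with nvidia-smi (a generic check, not necessarily the exact output referenced above):

    # Show the compute mode (Default, Exclusive_Thread, Exclusive_Process, or
    # Prohibited) reported for every GPU on the node
    nvidia-smi -q -d COMPUTE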
For more information: I have been running docker jobs on those two nodes in interactive mode. The first time everything was OK, but the second time I ran into this problem.
Are you exiting the interactive shell? Because I don't believe the GPUs will reset unless you do... checking the prolog script.
Maybe some command I used exits the interactive shell? I did not do this intentionally. |
It sounds like the GPUs are being set to thread-exclusive mode and you are somehow trying to use them from two different threads or processes at the same time. Try changing your GPU request to this:
Note the |
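A sketch of such a request, assuming the cluster's Torque setup accepts a GPU mode modifier on the gpus resource (the shared modifier and its placement here are assumptions, not necessarily the exact change suggested above):

    # Hypothetical variant of the original qsub request, asking for the GPUs in
    # shared (default) compute mode instead of exclusive mode
    qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4:shared -N QLOGIN -q gpu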
@jchodera has a better suggestion.
Mine was going to be to use nvidia-smi to alter the above, but it makes more sense for the scheduler to do so.
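For reference, the nvidia-smi route mentioned above would be along these lines (run as root on the node; treat it as a sketch, not the exact command intended):

    # Put GPU 0 back into the default (shared) compute mode; repeat for each GPU,
    # or drop the -i flag to apply the setting to all GPUs on the node
    nvidia-smi -i 0 -c 0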
I guess the two qlua processes are stuck here.
Please have your PI contact me to talk about this effort and its requirements.
Also, this may be helpful in debugging this. As far as docker is concerned, I believe it's because this docker image is still "running".
@tatarsky let's talk about it in the next few days... best on the phone.
You know where to find me. I believe the container and the still-running qlua can be cleaned up with:
There doesn't appear to be much security in that, as it appears the users in group docker can issue that. @LijieTu, please try that on the systems where you see containers running as your account (you will have to ssh over to the node and do a docker ps to get the proper container id). The real question, however, is why stray docker containers are being left around, which may involve the process not getting sent a proper termination signal.
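A sketch of the kind of cleanup being described, assuming you have ssh'd to the node and your account is in the docker group (the container id is a placeholder taken from the docker ps output):

    # List running containers and pick out the stale one belonging to your job
    docker ps
    # Force-remove it; this also kills the qlua process still running inside
    docker rm -f <container-id>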
@tatarsky I just closed all the containers that I'm sure belong to me. Now the ... So probably this time the situation is different? I remember that last time only you had the authority to clean up the qlua run by root.
I don't recall if I checked docker ps last time. I used lsof. If docker ps does the right thing and allows you to clean up a "disconnected container", for lack of a better term, that's the right way to go. However, I don't like the fact that processes are left out on nodes after Torque has (I assume) killed them due to walltime. Something is not right there and I am looking for solutions with Adaptive.
I see. Next time I shall check with docker ps first whether those containers can be killed there. Yes, the two stuck ones mentioned are due to the walltime limit. It would be ideal if the docker container could exit after walltime. Thanks.
I am trying to explain to Adaptive what is being seen, and whether it's a docker matter or a Torque matter. Somewhere a signal is not getting honored.
My tests using a 30-second walltime suggest the "-t" flag is the primary culprit when running interactively. If you leave off -t, it seems the qsub walltime signals do get through. So I suspect the allocation of the pseudo-terminal in the LXC container is eating the signal somewhere. To see what I am seeing:
once on a node:
In another shell, watch your container with docker ps. There also appears to be a way to save out the cid, which a custom epilog script could perhaps use to clean up if you really require the -t; I'll look at that shortly. So I suspect the "real cure" is to either run docker in a mode where it executes a script in the container non-interactively, avoid -t, or be prepared to go clean up.
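A rough reconstruction of that experiment (the 30-second walltime and the qsub arguments are illustrative, not the exact commands used):

    # 1. Grab a short interactive GPU job so the walltime expires quickly
    qsub -I -l nodes=1:ppn=1,walltime=00:00:30,gpus=1 -q gpu
    # 2a. With a pseudo-terminal (-t): the container tends to linger after the
    #     walltime kill
    docker run -it kaixhin/cuda-torch bash
    # 2b. Without -t: the walltime signal appears to reach the container and it exits
    docker run -i kaixhin/cuda-torch bash
    # 3. From another shell on the node, watch whether the container goes away
    docker ps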
You definitely want to leave out the -t. Here is an example of the script I am currently using:
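A minimal sketch of such a wrapper, reusing the kaixhin/cuda-torch image and the device flags from the original report; this is illustrative only, not the actual script referenced above, and the Torch one-liner at the end is just a placeholder:

    #!/bin/bash
    # Run a Torch command inside the CUDA container non-interactively (no -t),
    # so Torque's termination signals get through, and remove the container
    # when the command finishes (--rm).
    docker run -i --rm \
      --device /dev/nvidiactl:/dev/nvidiactl \
      --device /dev/nvidia-uvm:/dev/nvidia-uvm \
      --device /dev/nvidia0:/dev/nvidia0 \
      kaixhin/cuda-torch \
      th -e "require 'torch'; require 'cutorch'; print(cutorch.getDeviceCount())"

Passing the Torch work as a single command line like this is what lets the container exit, and get cleaned up, on its own when the job ends.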
Yes, I believe the issue with the "stray" dockers is due to his use of the active queue and qsub -I (see up a few comments for his example) and running docker interactively. Your script looks like a very nice wrapper if the torch qlua items can be passed on a single line. Perhaps that's a wiki add for CUDA+docker! I've also determined a user epilog script can be used to "clean up", but it requires the docker run to output its cid, which can be done with docker run --cidfile=(some file the epilog script reads). Finally, door #3: several cluster threads from various folks reference cleanup scripts that base their decision to kill docker containers on the output of:
which returns several examples of what I suspect are detached containers out there. I will not attempt a cleanup script at this time but will add it to the possible-need list.
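A rough sketch of the --cidfile plus user-epilog idea; the file location, the use of $PBS_JOBID, the epilog wiring, and myscript.lua are all assumptions for illustration:

    # In the job script: record the container id at launch time
    # (the cid file must not already exist, or docker run will refuse to start)
    docker run --cidfile="$HOME/.docker-cids/$PBS_JOBID" -i kaixhin/cuda-torch th myscript.lua

    # In the user epilog script: force-remove the container if it is still around
    CIDFILE="$HOME/.docker-cids/$PBS_JOBID"
    if [ -f "$CIDFILE" ]; then
        docker rm -f "$(cat "$CIDFILE")" 2>/dev/null
        rm -f "$CIDFILE"
    fi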
@tatarsky: All excellent ideas!
My main annoyance (and it may be something simple I've not figured out) is determining the UID that launched the container. It has to be something simple, so I'm just missing it.
Thank you!
@LijieTu I show you probably have a few "bash"-only docker torch instances still out there, and I can clean them up or you can. I am working on a shell of a cleaner script, but until I can better determine the uid of the exec to cross-check with Torque, I am doing so manually.
@tatarsky |
If these look "stale" (>24h), I think you can clean these. I'm not using |
Presumably long-running jobs would not just drop into |
Correct. I show these purely sitting there in containers running bash, which may become another "selector" for cleaning docker images. Private git item coming.
Just to throw this in when talking about selectors for cleaning up long-running jobs: there is also a daemon mode for docker images that would be useful if I wanted to run a tiny server application on each node that provides information on a certain high port to my application. Just wanted to make sure such use cases are taken into consideration, and that these instances, which don't do much in the background, are not removed...
Are they via the scheduler?
I don't have them currently but was thinking about having them. Maybe there is also a way around the scheduler if these processes have very strict resource limits.
Then at present I would need to have some way of knowing that a long-running docker container was "your app". I am only looking to deal with obviously idle containers, which I currently judge by the output of docker top showing only a shell with no run time on it. I assume that if you do a "docker top" on your item, it won't read "bash with 00:00:00" of compute time. Note clearly: I don't kill anything at this time. I am only investigating how we can manage this a bit more sanely.
Here, to be clear, is what I consider "obviously idle":
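For concreteness, the check described above amounts to something like the following, where the container id is a placeholder and "obviously idle" means the only process listed is a shell with essentially no accumulated CPU time:

    # List candidate containers on the node
    docker ps
    # Show what a given container is actually running; an idle one shows only
    # something like bash with 00:00:00 of CPU time
    docker top <container-id>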
Ok, thanks. Just wanted to throw this in. Will try it in the future and report back if any problems occur.
It would be great to hear from all the docker users about how you are using docker on hal and how we can most easily have zombie docker processes cleaned up! I guess the advantage of docker is that idle resource usage is normally minimal, but binding up GPUs quickly becomes problematic.
I believe most items in here now overlap with other issues, so I'm going to close this and continue to monitor the other ones. See #299 for particular matters involving the qlua usage. As always, you may disagree with my desire to close and reopen ;)
I tried running the Torch CUDA package on some nodes, but something weird happens.
1. Start an interactive GPU job:
qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4 -N QLOGIN -q gpu
2. Run the kaixhin/cuda-torch docker image:
docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash
3. Type in "qlua" or "th".
4. Run:
require 'torch'; require 'cutorch'
For gpu-1-5 and gpu-2-7, I got a message like:
cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20
I checked CUDA and it should be fine:
lijie@gpu-1-5 ~ > echo $CUDA_VISIBLE_DEVICES
3,2,1,0
This situation only happens on these two nodes; I tried the same on gpu-2-5 with no errors.