
Torch CUDA package does not work on gpu-1-5 and gpu-2-7 #293

Closed · LijieTu opened this issue Jul 29, 2015 · 141 comments

LijieTu commented Jul 29, 2015

I tried running the Torch CUDA package on some nodes, but something weird happens:

  1. First, reserve a node:

qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4 -N QLOGIN -q gpu

  2. Then run the docker image kaixhin/cuda-torch:

docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash

  3. Start "qlua" or "th".
  4. require 'torch'; require 'cutorch'

On gpu-1-5 and gpu-2-7, I get a message like:
cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20

I checked CUDA, and it looks fine:
lijie@gpu-1-5 ~ > echo $CUDA_VISIBLE_DEVICES
3,2,1,0

This only happens on these two nodes; I tried the same thing on gpu-2-5 and got no errors.
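
A quick host-side sanity check before launching docker might look like this (a sketch; it only confirms what the driver and scheduler expose on the host, not what cutorch sees inside the container):

# On the host, before docker run: confirm the device nodes and the scheduler's GPU assignment
echo "$CUDA_VISIBLE_DEVICES"   # set by the Torque prolog, e.g. 3,2,1,0
ls -l /dev/nvidia*             # nvidiactl, nvidia-uvm and nvidia0..nvidia3 should all exist
nvidia-smi -L                  # one line per physical GPU the driver can see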

@LijieTu LijieTu changed the title Torch CUDA package do not work on gpu-1-5 and gpu-2-7 Torch CUDA package does not work on gpu-1-5 and gpu-2-7 Jul 29, 2015
@jchodera (Member)

Hm. gpu-1-5 appears to show some residual memory used on GPU 0:

[chodera@gpu-1-5 ~]$ nvidia-smi
Wed Jul 29 16:43:17 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   41C    P0    N/A /  N/A |    392MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   40C    P0    N/A /  N/A |    365MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 30%   43C    P0    N/A /  N/A |    365MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 30%   41C    P0    N/A /  N/A |    365MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0            C+G   Not Supported                                         |
|    1            C+G   Not Supported                                         |
|    2            C+G   Not Supported                                         |
|    3            C+G   Not Supported                                         |
+-----------------------------------------------------------------------------+

but GPU-2-7 is actually in use right now:

[chodera@gpu-2-7 ~]$ nvidia-smi
Wed Jul 29 16:43:59 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 33%   49C    P0    70W / 250W |    432MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 35%   52C    P0    70W / 250W |    406MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 35%   53C    P0    70W / 250W |    406MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 36%   54C    P0    70W / 250W |    406MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     30604    C   qlua                                           415MiB |
|    1     30604    C   qlua                                           388MiB |
|    2     30604    C   qlua                                           388MiB |
|    3     30604    C   qlua                                           388MiB |
+-----------------------------------------------------------------------------+

@jchodera (Member)

Ah, you are in fact running on gpu-2-7 right now!

@jchodera (Member)

On gpu-2-7, what if you try setting CUDA_VISIBLE_DEVICES to 0,1,2,3 instead, or just clearing it altogether?

LijieTu (Author) commented Jul 29, 2015

@jchodera Oh, I was testing on gpu-2-7; I have exited now.

LijieTu (Author) commented Jul 29, 2015

Sorry, I do not quite understand. You mean 3,2,1,0 is different from 0,1,2,3?

@jchodera (Member)

Sorry, I do not quite understand. You mean 3,2,1,0 is different from 0,1,2,3?

Yes! The GPUs are renumbered in reverse order if CUDA_VISIBLE_DEVICES is set to 3,2,1,0. This may mean you need to reverse the order of the devices passed to docker.

It may be simpler just to get rid of CUDA_VISIBLE_DEVICES by unsetting that environment variable, or setting it to the proper order of 0,1,2,3.
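
For what it's worth, one way to read "reversing the order of the devices passed to docker" is to map host GPU 3 to the container's /dev/nvidia0 and so on, so the in-container numbering matches the reversed CUDA_VISIBLE_DEVICES. A sketch (untested on these nodes):

# Sketch: map host GPU 3 to container /dev/nvidia0, etc., so the container numbering follows 3,2,1,0
docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia3:/dev/nvidia0 --device /dev/nvidia2:/dev/nvidia1 --device /dev/nvidia1:/dev/nvidia2 --device /dev/nvidia0:/dev/nvidia3 kaixhin/cuda-torch bash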

LijieTu (Author) commented Jul 29, 2015

Thank you.

But on gpu-2-5 I tried echo $CUDA_VISIBLE_DEVICES and also got 3,2,1,0, and I can still run torch in docker there without errors.

Let me try reversing the order on 2-7 or 1-5 first.

LijieTu (Author) commented Jul 29, 2015

I exported CUDA_VISIBLE_DEVICES="0,1,2,3" on gpu-1-5 and opened docker, but still no luck.

@jchodera (Member)

What is the full command you use to run docker? And where is the docker image?

@tatarsky (Contributor)

I am back around and monitoring but don't have much to add. I see several qlua processes on gpu-1-4 belonging to @jchodera via his docker container. Note that gpu-1-5 is GTX 680-based and gpu-2-7 is TITAN-based.

LijieTu (Author) commented Jul 29, 2015

I run the docker image: kaixhin/cuda-torch
The command is:
docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash

Other details are in my first comment. The docker image should already be on the nodes we mentioned.

LijieTu (Author) commented Jul 29, 2015

The command is for 4 GPUs, as mentioned here:
https://registry.hub.docker.com/u/kaixhin/cuda-torch/

@jchodera (Member)

This actually worked fine for me on gpu-1-5 without CUDA_VISIBLE_DEVICES set:

[chodera@gpu-1-5 ~]$ docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash
root@0394cc95c8f2:~/torch# qlua
Unable to connect X11 server (continuing with -nographics)
Lua 5.1  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> require 'torch'; require 'cutorch'
> 

@tatarsky (Contributor)

Am I to understand this works on the other GPU nodes?

Reason I ask is we've seen some cases where the GPUs require an rmmod/modprobe reset under some conditions to clear their brains, as it were.

If you wanted, I could issue a reservation on the GPUs on one of these nodes and do so... just to see... although as I finish typing, @jchodera has cast some doubt that that is the problem ;)

LijieTu (Author) commented Jul 29, 2015

This is what I saw a while ago. Very strange.

lijie@mskcc-ln1 ~ > qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4 -N QLOGIN -q gpu
qsub: waiting for job 4692428.mskcc-fe1.local to start
qsub: job 4692428.mskcc-fe1.local ready

lijie@gpu-1-5 ~ > docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash

root@e1a85bb50ebe:~/torch# qlua
Unable to connect X11 server (continuing with -nographics)
Lua 5.1 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> require 'torch'
> require 'cutorch'

cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
    [C]: at 0x7f844c01d600
    [C]: in function 'require'
    /root/torch/install/share/lua/5.1/cutorch/init.lua:2: in main chunk
    [C]: in function 'require'
    [string "require 'cutorch'"]:1: in main chunk
    [C]: at 0x7f845419be10
> 

@jchodera (Member)

Did you unset CUDA_VISIBLE_DEVICES and still see this issue?

LijieTu (Author) commented Jul 29, 2015

Yes, I did.

I exported CUDA_VISIBLE_DEVICES="0,1,2,3" on gpu-1-5 and opened docker, but still no luck.

@jchodera (Member)

No, I meant unset CUDA_VISIBLE_DEVICES.

LijieTu (Author) commented Jul 29, 2015

@tatarsky
Everything is fine on gpu-2-5:

lijie@gpu-2-5 ~ > docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash
root@4820e33a2c5b:~/torch# qlua
Unable to connect X11 server (continuing with -nographics)
Lua 5.1  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> require 'torch'
> require 'cutorch'
>  

LijieTu (Author) commented Jul 29, 2015

lijie@mskcc-ln1 ~ > qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4 -N QLOGIN -q gpu
qsub: waiting for job 4692454.mskcc-fe1.local to start
qsub: job 4692454.mskcc-fe1.local ready

lijie@gpu-1-5 ~ > echo $CUDA_VISIBLE_DEVICES 
3,2,1,0
lijie@gpu-1-5 ~ > unset CUDA_VISIBLE_DEVICES
lijie@gpu-1-5 ~ > echo $CUDA_VISIBLE_DEVICES

lijie@gpu-1-5 ~ > docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash

root@dc3353405098:~/torch# qlua 
Unable to connect X11 server (continuing with -nographics)
Lua 5.1  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> require 'torch'
> require 'cutorch'
cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
    [C]: at 0x7fc5a50af600
    [C]: in function 'require'
    /root/torch/install/share/lua/5.1/cutorch/init.lua:2: in main chunk
    [C]: in function 'require'
    [string "require 'cutorch'"]:1: in main chunk
    [C]: at 0x7fc5ad22de10
> 

No, it does not work either

@jchodera (Member)

OK, I see the same thing as you now:

[chodera@gpu-1-5 ~/docker]$ docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash
root@310e508cc2db:~/torch# qlua
Unable to connect X11 server (continuing with -nographics)
Lua 5.1  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> require 'torch'
> require 'cutorch'
cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
    [C]: at 0x7ff7b43ec600
    [C]: in function 'require'
    /root/torch/install/share/lua/5.1/cutorch/init.lua:2: in main chunk
    [C]: in function 'require'
    [string "require 'cutorch'"]:1: in main chunk
    [C]: at 0x7ff7badd5e10
> 

Weird.

@tatarsky (Contributor)

OK. So my suggestion is we reserve the GPUs on gpu-1-5 and rmmod/modprobe the driver.
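
Roughly, that reset would look like this on an idle, reserved node (a sketch; requires root, and assumes the standard nvidia/nvidia_uvm module pair):

# run as root with no jobs on the node; nvidia_uvm has to come out before nvidia
rmmod nvidia_uvm
rmmod nvidia
modprobe nvidia
modprobe nvidia_uvm
nvidia-smi   # confirms the driver reloaded and the /dev/nvidia* nodes are back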

@jchodera (Member)

The reason is that the GPUs are now in thread-exclusive mode:

root@310e508cc2db:~/torch# nvidia-smi
Wed Jul 29 22:13:32 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   41C    P0    N/A /  N/A |    462MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   40C    P0    N/A /  N/A |    433MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 31%   43C    P0    N/A /  N/A |    433MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 30%   41C    P0    N/A /  N/A |    433MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0            C+G   Not Supported                                         |
|    1            C+G   Not Supported                                         |
|    2            C+G   Not Supported                                         |
|    3            C+G   Not Supported                                         |
+-----------------------------------------------------------------------------+

LijieTu (Author) commented Jul 29, 2015

For more information: I previously ran docker jobs on those two nodes in interactive mode. The first time everything was OK, but the second time I hit this problem.

@tatarsky (Contributor)

Are you exiting the interactive shell? Because I don't believe the GPUs will reset unless you do... checking the prolog script.

LijieTu (Author) commented Jul 29, 2015

Maybe some command I used exited the interactive shell? I did not do that intentionally.

@jchodera (Member)

It sounds like the GPUs are being set to thread-exclusive mode and you are somehow trying to use them from two different threads or processes at the same time.

Try changing your GPU request to this:

qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4:shared -N QLOGIN -q gpu

Note the shared keyword.

@tatarsky (Contributor)

@jchodera has a better suggestion.

@tatarsky (Contributor)

Mine was going to be to use nvidia-smi to alter the compute mode shown above, but it makes more sense for the scheduler to do so.
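
Concretely, that would have been something like this (a sketch; changing the compute mode needs root on the node):

# set each GPU back to DEFAULT (shared) compute mode instead of E. Thread
for i in 0 1 2 3; do nvidia-smi -i "$i" -c DEFAULT; done
nvidia-smi -q -d COMPUTE   # verify the new compute mode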

LijieTu (Author) commented Aug 5, 2015

lijie@gpu-2-13 ~ > ssh gpu-2-7 ps ax|grep lua 
30494 pts/3    Sl+  859:53 qlua
lijie@gpu-2-13 ~ > ssh gpu-2-7 pgrep -u root qlua
30494
lijie@gpu-2-13 ~ > ssh gpu-1-10 ps ax|grep lua 
28104 pts/2    Sl+  823:11 qlua
lijie@gpu-2-13 ~ > ssh gpu-1-10 pgrep -u root qlua
28104

I guess these two qlua processes are stuck.

tatarsky (Contributor) commented Aug 5, 2015

Please have your PI contact me to talk about this effort and its requirements.

tatarsky (Contributor) commented Aug 5, 2015

Also, this may be helpful in debugging this.

As far as docker is concerned, I believe it's because this docker container is still "running":

docker ps
8aab41ba0fb3        kaixhin/cuda-torch:latest   "bash"              3 weeks ago         Up 3 weeks                              determined_wright   

ratsch commented Aug 5, 2015

@tatarsky let's talk about it these days... best on the phone.

tatarsky (Contributor) commented Aug 5, 2015

You know where to find me. I believe the container and the still running qlua can be cleaned up with:

docker kill 8aab41ba0fb3

There doesn't appear to be much security around that, as it appears users in the docker group can issue it. @LijieTu, please try that on the systems you recognize as your own (you will have to ssh over to the node and run docker ps to get the proper container id).

The real question, however, is why stray docker containers are being left around, which may mean the process is not getting sent a proper termination signal.
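
In practice that cleanup is just, per node (a sketch; the container id below is the example from above, so substitute your own from docker ps):

ssh gpu-2-7 docker ps                 # find the container id of your stale kaixhin/cuda-torch container
ssh gpu-2-7 docker kill 8aab41ba0fb3  # example id from above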

LijieTu (Author) commented Aug 5, 2015

@tatarsky I just closed all the containers that I'm sure belong to me. Now gpu-2-7 and gpu-1-10 are clear.

So perhaps the situation is different this time? I remember last time only you had the authority to clean up the qlua processes run by root.

tatarsky (Contributor) commented Aug 5, 2015

I don't recall if I checked docker ps last time; I used lsof. If docker ps does the right thing and allows you to clean up a "disconnected container", for lack of a better term, that's the right way to go.

However, I don't like the fact that processes remain on nodes after Torque has, I assume, killed them due to walltime. Something is not right there and I am looking for solutions with Adaptive.

LijieTu (Author) commented Aug 5, 2015

I see. Next time I shall check with docker ps first whether those containers can be killed there.

Yes, the two stuck ones mentioned are due to the walltime limit. It would be ideal if the docker container could exit when the walltime is reached.

Thanks.

tatarsky (Contributor) commented Aug 5, 2015

I am trying to explain to Adaptive what is being seen, and whether it's a docker matter or a Torque matter. Somewhere a signal is not getting honored.

tatarsky (Contributor) commented Aug 5, 2015

My tests using a 30-second walltime point to the "-t" flag as the primary culprit when running interactively. If you leave off -t, the qsub walltime signals do seem to get through. So I suspect the allocation of the pseudo-terminal in the LXC container is eating the signal somewhere.

To see what I am seeing:

qsub -I -l nodes=1,walltime=00:00:30 -q active 

once on a node:

docker run -i ubuntu bash

In another shell, watch your container with docker ps; after 30 seconds your container will die.
Repeat with "-t" added: your container will not die when killed by the scheduler.

There also appears to be a way to save out the cid, which a custom epilog script could use to clean up if you really require -t. I'll look at that shortly.

So I suspect the "real cure" is to either run docker in a mode where it executes a script in the container non-interactively, or avoid -t, or be prepared to go clean up afterwards.
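
For the --cidfile route, a sketch of what the job script plus a per-user epilog might look like (the cidfile location is just a placeholder, and the ubuntu/sleep command stands in for the real workload):

# in the job script: record the container id and run without -i/-t
mkdir -p "$HOME/.docker-cids"
docker run --cidfile="$HOME/.docker-cids/$PBS_JOBID" ubuntu /bin/sh -c 'sleep 3600'

# in the user epilog script (Torque passes the job id as the first argument): kill whatever is left
CIDFILE="$HOME/.docker-cids/$1"
[ -f "$CIDFILE" ] && docker kill "$(cat "$CIDFILE")" 2>/dev/null
rm -f "$CIDFILE"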

jchodera (Member) commented Aug 5, 2015

You definitely want to leave out the -i and -t flags when running docker inside a batch queue script.

Here is an example of the script I am currently using:

#!/bin/bash
#
# Set low priority
#PBS -p -1024
#
# Array job: Run 10 WUs total, allowing 2 to run at a time.
#PBS -t 1-10%2
#
# Set a maximum wall time greater than the time per WU (or else no WUs will finish)
#PBS -l walltime=12:00:00
#
# Use the GPU queue
#PBS -q gpu
#
# Reserve one GPU
#PBS -l nodes=1:ppn=1:gpus=1:exclusive

# Set the project key here.
export PROJECT_KEY=10495

# Change to working directory
cd "$PBS_O_WORKDIR"

# Set CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=`cat $PBS_GPUFILE | awk -F"-gpu" '{ printf A$2;A=","}'`

# Run exactly one work unit
docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia${CUDA_VISIBLE_DEVICES}:/dev/nvidia0 jchodera/docker-fah-client /bin/sh -c 'cd fah && ./FAHClient --client-type=INTERNAL --project-key=$PROJECT_KEY --max-units=1'
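
A Torch analog of the same non-interactive pattern might be the single line below (a sketch: train.lua is a hypothetical script in the submission directory, and th may need its full /root/torch/install/bin/th path inside the image):

docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia${CUDA_VISIBLE_DEVICES}:/dev/nvidia0 -v "$PBS_O_WORKDIR:/workspace" kaixhin/cuda-torch th /workspace/train.lua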

tatarsky (Contributor) commented Aug 5, 2015

Yes, I believe the issue with the "stray" docker containers is due to his use of the active queue with qsub -I (see his example a few comments up) and running docker interactively.

Your script looks like a very nice wrapper if the torch/qlua items can be passed on a single line. Perhaps that's a wiki addition for CUDA+docker!

I've also determined that a user epilog script can be used to "clean up", but it requires the docker run to output its cid, which can be done with docker run --cidfile=(some file the epilog script reads).

Finally, door #3: several cluster threads from various folks reference cleanup scripts that base their decision to kill docker containers on the output of:

docker ps|grep weeks

That returns several examples of what I suspect are detached containers out there. I will not attempt a cleanup script at this time, but will add it to the possible-needs list.
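
The kind of cleanup those threads describe amounts to something like this (a sketch; it is blunt, so use with care):

docker ps | grep weeks | awk '{print $1}' | xargs -r docker kill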

jchodera (Member) commented Aug 5, 2015

@tatarsky: All excellent ideas!

tatarsky (Contributor) commented Aug 6, 2015

My main annoyance (and it may be something simple I've not figured out) is determining the UID that launched a container. It has to be something simple that I'm just missing.

LijieTu (Author) commented Aug 6, 2015

Thank you!
+1 for the wiki add about CUDA+docker.

tatarsky (Contributor) commented Aug 6, 2015

@LijieTu I see that you probably have a few "bash"-only docker torch instances still out there, and I can clean them up or you can. I am working on the shell of a cleaner script, but until I can better determine the UID of the exec to cross-check with Torque, I am doing so manually.

gpu-2-11: 1eaef222feaf        kaixhin/cuda-torch:latest           "bash"   
gpu-2-14: 8c1a1a26cb42        kaixhin/cuda-torch:latest           "bash"   
gpu-2-10: e2967f66092d        kaixhin/cuda-torch:latest           "bash"   
gpu-2-6: 5f13364b710d        kaixhin/cuda-torch:latest           "bash"    
gpu-2-6: f66f25eb1cd9        kaixhin/cuda-torch:latest           "bash"    
gpu-2-6: ff5c8636861b        kaixhin/cuda-torch:latest           "bash"    
gpu-2-13: 9d1b5be3c1b1        kaixhin/cuda-torch:latest           "bash"   
gpu-2-12: 569de30cf1c3        kaixhin/cuda-torch:latest           "bash"   

LijieTu (Author) commented Aug 6, 2015

@tatarsky
Yeah, I would appreciate it if you could clean them up. But before doing that, we just want to make sure no one else is running lua/torch on the nodes.

jchodera (Member) commented Aug 6, 2015

If these look "stale" (>24h), I think you can clean them up. I'm not using kaixhin/cuda-torch anymore.

jchodera (Member) commented Aug 6, 2015

Presumably long-running jobs would not just drop into bash but would instead ask docker to run the executable?

tatarsky (Contributor) commented Aug 6, 2015

Correct. I see these just sitting there as containers running bash, which may become another "selector" for cleaning up docker containers. Private git item coming.

akahles commented Aug 6, 2015

Just to throw this in while we're talking about selectors for cleaning up long-running jobs: there is also a daemon mode for docker containers, which would be useful if I wanted to run a tiny server application on each node that provides information on a certain high port to my application. I just wanted to make sure such use cases are taken into consideration, so that these instances, which don't do much in the background, are not removed...

tatarsky (Contributor) commented Aug 6, 2015

Are they via the scheduler?

akahles commented Aug 6, 2015

I don't have them currently but was thinking about adding them. Maybe there is also a way around the scheduler if these processes have very strict resource limits.

tatarsky (Contributor) commented Aug 6, 2015

Then, at present, I would need some way of knowing that a long-running docker container is "your app". I am only looking to deal with obviously idle containers.

Currently I judge that by the output of docker top showing only a shell with no run time on it.

I assume that if you do a "docker top" on your container, it won't read "bash" with 00:00:00 of compute time.

Note clearly: I don't kill anything at this time. I am only investigating how we can manage this a bit more sanely.

tatarsky (Contributor) commented Aug 6, 2015

Here, to be clear, is what I consider "obviously idle":

docker top 5f13364b710d
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                21212               16902               0                   Jul16               pts/6               00:00:00            bash

akahles commented Aug 6, 2015

Ok, thanks. Just wanted to throw this in. Will try in the future and report back if any problems should occur.

jchodera (Member) commented Aug 6, 2015 via email

@tatarsky (Contributor)

I believe most items in here now overlap with other issues, so I'm going to close this and continue to monitor the other ones.

#299 covers the particular matters involving the qlua usage.
#288 might make a good spot for people to comment on how they are using docker.

As always, you may disagree with my desire to close, and reopen ;)
