Torch CUDA package does not work on gpu-1-5 and gpu-2-7 #293
Comments
Hm.
But gpu-2-7 is actually in use right now:
Ah, you are in fact running on |
On |
@jchodera Oh, I was testing on |
Sorry, I do not quite understand. You mean |
Yes! The GPUs are renumbered in reverse order if It may be simpler just to get rid of |
Thank you. But on ... Let me try reversing the order on gpu-2-7 or gpu-1-5 first.
Export |
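Assuming the environment variable under discussion is CUDA_VISIBLE_DEVICES (the one echoed as 3,2,1,0 in the original report further down), the two alternatives being weighed would look roughly like this:

    # Drop the variable entirely and let CUDA enumerate the devices itself
    unset CUDA_VISIBLE_DEVICES
    # ...or export it in ascending order instead of the reversed 3,2,1,0
    export CUDA_VISIBLE_DEVICES=0,1,2,3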
What is the full command you use to run docker? And where is the docker image?
I am back around and monitoring but don't have much to add. I show several qlua processes on gpu-1-4 running as @jchodera via his docker. Note that gpu-1-5 is GTX 680 based and gpu-2-7 is TITAN based.
I run the docker image kaixhin/cuda-torch. Other details are in my first comment. The docker image should be on those nodes we mentioned.
The command is for 4 GPUs, as mentioned here:
This actually worked fine for me on
Am I to understand this works on the other GPU nodes? The reason I ask is we've seen some cases where the GPUs require an rmmod/modprobe reset under some conditions to clear their brain, as it were. If you wanted, I could issue a reservation on the GPUs on one of these nodes and do so... just to see... although as I finish typing, @jchodera has cast some doubt that that is the problem ;)
This is what I saw a while ago. Very strange.
Did you unset |
Yes, I did.
No, I meant |
@tatarsky
No, it does not work either.
OK, I see the same thing as you now:
Weird.
OK. So my suggestion is we reserve the GPUs on gpu-1-5 and rmmod/modprobe the driver.
The reason is that the GPUs are now in thread-exclusive mode:
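For reference, a standard way to confirm the compute mode on one of these nodes is to query it with nvidia-smi (a generic check, not necessarily the exact output referenced above):

    # Show the compute mode (Default, Exclusive_Thread, Exclusive_Process, or
    # Prohibited) reported for every GPU on the node
    nvidia-smi -q -d COMPUTE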
For more information: I have been running docker jobs on those two nodes in interactive mode. The first time everything was OK, but the second time I ran into this problem.
Are you exiting the interactive shell? Because I don't believe the GPUs will reset unless you do... checking the prolog script.
Maybe some command I used exits the interactive shell? I did not do this intentionally. |
It sounds like the GPUs are being set to thread-exclusive mode and you are somehow trying to use them from two different threads or processes at the same time. Try changing your GPU request to this:
Note the |
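A sketch of such a request, assuming the cluster's Torque setup accepts a GPU mode modifier on the gpus resource (the shared modifier and its placement here are assumptions, not necessarily the exact change suggested above):

    # Hypothetical variant of the original qsub request, asking for the GPUs in
    # shared (default) compute mode instead of exclusive mode
    qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4:shared -N QLOGIN -q gpu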
@jchodera has a better suggestion.
Mine was going to be to use nvidia-smi to alter the above, but it makes more sense for the scheduler to do so.
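For reference, the nvidia-smi route mentioned above would be along these lines (run as root on the node; treat it as a sketch, not the exact command intended):

    # Put GPU 0 back into the default (shared) compute mode; repeat for each GPU,
    # or drop the -i flag to apply the setting to all GPUs on the node
    nvidia-smi -i 0 -c 0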
I guess the two qlua processes are stuck here.
Please have your PI contact me to talk about this effort and its requirements.
Also, this may be helpful in debugging this. As far as docker is concerned, I believe it's because this docker image is still "running".
@tatarsky let's talk about it in the next few days... best on the phone.
You know where to find me. I believe the container and the still-running qlua can be cleaned up with:
There doesn't appear to be much security in that, as it appears the users in group docker can issue that. @LijieTu, please try that on the systems where you see containers running as your account (you will have to ssh over to the node and do a docker ps to get the proper container id). The real question, however, is why stray docker containers are being left around, which may involve the process not getting sent a proper termination signal.
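A sketch of the kind of cleanup being described, assuming you have ssh'd to the node and your account is in the docker group (the container id is a placeholder taken from the docker ps output):

    # List running containers and pick out the stale one belonging to your job
    docker ps
    # Force-remove it; this also kills the qlua process still running inside
    docker rm -f <container-id>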
@tatarsky I just closed all the containers that I'm sure belong to me. Now the ... So probably this time the situation is different? I remember that last time only you had the authority to clean up the qlua run by root.
I don't recall if I checked docker ps last time. I used lsof. If docker ps does the right thing and allows you to clean up a "disconnected container", for lack of a better term, that's the right way to go. However, I don't like the fact that processes are left out on nodes after Torque has (I assume) killed them due to walltime. Something is not right there and I am looking for solutions with Adaptive.
I see. Next time I shall check with docker ps first whether those containers can be killed there. Yes, the two stuck ones mentioned are due to the walltime limit. It would be ideal if the docker container could exit after walltime. Thanks.
I am trying to explain to Adaptive what is being seen, and whether it's a docker matter or a Torque matter. Somewhere a signal is not getting honored.
My tests using a 30-second walltime suggest the "-t" flag is the primary culprit when running interactively. If you leave off -t, it seems the qsub walltime signals do get through. So I suspect the allocation of the pseudo-terminal in the LXC container is eating the signal somewhere. To see what I am seeing:
once on a node:
In another shell, watch your container with docker ps. There also appears to be a way to save out the cid, which a custom epilog script could perhaps use to clean up if you really require the -t; I'll look at that shortly. So I suspect the "real cure" is to either run docker in a mode where it executes a script in the container non-interactively, avoid -t, or be prepared to go clean up.
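A rough reconstruction of that experiment (the 30-second walltime and the qsub arguments are illustrative, not the exact commands used):

    # 1. Grab a short interactive GPU job so the walltime expires quickly
    qsub -I -l nodes=1:ppn=1,walltime=00:00:30,gpus=1 -q gpu
    # 2a. With a pseudo-terminal (-t): the container tends to linger after the
    #     walltime kill
    docker run -it kaixhin/cuda-torch bash
    # 2b. Without -t: the walltime signal appears to reach the container and it exits
    docker run -i kaixhin/cuda-torch bash
    # 3. From another shell on the node, watch whether the container goes away
    docker ps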
You definitely want to leave out the -t. Here is an example of the script I am currently using:
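A minimal sketch of such a wrapper, reusing the kaixhin/cuda-torch image and the device flags from the original report; this is illustrative only, not the actual script referenced above, and the Torch one-liner at the end is just a placeholder:

    #!/bin/bash
    # Run a Torch command inside the CUDA container non-interactively (no -t),
    # so Torque's termination signals get through, and remove the container
    # when the command finishes (--rm).
    docker run -i --rm \
      --device /dev/nvidiactl:/dev/nvidiactl \
      --device /dev/nvidia-uvm:/dev/nvidia-uvm \
      --device /dev/nvidia0:/dev/nvidia0 \
      kaixhin/cuda-torch \
      th -e "require 'torch'; require 'cutorch'; print(cutorch.getDeviceCount())"

Passing the Torch work as a single command line like this is what lets the container exit, and get cleaned up, on its own when the job ends.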
Yes, I believe the issue with the "stray" dockers is due to his use of the active queue and qsub -I (see up a few comments for his example) and running docker interactively. Your script looks like a very nice wrapper if the torch qlua items can be passed on a single line. Perhaps that's a wiki add for CUDA+docker! I've also determined a user epilog script can be used to "clean up", but it requires the docker run to output its cid, which can be done with docker run --cidfile=(some file the epilog script reads). Finally, door #3: several cluster threads from various folks reference cleanup scripts that base their decision to kill docker containers on the output of:
which returns several examples of what I suspect are detached containers out there. I will not attempt a cleanup script at this time but will add it to the possible-need list.
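A rough sketch of the --cidfile plus user-epilog idea; the file location, the use of $PBS_JOBID, the epilog wiring, and myscript.lua are all assumptions for illustration:

    # In the job script: record the container id at launch time
    # (the cid file must not already exist, or docker run will refuse to start)
    docker run --cidfile="$HOME/.docker-cids/$PBS_JOBID" -i kaixhin/cuda-torch th myscript.lua

    # In the user epilog script: force-remove the container if it is still around
    CIDFILE="$HOME/.docker-cids/$PBS_JOBID"
    if [ -f "$CIDFILE" ]; then
        docker rm -f "$(cat "$CIDFILE")" 2>/dev/null
        rm -f "$CIDFILE"
    fi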
@tatarsky: All excellent ideas!
My main annoyance (and it may be something simple I've not figured out) is determining the UID that launched the container. It has to be something simple, so I'm just missing it.
Thank you!
@LijieTu I show you probably have a few "bash"-only docker torch instances still out there, and I can clean them up or you can. I am working on a shell of a cleaner script, but until I can better determine the uid of the exec to cross-check with Torque, I am doing so manually.
@tatarsky |
If these look "stale" (>24h), I think you can clean these. I'm not using |
Presumably long-running jobs would not just drop into |
Correct. I show these purely sitting there in containers running bash, which may become another "selector" for cleaning docker images. Private git item coming.
Just to throw this in when talking about selectors for cleaning up long-running jobs: there is also a daemon mode for docker images that would be useful if I wanted to run a tiny server application on each node that provides information on a certain high port to my application. Just wanted to make sure such use cases are taken into consideration, and that these instances, which don't do much in the background, are not removed...
Are they via the scheduler?
I don't have them currently but was thinking about having them. Maybe there is also a way around the scheduler if these processes have very strict resource limits.
Then at present I would need to have some way of knowing that a long-running docker container was "your app". I am only looking to deal with obviously idle containers, which I currently judge by the output of docker top showing only a shell with no run time on it. I assume that if you do a "docker top" on your item, it won't read "bash with 00:00:00" of compute time. Note clearly: I don't kill anything at this time. I am only investigating how we can manage this a bit more sanely.
Here, to be clear, is what I consider "obviously idle":
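For concreteness, the check described above amounts to something like the following, where the container id is a placeholder and "obviously idle" means the only process listed is a shell with essentially no accumulated CPU time:

    # List candidate containers on the node
    docker ps
    # Show what a given container is actually running; an idle one shows only
    # something like bash with 00:00:00 of CPU time
    docker top <container-id>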
Ok, thanks. Just wanted to throw this in. Will try it in the future and report back if any problems occur.
It would be great to hear from all the docker users about how you are using docker on hal and how we can most easily have zombie docker processes cleaned up! I guess the advantage of docker is that idle resource usage is normally minimal, but binding up GPUs quickly becomes problematic.
I believe most items in here now overlap with other issues, so I'm going to close this and continue to monitor the other ones. See #299 for particular matters involving the qlua usage. As always, you may disagree with my desire to close and reopen ;)
I tried running the Torch CUDA package on some nodes, but something weird happens.
1. Start an interactive GPU job:
qsub -I -l nodes=gpu-1-5:ppn=1,walltime=02:00:00,mem=2gb,vmem=2gb,pmem=2gb,gpus=4 -N QLOGIN -q gpu
2. Run the kaixhin/cuda-torch docker image:
docker run -it --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 kaixhin/cuda-torch bash
3. Type in "qlua" or "th".
4. Run:
require 'torch'; require 'cutorch'
For gpu-1-5 and gpu-2-7, I got a message like:
cuda runtime error (10) : invalid device ordinal at /root/torch/extra/cutorch/lib/THC/THCTensorRandom.cu:20
I checked CUDA and it should be fine:
lijie@gpu-1-5 ~ > echo $CUDA_VISIBLE_DEVICES
3,2,1,0
This situation only happens on these two nodes; I tried the same on gpu-2-5 with no errors.