IMPORTANT: please test batch-style jobs on cc27 if possible before additional nodes are added #414
Comments
Going to add this node and others in an hour.
Adding. Please report problems to:
GPU oversubscription is not working right for cc27. Need to check the rules to allow it even when all batch slots are filled.
OK. I think I fixed this now.
Any tests that people could do to confirm this system is actually usable would be appreciated. I believe it is currently usable from both the gpu and batch queues.
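For anyone who wants to help test, here is a minimal sketch of interactive submissions against both queues. The gpu-queue command is the one used successfully later in this thread; the batch-queue variant and the walltime are my assumptions, so adjust them to the site's conventions:
# Interactive test against the gpu queue (same command as later in this thread)
qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
# Interactive test against the batch queue on the same hardware class
# (queue name "batch" and the walltime are assumptions)
qsub -I -l nodes=1:ppn=1:tesla -l mem=1gb -l walltime=00:30:00 -q batch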
I don't get
I think you requested
Whoops, that's because I cut-and-pasted a typo (
[chodera@mskcc-ln1 ~]$ qsub -I -l notes=1:ppn=1:tesla:gpus=1 -l mem=1gb
qsub: submit error (Unknown resource type Resource_List.notes)
OK, sorry about that - I had assumed the label was meant to be
I think it's supposed to be
Your second example has "notes" instead of "nodes".
My test still seems to take me there:
Aha! In the queue now!
[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
Sorry, I'm exiting.
Success!
[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
qsub: job 7230765.hal-sched1.local ready
[chodera@cc27 ~]$ nvidia-smi
Wed May 18 17:11:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c Off | 0000:81:00.0 Off | 0 |
| 23% 37C P0 67W / 235W | 22MiB / 11519MiB | 99% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I show you on cc27 in "oversubscribe" slot "49", which is one more than the 48 batch jobs on there right now. I am curious whether the GPU infrastructure is set up correctly there, though.
As in the Torque GPU variables and such. It's been a while since I looked at all that.
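A rough sketch of what could be checked from inside a GPU job to confirm the Torque plumbing is in place. PBS_GPUFILE is only populated when Torque is built with GPU support, so treat the exact variables here as assumptions:
# Run inside an interactive GPU job on cc27
echo "Job:          $PBS_JOBID"
echo "GPU file:     $PBS_GPUFILE"           # file listing GPUs assigned to this job, if GPU support is enabled
cat "$PBS_GPUFILE" 2>/dev/null              # typically entries like cc27-gpu0
echo "CUDA devices: $CUDA_VISIBLE_DEVICES"  # should match the assigned GPU index
nvidia-smi -L                               # GPUs the node itself reports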
Testing on the GPU...
[chodera@cc27 ~]$ setenv | grep CUDA_
CUDA_VISIBLE_DEVICES=0
It works!
[chodera@cc27 examples]$ python benchmark.py --help
Usage: benchmark.py [options]
Options:
-h, --help show this help message and exit
--platform=PLATFORM name of the platform to benchmark
--test=TEST the test to perform: gbsa, rf, pme, amoebagk, or
amoebapme [default: all]
--pme-cutoff=CUTOFF direct space cutoff for PME in nm [default: 0.9]
--seconds=SECONDS target simulation length in seconds [default: 60]
--polarization=POLARIZATION
the polarization method for AMOEBA: direct,
extrapolated, or mutual [default: mutual]
--mutual-epsilon=EPSILON
mutual induced epsilon for AMOEBA [default: 1e-5]
--heavy-hydrogens repartition mass to allow a larger time step
--device=DEVICE device index for CUDA or OpenCL
--precision=PRECISION
precision mode for CUDA or OpenCL: single, mixed, or
double [default: single]
[chodera@cc27 examples]$ python benchmark.py --platform=CUDA --test=pme --seconds=60
Platform: CUDA
Precision: single
Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 35138 steps in 60.0692 seconds
101.081 ns/day
OK, cool. I need to validate that CUDA_VISIBLE_DEVICES works under bash/sh as well. But I believe the queue infrastructure is now correct for batch and gpu.
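The check above was done in tcsh (setenv); the bash/sh equivalent would simply be:
# bash/sh equivalent of the tcsh "setenv | grep CUDA_" check
env | grep CUDA_
echo "$CUDA_VISIBLE_DEVICES"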
Looked ok to me. Closing. |
The Fuchs and SBIO groups have authorized adding their purchased nodes to the batch queue. This represents a considerable number of cores, RAM, GPUs, and a Tesla card.
We have node cc27 offlined to test that the Puppet process has put all needed items on it. We are asking you to validate it there ASAP so that it does not eat jobs. The unit also contains a Tesla and is marked as such in its properties (tesla).
While we intend to deploy some variations of the items in #407, we will repeat them separately as they are implemented. We will start with the goal of getting systems into the batch queue.
If no issues are found, I will slowly add these nodes over the course of the day.
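For anyone testing, a minimal sketch of a batch-style job script that could be pointed at cc27 while it is being validated. The queue name, walltime, and job name here are my assumptions; the tesla property and resource strings follow the commands used elsewhere in this thread:
#!/bin/bash
#PBS -N cc27-test
#PBS -q batch
#PBS -l nodes=1:ppn=1:tesla:gpus=1
#PBS -l mem=1gb
#PBS -l walltime=00:10:00

cd "$PBS_O_WORKDIR"
hostname             # confirm the job actually landed on cc27
nvidia-smi           # confirm the Tesla is visible
env | grep CUDA_     # confirm the scheduler set CUDA_VISIBLE_DEVICES
Submit it with qsub and check the output file for the hostname and GPU listing.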