
IMPORTANT: please test batch-style jobs on cc27 if possible before additional nodes are added #414

Closed
tatarsky opened this issue May 16, 2016 · 21 comments

@tatarsky
Contributor

The Fuchs and SBIO groups have authorized adding their purchased nodes to the batch queue. This represents a considerable number of cores, a large amount of RAM, several GPUs, and a Tesla card.

We have node cc27 offlined to test that the Puppet process has put all needed items on it. We are asking you to validate it ASAP if possible so that it does not eat jobs. The unit also contains a Tesla and is marked as such in its node properties (tesla).

While we intend to deploy some variations of the items in #407, we will track them separately as they are implemented. We will start with the goal of getting these systems into the batch queue.

If no issues are found I will slowly add these nodes over the course of the day.
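
If you want to confirm the property is actually visible to the scheduler before relying on it, something along these lines should work (a sketch; the exact property list shown for cc27 is an assumption on my part):

pbsnodes cc27
# look for the "properties" line in the output, which should include tesla, e.g.:
#   properties = batch,gpu,tesla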

@tatarsky
Contributor Author

Going to add this node and others in an hour.

@tatarsky
Contributor Author

Adding. Please report problems to:

[email protected]

@tatarsky
Contributor Author

GPU oversubscription is not working right on cc27. I need to check the rules so that a gpu-queue job is allowed in even when all the batch slots are filled.
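
For context, the slot and GPU counts the scheduler sees come from the node's entry in the Torque nodes file; cc27's entry should look roughly like the line below (the np and gpus values here are illustrative guesses, not the actual configuration):

# /var/spool/torque/server_priv/nodes (illustrative)
cc27 np=48 gpus=1 batch gpu tesla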

@tatarsky reopened this May 17, 2016
@tatarsky
Contributor Author

OK. I think I fixed this now.

qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230098.hal-sched1.local to start
qsub: job 7230098.hal-sched1.local ready

[cc27 me ~]$ nvidia-smi
(stuff about the tesla)

Any tests that people could run to confirm this system is actually usable would be appreciated. I believe it is currently usable from both the gpu and batch queues.
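
For non-interactive testing, a batch script along these lines should also land you on the Tesla node (just a sketch; the job name and walltime are placeholders):

#!/bin/bash
#PBS -N tesla-test
#PBS -q gpu
#PBS -l nodes=1:ppn=1:tesla:gpus=1
#PBS -l mem=1gb
#PBS -l walltime=00:10:00

# confirm we landed on cc27 and can see the Tesla
hostname
nvidia-smi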

@jchodera
Member

I don't get cc27 when I request tesla. I erroneously get a Titan node instead:

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230614.hal-sched1.local to start
qsub: job 7230614.hal-sched1.local ready

[chodera@gpu-1-16 ~]$ nvidia-smi
Wed May 18 16:51:03 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   34C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   31C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   37C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   37C    P8    14W / 250W |    295MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3     22437    C   ...grlab/home/karaletsos/anaconda/bin/python   278MiB |
+-----------------------------------------------------------------------------+
[chodera@gpu-1-16 ~]$ 

@akahles

akahles commented May 18, 2016

I think you requested telsa instead of tesla, so the label might have been ignored.

@jchodera
Member

Whoops, that's because I cut-and-pasted a typo (telsa).

@jchodera
Member

[chodera@mskcc-ln1 ~]$ qsub -I -l notes=1:ppn=1:tesla:gpus=1 -l mem=1gb
qsub: submit error (Unknown resource type  Resource_List.notes)

@akahles

akahles commented May 18, 2016

OK, sorry about that; I had assumed the label was meant to be tesla.

@jchodera
Member

I think it's supposed to be tesla. I'm just not sure how to get it to work.
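
If pbsnodes on this install supports the :property form (I believe the Torque version here does, but treat this as a sketch), that is a quick way to see which nodes actually carry the label:

pbsnodes :tesla    # should list cc27
pbsnodes :telsa    # presumably lists nothing, since no node has that property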

@tatarsky
Contributor Author

Your second example has "notes" instead of "nodes".

@tatarsky
Contributor Author

My test still seems to take me there:

qsub -l nodes=1:ppn=1:gpus=1:shared:tesla -q gpu -I
qsub: waiting for job 7230761.hal-sched1.local to start
qsub: job 7230761.hal-sched1.local ready

[me@cc27 ~]$ 

@jchodera
Member

Aha! In the queue now!

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start

@tatarsky
Contributor Author

Sorry, I'm exiting now.

@jchodera
Member

Success!

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
qsub: job 7230765.hal-sched1.local ready

[chodera@cc27 ~]$ nvidia-smi
Wed May 18 17:11:47 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 23%   37C    P0    67W / 235W |     22MiB / 11519MiB |     99%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@tatarsky
Contributor Author

I show you on cc27 in "oversubscribe" slot 49, which is one more than the 48 batch jobs running there right now.

I am curious whether the GPU infrastructure is set up correctly there, though.
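
The quick way to see that from the outside is the node's job list (a sketch; the actual job IDs in a real run will differ):

pbsnodes cc27 | egrep 'state|np =|jobs ='
# the jobs line should show the 48 batch slots plus the extra gpu-queue slot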

@tatarsky
Contributor Author

As in the Torque GPU variables and such. It's been a while since I looked at all that.
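
For what it's worth, the main thing Torque itself should hand a GPU job is $PBS_GPUFILE; whether CUDA_VISIBLE_DEVICES also gets set depends on how the prologue/wrapper is configured, so that is the part worth checking (sketch):

# inside a job that requested gpus=1
echo $PBS_GPUFILE            # path to the file Torque writes for this job
cat $PBS_GPUFILE             # lines like cc27-gpu0 (<host>-gpu<index>)
echo $CUDA_VISIBLE_DEVICES   # should hold the matching index, e.g. 0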

@jchodera
Member

Testing on the GPU...

@jchodera
Member

CUDA_VISIBLE_DEVICES is correct:

[chodera@cc27 ~]$ setenv | grep CUDA_
CUDA_VISIBLE_DEVICES=0

It works!

[chodera@cc27 examples]$ python benchmark.py --help
Usage: benchmark.py [options]

Options:
  -h, --help            show this help message and exit
  --platform=PLATFORM   name of the platform to benchmark
  --test=TEST           the test to perform: gbsa, rf, pme, amoebagk, or
                        amoebapme [default: all]
  --pme-cutoff=CUTOFF   direct space cutoff for PME in nm [default: 0.9]
  --seconds=SECONDS     target simulation length in seconds [default: 60]
  --polarization=POLARIZATION
                        the polarization method for AMOEBA: direct,
                        extrapolated, or mutual [default: mutual]
  --mutual-epsilon=EPSILON
                        mutual induced epsilon for AMOEBA [default: 1e-5]
  --heavy-hydrogens     repartition mass to allow a larger time step
  --device=DEVICE       device index for CUDA or OpenCL
  --precision=PRECISION
                        precision mode for CUDA or OpenCL: single, mixed, or
                        double [default: single]
[chodera@cc27 examples]$ python benchmark.py --platform=CUDA --test=pme --seconds=60
Platform: CUDA
Precision: single

Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 35138 steps in 60.0692 seconds
101.081 ns/day

@tatarsky
Contributor Author

OK, cool. I need to validate that CUDA_VISIBLE_DEVICES also works under bash/sh. But I believe the queue infrastructure is now correct for both batch and gpu.
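
Something like this one-liner should cover the bash/sh case (a sketch; the output lands in the job's STDIN.o<jobid> file by default):

echo 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES' | \
    qsub -S /bin/bash -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu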

@tatarsky
Contributor Author

Looked ok to me. Closing.
