
IMPORTANT: please test batch-style jobs on cc27 if possible before additional nodes are added #414

Closed
tatarsky opened this issue May 16, 2016 · 21 comments

@tatarsky
Contributor

The Fuchs and SBIO groups have authorized adding their purchased nodes to the batch queue. This represents a considerable number of cores, a large amount of RAM, several GPUs, and a Tesla card.

We have node cc27 offlined to test that the Puppet process has put all needed items on it. We are asking you to validate it ASAP if possible so that it does not eat jobs. The unit also contains a Tesla and is marked as such in its node properties (tesla).

While we intend to deploy some variations of the items in #407, we will track them separately as they are implemented. We will start with the goal of getting these systems into the batch queue.

If no issues are found I will slowly add these nodes over the course of the day.
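
If you want to confirm the property is actually visible to the scheduler before relying on it, something along these lines should work (a sketch; the exact property list shown for cc27 is an assumption on my part):

pbsnodes cc27
# look for the "properties" line in the output, which should include tesla, e.g.:
#   properties = batch,gpu,tesla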

@tatarsky
Contributor Author

Going to add this node and others in an hour.

@tatarsky
Contributor Author

Adding. Please report problems to:

[email protected]

@tatarsky
Contributor Author

GPU oversubscription is not working right on cc27. I need to check the rules so that a gpu-queue job is allowed in even when all the batch slots are filled.
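
For context, the slot and GPU counts the scheduler sees come from the node's entry in the Torque nodes file; cc27's entry should look roughly like the line below (the np and gpus values here are illustrative guesses, not the actual configuration):

# /var/spool/torque/server_priv/nodes (illustrative)
cc27 np=48 gpus=1 batch gpu tesla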

@tatarsky reopened this May 17, 2016
@tatarsky
Contributor Author

OK. I think I fixed this now.

qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230098.hal-sched1.local to start
qsub: job 7230098.hal-sched1.local ready

[cc27 me ~]$ nvidia-smi
(stuff about the tesla)

Any tests that people could run to confirm this system is actually usable would be appreciated. I believe it is currently usable from both the gpu and batch queues.
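
For non-interactive testing, a batch script along these lines should also land you on the Tesla node (just a sketch; the job name and walltime are placeholders):

#!/bin/bash
#PBS -N tesla-test
#PBS -q gpu
#PBS -l nodes=1:ppn=1:tesla:gpus=1
#PBS -l mem=1gb
#PBS -l walltime=00:10:00

# confirm we landed on cc27 and can see the Tesla
hostname
nvidia-smi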

@jchodera
Member

I don't get cc27 when I request tesla. I erroneously get a Titan node instead:

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230614.hal-sched1.local to start
qsub: job 7230614.hal-sched1.local ready

[chodera@gpu-1-16 ~]$ nvidia-smi
Wed May 18 16:51:03 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   34C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   31C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   37C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   37C    P8    14W / 250W |    295MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3     22437    C   ...grlab/home/karaletsos/anaconda/bin/python   278MiB |
+-----------------------------------------------------------------------------+
[chodera@gpu-1-16 ~]$ 

@akahles

akahles commented May 18, 2016

I think you requested telsa instead of tesla, so the label might have been ignored.

@jchodera
Member

Whoops, that's because I cut-and-pasted a typo (telsa).

@jchodera
Member

[chodera@mskcc-ln1 ~]$ qsub -I -l notes=1:ppn=1:tesla:gpus=1 -l mem=1gb
qsub: submit error (Unknown resource type  Resource_List.notes)

@akahles

akahles commented May 18, 2016

OK, sorry about that; I had assumed the label was meant to be tesla.

@jchodera
Member

I think it's supposed to be tesla. I'm just not sure how to get it to work.
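
If pbsnodes on this install supports the :property form (I believe the Torque version here does, but treat this as a sketch), that is a quick way to see which nodes actually carry the label:

pbsnodes :tesla    # should list cc27
pbsnodes :telsa    # presumably lists nothing, since no node has that property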

@tatarsky
Contributor Author

Your second example has "notes" instead of "nodes".

@tatarsky
Contributor Author

My test still seems to take me there:

qsub -l nodes=1:ppn=1:gpus=1:shared:tesla -q gpu -I
qsub: waiting for job 7230761.hal-sched1.local to start
qsub: job 7230761.hal-sched1.local ready

[me@cc27 ~]$ 

@jchodera
Member

Aha! In the queue now!

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start

@tatarsky
Contributor Author

Sorry, I'm exiting now.

@jchodera
Member

Success!

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
qsub: job 7230765.hal-sched1.local ready

[chodera@cc27 ~]$ nvidia-smi
Wed May 18 17:11:47 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 23%   37C    P0    67W / 235W |     22MiB / 11519MiB |     99%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@tatarsky
Contributor Author

I show you on cc27 in "oversubscribe" slot 49, which is one more than the 48 batch jobs running there right now.

I am curious whether the GPU infrastructure is set up correctly there, though.
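
The quick way to see that from the outside is the node's job list (a sketch; the actual job IDs in a real run will differ):

pbsnodes cc27 | egrep 'state|np =|jobs ='
# the jobs line should show the 48 batch slots plus the extra gpu-queue slot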

@tatarsky
Contributor Author

As in the Torque GPU variables and such. It's been a while since I looked at all that.
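
For what it's worth, the main thing Torque itself should hand a GPU job is $PBS_GPUFILE; whether CUDA_VISIBLE_DEVICES also gets set depends on how the prologue/wrapper is configured, so that is the part worth checking (sketch):

# inside a job that requested gpus=1
echo $PBS_GPUFILE            # path to the file Torque writes for this job
cat $PBS_GPUFILE             # lines like cc27-gpu0 (<host>-gpu<index>)
echo $CUDA_VISIBLE_DEVICES   # should hold the matching index, e.g. 0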

@jchodera
Member

Testing on the GPU...

@jchodera
Member

CUDA_VISIBLE_DEVICES is correct:

[chodera@cc27 ~]$ setenv | grep CUDA_
CUDA_VISIBLE_DEVICES=0

It works!

[chodera@cc27 examples]$ python benchmark.py --help
Usage: benchmark.py [options]

Options:
  -h, --help            show this help message and exit
  --platform=PLATFORM   name of the platform to benchmark
  --test=TEST           the test to perform: gbsa, rf, pme, amoebagk, or
                        amoebapme [default: all]
  --pme-cutoff=CUTOFF   direct space cutoff for PME in nm [default: 0.9]
  --seconds=SECONDS     target simulation length in seconds [default: 60]
  --polarization=POLARIZATION
                        the polarization method for AMOEBA: direct,
                        extrapolated, or mutual [default: mutual]
  --mutual-epsilon=EPSILON
                        mutual induced epsilon for AMOEBA [default: 1e-5]
  --heavy-hydrogens     repartition mass to allow a larger time step
  --device=DEVICE       device index for CUDA or OpenCL
  --precision=PRECISION
                        precision mode for CUDA or OpenCL: single, mixed, or
                        double [default: single]
[chodera@cc27 examples]$ python benchmark.py --platform=CUDA --test=pme --seconds=60
Platform: CUDA
Precision: single

Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 35138 steps in 60.0692 seconds
101.081 ns/day

@tatarsky
Contributor Author

OK, cool. I need to validate that CUDA_VISIBLE_DEVICES also works under bash/sh. But I believe the queue infrastructure is now correct for both batch and gpu.
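
Something like this one-liner should cover the bash/sh case (a sketch; the output lands in the job's STDIN.o<jobid> file by default):

echo 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES' | \
    qsub -S /bin/bash -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu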

@tatarsky
Contributor Author

Looked ok to me. Closing.
