
A Set of Scripts to Reproduce Cedar Slowness Issues

To run, clone the repo somewhere on your Cedar $SCRATCH, cd into it, and then run:

time bash install.sh

All this does is create a virtual environment in .venv and install torch and torchvision, along with some other extremely standard scientific Python libs. On Cedar, this install can take anywhere from 5 to 20 minutes (likely because many files must be written during the process). This is also evidenced by the fact that setup tends to hang for a long time at:

Installing collected packages: iniconfig, typing_extensions, tqdm, tomli, threadpoolctl, pyparsing, py, pluggy, pillow-simd, numpy, joblib, attrs, torch, scipy, packaging, torchvision, scikit-learn, pytest

If you try the same install script on Graham, the .venv install will almost always finish in under 5 minutes.
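For reference, here is a minimal sketch of what install.sh amounts to. This is an assumption based on the description above (the actual script may pin versions or use a requirements file); the package names are taken from the pip output shown earlier:

# Hypothetical reconstruction of install.sh: a plain venv + pip workflow.
module load python/3.8     # assumed; Cedar provides Python via Lmod modules
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install numpy scipy scikit-learn torch torchvision pytest

An install like this writes a very large number of small files into .venv, which is exactly the access pattern at issue.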

The subsequent tar -cf venv.tar .venv archiving runs in a minute or two on Graham, but can take effectively forever (1+ hour) on Cedar. Since tar -cf mostly performs many small file reads (plus a single sequential write of the archive), this points to file reads as the problem.
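You can get a feel for how many small files are involved with a quick count (standard commands, nothing repo-specific):

# Each file in .venv is a separate metadata round trip for tar,
# and later for Python's import machinery.
find .venv -type f | wc -l
time tar -cf venv.tar .venv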

Before running the next script, feel free to read through really_slow.py. It does basically nothing and would run in under a second on your local machine. On Cedar, runtimes can be anywhere from 5 to 20 minutes. Commenting out unused imports makes it run faster on average; the more imports you have, the slower it gets. You can see time estimates by running:

bash slow_script.sh

You should see output like:

real	1m57.432s
user	0m2.235s
sys	0m1.133s
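slow_script.sh is presumably just timing a run of really_slow.py; you can reproduce the same pattern with no repo code at all (a sketch, assuming the imports alone are the bottleneck):

# The imports alone reproduce the delay.
time .venv/bin/python -c "import numpy, scipy, sklearn, torch, torchvision"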

To see how variable this is, run a bash/zsh loop to repeat it:

for x in 1 2 3 4 5 6 7 8 9 10; do bash slow_script.sh; done;

The results are highly variable and depend partly on the time of day, but here is a sample output of the above, run on a login node. Note that user and sys stay at a couple of seconds throughout; the extra minutes of real time are spent blocked on the filesystem:

real	1m57.432s
user	0m2.235s
sys	0m1.133s

real	0m3.642s
user	0m1.982s
sys	0m0.811s

real	0m3.640s
user	0m2.066s
sys	0m0.793s

real	0m2.841s
user	0m1.656s
sys	0m0.619s

real	0m3.423s
user	0m2.048s
sys	0m0.770s

real	2m25.478s
user	0m2.325s
sys	0m1.195s

real	2m43.339s
user	0m2.237s
sys	0m1.228s

real	0m3.731s
user	0m1.979s
sys	0m0.825s

real	0m3.302s
user	0m1.809s
sys	0m0.742s

real	0m3.277s
user	0m1.846s
sys	0m0.755s
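If you want to see where the time goes, one standard diagnostic (not something the repo itself does) is to summarize file-related syscalls during a single import:

# Summarize file-related syscalls for one import. Each open/stat is a
# round trip to the parallel filesystem.
strace -c -e trace=%file .venv/bin/python -c "import numpy"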

You can try this in a Cedar interactive job too; the results will be just as bad, maybe worse:

$ salloc --time=1:0:0 --ntasks=4 --mem-per-cpu=16G --account=def-jlevman

  salloc: Pending job allocation 36126573
  salloc: job 36126573 queued and waiting for resources
  salloc: job 36126573 has been allocated resources
  salloc: Granted job allocation 36126573
  salloc: Waiting for resource configuration
  salloc: Nodes cdr861 are ready for job

[dberger@cdr861 cedar_slow_test]$ for x in 1 2 3 4 5 6 7 8 9 10; do bash slow_script.sh; done;

  real	4m47.400s
  user	0m2.383s
  sys	0m2.722s

  real	4m56.302s
  user	0m2.306s
  sys	0m2.699s

  real	5m44.728s
  user	0m2.284s
  sys	0m2.544s

  real	6m50.440s
  user	0m2.253s
  sys	0m2.730s

  real	7m38.160s
  user	0m2.182s
  sys	0m2.409s

  real	7m52.240s
  user	0m2.204s
  sys	0m2.439s

  real	7m40.668s
  user	0m2.257s
  sys	0m2.292s

  real	7m27.859s
  user	0m2.264s
  sys	0m2.531s

  real	5m50.572s
  user	0m2.239s
  sys	0m2.389s
  salloc: Job 36126573 has exceeded its time limit and its allocation has been revoked

  slurmstepd: error: *** STEP 36126573.interactive ON cdr861 CANCELLED AT 2022-06-10T09:54:33 DUE TO TIME LIMIT ***
  srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
  srun: error: cdr861: task 0: Killed
  srun: launch/slurm: _step_signal: Terminating StepId=36126573.interactive
  srun: error: Timed out waiting for job step to complete

Maybe this is because, inside the interactive job, we have to reach across the network to the shared $SCRATCH filesystem for our .venv files? But we can show this is not the case in an interactive job with still_slow.sh, which stages everything into a temporary directory on the node first:

$ salloc --time=3:0:0 --ntasks=1 --mem-per-cpu=16G --account=def-jlevman

  salloc: Pending job allocation 36134385
  salloc: job 36134385 queued and waiting for resources
  salloc: job 36134385 has been allocated resources
  salloc: Granted job allocation 36134385

[dberger@cdr861 cedar_slow_test]$ bash still_slow.sh

  real	0m3.899s
  user	0m0.268s
  sys	0m3.014s
  Un-tarred .venv to temp dir. Time elapsed above.
  Copied single script file to temp dir. Time elapsed above.
  Now code is setup all in the temp directory:
  total 8.0K
  drwxr-x---  3 dberger dberger   41 Jun 10 10:14 .
  drwxr-xr-x 11 root    root     236 Jun 10 10:14 ..
  -rw-r-----  1 dberger dberger 5.9K Jun 10 10:14 really_slow.py
  drwxr-x---  4 dberger dberger   46 Jun 10 08:37 .venv

  real	8m29.602s
  user	0m1.898s
  sys	0m2.300s

  real	7m6.051s
  user	0m1.881s
  sys	0m2.490s

  real	6m8.209s
  user	0m1.842s
  sys	0m2.184s

  real	7m37.467s
  user	0m1.778s
  sys	0m2.081s

  real	6m24.056s
  user	0m1.877s
  sys	0m2.119s

  real	9m26.496s
  user	0m1.965s
  sys	0m2.275s

  real	6m47.930s
  user	0m1.957s
  sys	0m2.300s

^CTraceback (most recent call last):
  File "really_slow.py", line 9, in <module>
    import numpy
  File "/scratch/dberger/cedar_slow_test/.venv/lib/python3.8/site-packages/numpy/__init__.py", line 155, in <module>
    from . import random
  File "/scratch/dberger/cedar_slow_test/.venv/lib/python3.8/site-packages/numpy/random/__init__.py", line 180, in <module>
    from . import _pickle
  File "/scratch/dberger/cedar_slow_test/.venv/lib/python3.8/site-packages/numpy/random/_pickle.py", line 6, in <module>
    from ._generator import Generator
  File "<frozen importlib._bootstrap>", line 389, in parent
KeyboardInterrupt

# I terminated the job here; the point has been made. As usual, we are stuck in the imports.

If anything, the runtimes look even worse. Interestingly, the un-tar itself was quite fast, suggesting that it is reads, not writes, that are the problem.
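For reference, still_slow.sh has roughly this shape (a sketch reconstructed from the output above, assuming it un-tars into the node-local $SLURM_TMPDIR and loops over the script; see the repo for the real thing):

# Hypothetical reconstruction of still_slow.sh: stage everything onto
# node-local storage, then time repeated runs from there.
cd "$SLURM_TMPDIR"
time tar -xf "$SCRATCH/cedar_slow_test/venv.tar"
echo "Un-tarred .venv to temp dir. Time elapsed above."
cp "$SCRATCH/cedar_slow_test/really_slow.py" .
echo "Copied single script file to temp dir. Time elapsed above."
echo "Now code is setup all in the temp directory:"
ls -lah
for x in 1 2 3 4 5 6 7 8 9 10; do
    time .venv/bin/python really_slow.py
done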

By contrast, on Graham, the results of the still_slow.sh test are fast, as expected:

$ salloc --time=1:0:0 --ntasks=1 --mem-per-cpu=16GB --account=def-jlevman

  salloc: Pending job allocation 61912016
  salloc: job 61912016 queued and waiting for resources
  salloc: job 61912016 has been allocated resources
  salloc: Granted job allocation 61912016
  salloc: Waiting for resource configuration
  salloc: Nodes gra1136 are ready for job

[dberger@gra1136 cedar_slow_test]$ bash still_slow.sh

  real	0m3.089s
  user	0m0.206s
  sys	0m2.861s
  Un-tarred .venv to temp dir. Time elapsed above.
  Copied single script file to temp dir. Time elapsed above.
  Now code is setup all in the temp directory:
  total 20K
  drwxr-x---  3 dberger dberger 4.0K Jun 10 12:38 .
  drwxr-xr-x 36 root    root    4.0K Jun 10 12:37 ..
  -rw-r-----  1 dberger dberger 5.9K Jun 10 12:38 really_slow.py
  drwxr-x---  4 dberger dberger 4.0K Jun 10 12:22 .venv

  real	0m53.129s
  user	0m2.344s
  sys	0m1.873s

  real	0m2.657s
  user	0m1.467s
  sys	0m0.584s

  real	0m2.546s
  user	0m1.382s
  sys	0m0.576s

  real	0m2.524s
  user	0m1.405s
  sys	0m0.551s

  real	0m2.545s
  user	0m1.449s
  sys	0m0.517s

  real	0m2.573s
  user	0m1.435s
  sys	0m0.563s

  real	0m2.550s
  user	0m1.406s
  sys	0m0.563s

  real	0m2.530s
  user	0m1.418s
  sys	0m0.533s

  real	0m2.573s
  user	0m1.424s
  sys	0m0.550s

  real	0m2.633s
  user	0m1.426s
  sys	0m0.543s
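If you want to collect numbers like these across many runs, capturing just the wall-clock lines is easy (standard bash, nothing repo-specific):

# Keep only the real-time line from each of 10 runs.
for x in $(seq 10); do
    { time bash slow_script.sh ; } 2>&1 | grep '^real'
done | tee times.txt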
