Tf example dask #30

Merged 27 commits into master from tf-example-dask on May 12, 2020.

Changes shown are from 13 of the 27 commits.

Commits
67eb3b7
added dask script to repo
zaccharieramzi Apr 9, 2020
9febbce
separated mnist tf example in 2 parts: one click and the other not
zaccharieramzi Apr 9, 2020
afe1217
corrected typo in dask script
zaccharieramzi Apr 9, 2020
fe6746f
made mnist example faster when using dask
zaccharieramzi Apr 9, 2020
f249666
corrected job name and clarified submit in dask script
zaccharieramzi Apr 9, 2020
4af4261
added a python gitignore to this specific example
zaccharieramzi Apr 9, 2020
8f90524
added job check at the end of mnist
zaccharieramzi Apr 9, 2020
5d31698
made the dask script a click script
zaccharieramzi Apr 9, 2020
69fa1db
corrected typo in dask script
zaccharieramzi Apr 9, 2020
f8f93e5
corrected typos in dask script + changed job name in case of multi gpu
zaccharieramzi Apr 9, 2020
8f9a31a
added dask example and multi gpu example in the readme
zaccharieramzi Apr 9, 2020
dba5c57
corrected job name for multi gpu
zaccharieramzi Apr 9, 2020
cf00644
reduced qos for slurm script
zaccharieramzi Apr 9, 2020
7554159
moved the gitignore file to root
zaccharieramzi Apr 9, 2020
dfbd8ff
changed comment about qos in slurm scripts
zaccharieramzi Apr 9, 2020
45fde29
masked gpus in multi gpu setting
zaccharieramzi Apr 9, 2020
9917bbb
made dask script use multiple sbatch rather than multiple dask workers
zaccharieramzi Apr 9, 2020
ae4a071
changed n gpu to n jobs in the client submit to have something more u…
zaccharieramzi Apr 10, 2020
b6990e3
added batch size as an arg to the training of the mnist example and us…
zaccharieramzi Apr 10, 2020
ebb0542
corrected dask script in batch sizes
zaccharieramzi Apr 10, 2020
37b2959
updated readme with new script organisation
zaccharieramzi Apr 10, 2020
e5aeaea
removed help in click argument
zaccharieramzi Apr 10, 2020
1b0a051
added module load exact python for the scheduler (and then the worker…
zaccharieramzi Apr 10, 2020
a52d125
removed the module loading from the dask extra because the python is …
zaccharieramzi May 7, 2020
8affb25
corrected the test partition to dev
zaccharieramzi May 7, 2020
8083b22
added some extra warning about python executable
zaccharieramzi May 7, 2020
f9e090a
Merge branch 'master' into tf-example-dask
lesteve May 12, 2020
139 changes: 139 additions & 0 deletions examples/tf/.gitignore
@@ -0,0 +1,139 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
24 changes: 22 additions & 2 deletions examples/tf/README.md
@@ -1,6 +1,6 @@
# Tensorflow example script

To run this script you will need to first install click in your environment.
To run the examples you will need to first install `click` in your environment.
```
module load python/3.7.5 &&\
pip install click
@@ -12,7 +12,27 @@ cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git
```

Finally you can just launch the batch job (single GPU) via:
## Classical examples

For the single GPU job you can do:
```
sbatch jean-zay-doc/examples/tf/mnist_submission_script.slurm
```

For the multi GPU job you can do:
```
sbatch jean-zay-doc/examples/tf/mnist_submission_script_multi_gpus.slurm
```

## Dask example

To run the dask example you will additionally need to install `dask-jobqueue` in your environment.
```
module load python/3.7.5 &&\
pip install click dask-jobqueue
```

You can then do:
```
python jean-zay-doc/examples/tf/dask_script.py
```
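
The script accepts the options defined in `dask_script.py`: `-n` for the number of GPUs on which to run in parallel, and `-s`/`--save` to save the trained models. For instance, to train on 2 GPUs and save the models:
```
python jean-zay-doc/examples/tf/dask_script.py -n 2 -s
```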
73 changes: 73 additions & 0 deletions examples/tf/dask_script.py
@@ -0,0 +1,73 @@
import click
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

from mnist_example import train_dense_model


@click.command()
@click.option(
'n_gpus',
'-n',
default=1,
help='The number of GPUs on which to run the mnist examples in parallel. Defaults to 1.',
type=int,
)
@click.option(
'save',
'-s',
'--save',
is_flag=True,
help='Whether you want to save the models or not',
)
def launch_dask_tasks(n_gpus, save):
assert 0 < n_gpus < 5, 'You need to request between 1 and 4 GPUs.'

job_name = 'dask_mnist_tf_example'
if n_gpus > 1:
job_name += '_multi_gpus'

cluster = SLURMCluster(
n_workers=n_gpus,
cores=n_gpus,
job_cpu=10,
memory='10GB',
job_name=job_name,
walltime='1:00:00',
interface='ib0',
job_extra=[
f'--gres=gpu:{n_gpus}',
'--qos=qos_gpu-dev',
'--distribution=block:block',
lesteve (Member) commented:

I am curious about --distribution and nomultithread below; have you ever tried not using them to see what happens?

My guess is that it should not make a huge difference, but we copy-paste them from the Jean Zay documentation because we are not sure ...

zaccharieramzi (Author) replied:

Yes, that is my feeling too. I feel like these examples are a bit too simple to test that, though; some bad behaviours might only show up in longer runs.

lesteve (Member) replied:

By the way, I asked by email and they told me that --distribution=block:block can be removed now. On the other hand, --hint=nomultithread is necessary ...

'--hint=nomultithread',
'--output=%x_%j.out',
],
env_extra=[
'module purge',
'module load tensorflow-gpu/py3/2.1.0',
],
extra=[f'--resources GPU={n_gpus}'],
)

print(cluster.job_script())

client = Client(cluster)
futures = [client.submit(
# function to execute
train_dense_model,
# *args
None, save,
# this function has potential side effects
pure=not save,
lesteve (Member) commented:

Out of curiosity again, can you elaborate a bit more on why this function has potential side effects?

zaccharieramzi (Author) replied:

So I took the definition of a pure function from here, and in their understanding a side effect could be something like writing a log (or, in this case, the weights of an NN).

lesteve (Member) replied:

OK, fair enough; I think you read the doc too much ... pure only matters if you are calling .submit with the same function and the same arguments. Dask has a caching mechanism where it may not recompute the results (see https://docs.dask.org/en/latest/caching.html).

But now I am a bit confused: I thought train_dense_model would be called with different parameters (e.g. learning rate), and it seems like it is called with the same parameters ... Is there a use case for this? Am I missing something?

zaccharieramzi (Author) replied:

Oh yes, this is purely for example's sake; in real life you would indeed call it with different arguments. I just didn't want to complicate things too much, but I can try using different arguments if you think it would be clearer.

lesteve (Member) replied:

Different arguments would be better IMO, because it would look more like a real-life example and be easier to relate to as a DL practitioner.
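
To make the pure/caching behaviour discussed in this thread concrete, here is a minimal sketch, independent of this PR; the `slow_square` function and the local `Client` are illustrative assumptions, not part of the example scripts:

```python
from dask.distributed import Client

client = Client()  # small local cluster, just for demonstration

def slow_square(x):
    return x ** 2

# pure=True (the default): identical function + arguments hash to the
# same task key, so dask computes the result only once and reuses it.
f1 = client.submit(slow_square, 3)
f2 = client.submit(slow_square, 3)
assert f1.key == f2.key  # second submit reuses the first result

# pure=False: every submit gets a unique key and is recomputed, which is
# what you want when the function has side effects (e.g. saving weights).
f3 = client.submit(slow_square, 3, pure=False)
assert f3.key != f1.key
print(f1.result())  # 9
```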

resources={'GPU': 1},
) for _ in range(n_gpus)]
job_result = client.gather(futures)
if all(job_result):
print('All jobs finished without errors')
else:
print('One job errored out')
print('Shutting down dask workers')


if __name__ == '__main__':
launch_dask_tasks()
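
As a follow-up to the review thread above, a hedged sketch of what the submit loop could look like with different arguments per task; the extra `batch_size` argument of `train_dense_model` is hypothetical at this point in the PR (a later commit, "added batch size as an arg to the training of the mnist example", moves in this direction):

```python
# Sketch of the "different arguments" variant suggested in the review.
# Assumes the `client` built in launch_dask_tasks above; the batch_size
# argument of train_dense_model is hypothetical at this stage of the PR.
batch_sizes = [32, 64, 128, 256]  # one task per batch size
futures = [
    client.submit(
        train_dense_model,
        None,        # cuda_visible_devices, as in the original loop
        save,
        batch_size,  # hypothetical extra argument
        resources={'GPU': 1},
    )
    for batch_size in batch_sizes
]
# The arguments now differ between submits, so the pure/caching question
# from the thread above no longer arises.
job_result = client.gather(futures)
```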
20 changes: 14 additions & 6 deletions examples/tf/mnist_example.py
@@ -1,10 +1,6 @@
# all taken from https://www.tensorflow.org/guide/keras/functional
import os

import click
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


@click.command()
@click.option(
@@ -20,8 +16,19 @@
is_flag=True,
help='Whether you want to save the model or not',
)
def train_dense_model_click(cuda_visible_devices, save):
return train_dense_model(cuda_visible_devices, save)


def train_dense_model(cuda_visible_devices, save):
# limit imports outside the call to the function, in order to launch quickly
# when using dask
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

if cuda_visible_devices is not None:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = cuda_visible_devices
# model building
tf.keras.backend.clear_session() # For easy reset of notebook state.
@@ -55,6 +62,7 @@ def train_dense_model(cuda_visible_devices, save):
# saving
if save:
model.save(os.environ['SCRATCH'])
return True

if __name__ == '__main__':
train_dense_model()
train_dense_model_click()
1 change: 1 addition & 0 deletions examples/tf/mnist_submission_script.slurm
@@ -10,6 +10,7 @@
#SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=tf_mnist%j.out # output file name
#SBATCH --error=tf_mnist%j.out # error file name
#SBATCH --qos=qos_gpu-t4 # it's a testy job
lesteve (Member) commented (Apr 9, 2020):

What do you mean by testy (you know the different queues on Jean Zay better than I do)? Maybe not this?

[image]

So maybe this is a queue dedicated to testing (or something like this)?

zaccharieramzi (Author) replied:

No, I didn't mean that, haha; more "test-y", as in relating to tests in an informal fashion. I will change it to something better suited.

zaccharieramzi (Author) added:

How does that work for you?


set -x
cd $WORK/jean-zay-doc/examples/tf
1 change: 1 addition & 0 deletions examples/tf/mnist_submission_script_multi_gpus.slurm
@@ -10,6 +10,7 @@
#SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=tf_mnist_multi_gpus%j.out # output file name
#SBATCH --error=tf_mnist_multi_gpus%j.out # error file name
#SBATCH --qos=qos_gpu-t4 # it's a testy job
zaccharieramzi marked this conversation as resolved.

set -x
cd $WORK/jean-zay-doc/examples/tf