Tf example dask #30
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
pip-wheel-metadata/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
cover/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
.pybuilder/ | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
# For a library or package, you might want to ignore these files since the code is | ||
# intended to run in multiple environments; otherwise, check them in: | ||
# .python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# Cython debug symbols | ||
cython_debug/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
import click | ||
from dask.distributed import Client | ||
from dask_jobqueue import SLURMCluster | ||
|
||
from mnist_example import train_dense_model | ||
|
||
|
||
@click.command() | ||
@click.option( | ||
'n_gpus', | ||
'-n', | ||
default=1, | ||
help='The number of GPUs on which to run the mnist examples in parallel. Defaults to 1.', | ||
type=int, | ||
) | ||
@click.option( | ||
'save', | ||
'-s', | ||
'--save', | ||
is_flag=True, | ||
help='Whether you want to save the models or not', | ||
) | ||
def launch_dask_tasks(n_gpus, save): | ||
assert 0 < n_gpus < 5, 'You need to request between 1 and 4 GPUs.' | ||
|
||
job_name = 'dask_mnist_tf_example' | ||
if n_gpus > 1: | ||
job_name += '_multi_gpus' | ||
|
||
cluster = SLURMCluster( | ||
n_workers=n_gpus, | ||
cores=n_gpus, | ||
job_cpu=10, | ||
memory='10GB', | ||
job_name=job_name, | ||
walltime='1:00:00', | ||
interface='ib0', | ||
job_extra=[ | ||
f'--gres=gpu:{n_gpus}', | ||
'--qos=qos_gpu-dev', | ||
'--distribution=block:block', | ||
'--hint=nomultithread', | ||
'--output=%x_%j.out', | ||
], | ||
env_extra=[ | ||
'module purge', | ||
'module load tensorflow-gpu/py3/2.1.0', | ||
], | ||
extra=[f'--resources GPU={n_gpus}'], | ||
) | ||
|
||
print(cluster.job_script()) | ||
|
||
client = Client(cluster) | ||
futures = [client.submit( | ||
# function to execute | ||
train_dense_model, | ||
# *args | ||
None, save, | ||
# this function has potential side effects | ||
pure=not save, | ||
Out of curiosity again, can you elaborate a bit more on why this function has potential side effects?

So I took the definition of a pure function from here, and in their understanding, a side effect could be something like writing a log (or in this case the weights of an NN).

OK fair enough, I think you read the doc too much ... But now I am a bit confused, I thought the function

Oh yes, this is purely for example's sake; in real life you would indeed call it with different arguments. I just didn't want to complicate things too much, but I can try to use different arguments if you think it'd be clearer.

Different arguments would be better IMO because it would look more like a real-life example and be easier to relate to as a DL practitioner.
||
resources={'GPU': 1}, | ||
) for _ in range(n_gpus)] | ||
job_result = client.gather(futures) | ||
if all(job_result): | ||
print('All jobs finished without errors') | ||
else: | ||
print('One job errored out') | ||
print('Shutting down dask workers') | ||
|
||
|
||
if __name__ == '__main__': | ||
launch_dask_tasks() |
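On the `pure=not save` thread above: as far as I understand the dask.distributed docs, a task submitted as pure is keyed deterministically from the function and its arguments, so identical submissions can collapse into a single task, while an impure task gets a fresh key each time. The sketch below only illustrates that idea with the standard library. `task_key` is a hypothetical helper, not dask's actual key function, and varying the first positional argument per task is an assumption made for illustration, since the diff does not show what that argument means in `train_dense_model`.

```python
import hashlib
import uuid


def task_key(func_name, args, pure=True):
    """Return an illustrative task key (hypothetical helper, not dask's own code).

    Pure tasks get a key derived from the function name and arguments, so
    identical submissions share one key. Impure tasks get a random key.
    """
    if pure:
        payload = repr((func_name, args)).encode()
        return f'{func_name}-{hashlib.sha1(payload).hexdigest()[:16]}'
    return f'{func_name}-{uuid.uuid4().hex[:16]}'


n_gpus = 4

# Same arguments, pure=True: every submission maps to the same key, so a
# scheduler that deduplicates by key would run the training only once.
pure_keys = {
    task_key('train_dense_model', (None, False), pure=True)
    for _ in range(n_gpus)
}

# Same arguments, pure=False: each submission is a distinct task.
impure_keys = {
    task_key('train_dense_model', (None, True), pure=False)
    for _ in range(n_gpus)
}

# Different arguments per task (assumed here to be a per-task value such as
# a seed): the keys are distinct even when the tasks are pure.
varied_keys = {
    task_key('train_dense_model', (seed, False), pure=True)
    for seed in range(n_gpus)
}

print(len(pure_keys), len(impure_keys), len(varied_keys))
```

If this reading is right, it also supports the suggestion to pass different arguments per task: the tasks would then be distinct regardless of the `pure` flag.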
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,7 @@ | |
#SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS) | ||
#SBATCH --output=tf_mnist%j.out # output file name | ||
#SBATCH --error=tf_mnist%j.out # error file name | ||
#SBATCH --qos=qos_gpu-t4 # it's a testy job | ||
No, I didn't mean that haha, more test-y as in relating to tests in an informal fashion. Will change to something better suited indeed.

How does that work for you?
||
|
||
set -x | ||
cd $WORK/jean-zay-doc/examples/tf | ||
|
I am curious about distribution and nomultithread below, have you ever tried not using them to see what happens? My guess is that it should not make a huge difference, but we copy and paste it from the Jean Zay documentation because we are not sure ...

Yes, that is my feeling too. I feel like these examples are a bit too simple to test that though; some bad behaviours could happen when running longer.

By the way, I asked by email and they told me that --distribution=block:block can be removed now. On the other hand, --hint=nomultithread is necessary ...
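Following up on that email answer, here is a minimal sketch of how the `job_extra` list in the dask launcher could be built so that `--distribution=block:block` is dropped by default and `--hint=nomultithread` is always kept. `build_job_extra` and its `keep_distribution` parameter are hypothetical names introduced for this example. The printed `#SBATCH` prefix only mimics what I would expect the generated job script to contain.

```python
def build_job_extra(n_gpus, qos='qos_gpu-dev', keep_distribution=False):
    """Build the extra Slurm directives passed as job_extra (hypothetical helper)."""
    if not 0 < n_gpus < 5:
        raise ValueError('You need to request between 1 and 4 GPUs.')
    directives = [
        f'--gres=gpu:{n_gpus}',
        f'--qos={qos}',
        # Reported as still necessary in the email answer quoted above.
        '--hint=nomultithread',
        '--output=%x_%j.out',
    ]
    if keep_distribution:
        # Reported as removable; only re-added on explicit request.
        directives.insert(2, '--distribution=block:block')
    return directives


job_extra = build_job_extra(2)
print('\n'.join(f'#SBATCH {directive}' for directive in job_extra))
```

The list returned here has the same shape as the `job_extra` argument in the launcher script, so it could be passed to `SLURMCluster` in its place, assuming the rest of the call stays unchanged.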