Tf example dask #30
Conversation
I did not even know you could have a
This looks good after a quick pass and is a lot better than what we currently have, i.e. nothing ... thanks a lot for working on this 😍
I mostly have a few questions below.
job_extra=[
    f'--gres=gpu:{n_gpus}',
    '--qos=qos_gpu-dev',
    '--distribution=block:block',
I am curious about --distribution and --hint=nomultithread below: have you ever tried not using them to see what happens?
My guess is that it should not make a huge difference, but we copied and pasted them from the Jean Zay documentation because we are not sure ...
Yes, that is my feeling too.
I feel like these examples are a bit too simple to test that, though; some bad behaviours might only show up in longer runs.
By the way, I asked by email and they told me that --distribution=block:block can be removed now. On the other hand, --hint=nomultithread is necessary ...
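For reference, the surviving flags could be assembled like this. A minimal sketch only: the helper name `make_job_extra`, the `dev` default, and the exact QoS string are assumptions for illustration, not part of the PR.

```python
# Hypothetical sketch: assembling the extra Slurm directives discussed above
# for a dask_jobqueue.SLURMCluster, after dropping --distribution=block:block.
def make_job_extra(n_gpus, dev=True):
    """Return the extra #SBATCH flags for one Dask worker job (assumed helper)."""
    extra = [
        f"--gres=gpu:{n_gpus}",   # request n_gpus GPUs per job
        "--hint=nomultithread",   # required on Jean Zay: one task per physical core
    ]
    if dev:
        extra.append("--qos=qos_gpu-dev")  # short development queue
    return extra

print(make_job_extra(1))
```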
# *args
None, save,
# this function has potential side effects
pure=not save,
Out of curiosity again, can you elaborate a bit more on why this function has potential side effects?
So I took the definition of a pure function from here, and in their understanding, a side effect could be something like writing a log (or in this case the weights of an NN).
OK, fair enough; I think you read the doc too closely ... pure only matters if you are using .submit with the same function and the same arguments. Dask has a caching mechanism where it may not recompute the results (see https://docs.dask.org/en/latest/caching.html).
But now I am a bit confused: I thought the function train_dense_model would be called with different parameters, e.g. the learning rate, yet it seems to be called with the same parameters ... is there a use case for this? Am I missing something?
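To make the caching point concrete, here is a small stdlib-only illustration (all names hypothetical; this is not Dask's implementation) of why a task with side effects must opt out of result reuse keyed on (function, arguments):

```python
# Toy stand-in for a result cache keyed on (function, arguments), which is
# roughly the assumption Dask makes for tasks submitted with pure=True.
calls = []

def train(lr, save):
    calls.append(lr)          # stands in for a side effect (writing weights)
    return lr * 2

cache = {}

def submit(fn, *args, pure=True):
    key = (fn.__name__, args)
    if pure and key in cache:
        return cache[key]     # pure task: reuse the earlier result, skip the call
    result = fn(*args)
    cache[key] = result
    return result

# Marked impure, the side effect runs on every submission:
submit(train, 0.1, True, pure=False)
submit(train, 0.1, True, pure=False)
print(len(calls))  # 2
```

Had the two submissions been marked pure, the second would have been served from the cache and the weights would have been written only once.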
Oh yes, this is purely for the example's sake; in real life you would call it with different arguments. I just didn't want to complicate things too much, but I can try using different arguments if you think it'd be clearer.
Different arguments would be better IMO: it would look more like a real-life example and be easier to relate to as a DL practitioner.
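The suggested shape could look roughly like the sketch below: map one training function over several hyperparameter values, as one would with client.map on a Dask cluster. ThreadPoolExecutor and the dummy train_dense_model body are stand-ins so the snippet runs anywhere; only the overall pattern is the point.

```python
# Sketch: submit the same function with different arguments (here, learning
# rates), so each task is genuinely distinct work.
from concurrent.futures import ThreadPoolExecutor

def train_dense_model(lr):
    # placeholder for the real training; returns a fake "accuracy"
    return 1.0 - lr

learning_rates = [1e-1, 1e-2, 1e-3]
with ThreadPoolExecutor() as ex:
    accuracies = list(ex.map(train_dense_model, learning_rates))
print(accuracies)
```

With a real Dask client this would be client.map(train_dense_model, learning_rates) followed by client.gather.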
@@ -10,6 +10,7 @@
#SBATCH --time=3:00:00           # maximum execution time (HH:MM:SS)
#SBATCH --output=tf_mnist%j.out  # output file name
#SBATCH --error=tf_mnist%j.out   # error file name
#SBATCH --qos=qos_gpu-t4         # it's a testy job
No, I didn't mean that haha; more "test-y", as in relating to tests in an informal fashion. Will change it to something better suited indeed.
how does that work for you?
examples/tf/dask_script.py
    'module load tensorflow-gpu/py3/2.1.0',
],
)
cluster.scale(n_gpus)
A few remarks:
- use .scale(jobs=something): in general I find it a lot more natural to think in terms of jobs rather than in terms of Dask workers
- also, I find .scale(n_gpus) a bit misleading ... I would use a parameter like n_jobs to make it clear that it could differ from the total number of GPUs used, for example if we used SLURMCluster(cores=2, ...).
sure, done!
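The jobs-vs-workers distinction above can be summed up in a few lines of arithmetic (stdlib only; the helper and parameter names are assumed for illustration): scaling by jobs starts Slurm jobs, and the worker and GPU totals follow from what each job is configured with, not from the scale argument alone.

```python
# Totals implied by scaling a hypothetical jobqueue cluster to n_jobs, where
# each Slurm job runs `processes_per_job` Dask workers and requests
# `gpus_per_job` GPUs (cf. SLURMCluster(cores=..., processes=...)).
def cluster_totals(n_jobs, processes_per_job, gpus_per_job):
    return {
        "workers": n_jobs * processes_per_job,
        "gpus": n_jobs * gpus_per_job,
    }

print(cluster_totals(4, 2, 1))  # {'workers': 8, 'gpus': 4}
```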
…ed it in the dask script instead of gpus
…going to be the same anyway
@lesteve I updated the PR with all your comments plus pushed some last corrections.
Thanks a lot for this, I am going to merge this one.
The tf mnist example done with dask, for greater flexibility. This implements #19 with the current mnist example.
I took advantage of this to add a .gitignore; tell me if I should move it into the examples directory or even to the root.
I also modified the QoS of the classical examples without dask so that they run faster.
I have tested the dask script in a multi-GPU setting and it appears to work, but I honestly don't know for sure whether it will scale or whether it's correct.
At this stage I am only (slightly) confident in the single-GPU setting.