
Tf example dask #30

Merged: 27 commits into jean-zay-users:master, May 12, 2020
Conversation

zaccharieramzi (Contributor)

The TF MNIST example done with Dask for greater flexibility.
This implements #19 with the current MNIST example.

I took advantage of this to add a .gitignore; tell me if I should move it to the examples directory or even to the root.

I also modified the QOS for the classical examples without Dask, in order to be able to run them faster.

I have tested the Dask script in a multi-GPU setting and it appears to be working, but I honestly don't know for sure whether it will scale or whether it's correct.
At this stage I am only (slightly) confident in the single-GPU setting.

@zaccharieramzi zaccharieramzi requested a review from lesteve April 9, 2020 14:23
@lesteve (Member) commented Apr 9, 2020

> I took advantage of this to add a .gitignore; tell me if I should move it to the examples directory or even to the root.

I did not even know you could have a .gitignore outside the root folder, but yes, move it to the root folder while you are at it.

@lesteve (Member) left a comment


This looks good after a quick pass and is a lot better than what we currently have, i.e. nothing ... thanks a lot for working on this 😍

I mostly have a few questions below.

```python
    job_extra=[
        f'--gres=gpu:{n_gpus}',
        '--qos=qos_gpu-dev',
        '--distribution=block:block',
```
@lesteve (Member)


I am curious about distribution and nomultithread below; have you ever tried not using them to see what happens?

My guess is that it should not make a huge difference, but we copied and pasted it from the Jean Zay documentation because we are not sure ...

@zaccharieramzi (Contributor, Author)


Yes, that is my feeling too.
I feel these examples are a bit too simple to test that, though; some bad behaviours might only show up on longer runs.

@lesteve (Member)


By the way, I asked by email and they told me that --distribution=block:block can be removed now. On the other hand, --hint=nomultithread is necessary ...
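So the job_extra list could be trimmed to something like this (a sketch only; the cores and memory values are placeholders, not taken from this PR):

```python
from dask_jobqueue import SLURMCluster

# Sketch: cores/memory are placeholder values, adapt them to your setup.
cluster = SLURMCluster(
    cores=1,
    memory='32GB',
    job_extra=[
        '--gres=gpu:1',
        '--qos=qos_gpu-dev',
        # '--distribution=block:block' can reportedly be dropped now
        '--hint=nomultithread',  # still necessary according to IDRIS support
    ],
)
```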

```python
    # *args
    None, save,
    # this function has potential side effects
    pure=not save,
```
@lesteve (Member)


Out of curiosity again, can you elaborate a bit more on why this function has potential side effects?

@zaccharieramzi (Contributor, Author)


So I took the definition of a pure function from here, and in their understanding a side effect could be something like writing a log (or, in this case, the weights of a neural network).

@lesteve (Member)


OK, fair enough, I think you read the doc too much ... pure only matters if you are using .submit with the same function and the same arguments. Dask has a caching mechanism where it may not recompute the results (see https://docs.dask.org/en/latest/caching.html).
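For illustration, here is a minimal sketch of what pure controls in dask.distributed (train below is a hypothetical stand-in, not the function from this PR):

```python
from dask.distributed import Client

client = Client()  # assumes a running (or local) cluster

def train(lr):
    # Stand-in for a real training function with side effects
    # (e.g. saving model weights to disk).
    return lr * 2

# With pure=True (the default), Dask hashes the function and its
# arguments into a deterministic key, so resubmitting the same call
# returns a future with the same key and the work may not run again:
f1 = client.submit(train, 0.1)
f2 = client.submit(train, 0.1)
assert f1.key == f2.key

# With pure=False, each submission gets a unique key and is always
# recomputed, which is the right choice when the call has side effects:
f3 = client.submit(train, 0.1, pure=False)
f4 = client.submit(train, 0.1, pure=False)
assert f3.key != f4.key
```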

But now I am a bit confused: I thought the function train_dense_model would be called with different parameters, e.g. the learning rate, and it seems like it is called with the same parameters ... is there a use case for this? Am I missing something?

@zaccharieramzi (Contributor, Author)


Oh yes, this is purely for the example's sake; in real life you would indeed call it with different arguments. I just didn't want to complicate things too much, but I can try having different arguments if you think it'd be clearer.

@lesteve (Member)


Different arguments would be better IMO, because it would look more like a real-life example and be easier to relate to as a DL practitioner. Something along the lines of the sketch below, for instance.
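(A sketch only; the lr parameter is an assumed extension of train_dense_model's signature, and None and save mirror the existing *args in this PR.)

```python
# Sketch: assumes train_dense_model is extended to accept a learning rate,
# so each future corresponds to a genuinely different computation.
learning_rates = [1e-4, 1e-3, 1e-2]
futures = [
    client.submit(
        train_dense_model,
        None, save, lr,
        pure=not save,
    )
    for lr in learning_rates
]
results = client.gather(futures)
```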

```diff
@@ -10,6 +10,7 @@
 #SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS)
 #SBATCH --output=tf_mnist%j.out # output file name
 #SBATCH --error=tf_mnist%j.out # error file name
+#SBATCH --qos=qos_gpu-t4 # it's a testy job
```
@lesteve (Member) commented Apr 9, 2020


What do you mean by testy (you know the different queues on Jean Zay better than I do)? Maybe not this?

[image]

So maybe this is a queue dedicated to testing (or something like this)?

@zaccharieramzi (Contributor, Author)


No, I didn't mean that haha; more "test-y", as in relating to tests in an informal fashion. I'll change it to something better suited indeed.

@zaccharieramzi (Contributor, Author)


How does that work for you?

examples/tf/mnist_submission_script_multi_gpus.slurm (outdated, resolved)
```python
        'module load tensorflow-gpu/py3/2.1.0',
    ],
)
cluster.scale(n_gpus)
```
@lesteve (Member)


A few remarks:

  • Use .scale(jobs=something); in general I find it a lot more natural to think in terms of jobs rather than in terms of Dask workers.
  • Also, I find the .scale(n_gpus) a bit misleading ... I would use a parameter like n_jobs to make it clear that it could be different from the total number of GPUs used, for example if we used SLURMCluster(cores=2, ...). See the sketch after this list.
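For example (n_jobs is just an illustrative name, not taken from this PR):

```python
n_jobs = 2  # number of SLURM jobs; not necessarily the number of GPUs
cluster.scale(jobs=n_jobs)
```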

@zaccharieramzi (Contributor, Author)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done!

@zaccharieramzi zaccharieramzi requested a review from lesteve April 10, 2020 17:22
@zaccharieramzi (Contributor, Author)

@lesteve I updated the PR with all your comments and pushed some final corrections.

@lesteve (Member) commented May 12, 2020

Thanks a lot for this, I am going to merge this one.

@lesteve lesteve merged commit 9d9d4ff into jean-zay-users:master May 12, 2020