
Tf example dask #30

Merged: 27 commits into jean-zay-users:master, May 12, 2020
Conversation

zaccharieramzi (Contributor)

The TF MNIST example done with Dask for greater flexibility.
This implements #19 with the current MNIST example.

I took advantage of this to add a .gitignore; tell me if I should move it to the examples directory or even to the root.

I also modified the QOS for the classical examples without Dask, in order to be able to run them faster.

I have tested the Dask script in a multi-GPU setting and it appears to be working, but I honestly don't know for sure whether it will scale or whether it's correct.
At this stage I am only (slightly) confident in the single-GPU setting.

@zaccharieramzi zaccharieramzi requested a review from lesteve April 9, 2020 14:23
@lesteve (Member) commented Apr 9, 2020

> I took advantage of this to add a .gitignore; tell me if I should move it to the examples directory or even to the root.

I did not even know you could have a .gitignore outside the root folder, but yes, move it to the root folder while you are at it.

@lesteve (Member) left a comment


This looks good after a quick pass and is a lot better than what we currently have, i.e. nothing ... thanks a lot for working on this 😍

I mostly have a few questions below.

```python
    job_extra=[
        f'--gres=gpu:{n_gpus}',
        '--qos=qos_gpu-dev',
        '--distribution=block:block',
```
@lesteve (Member)


I am curious about distribution and nomultithread below; have you ever tried not using them to see what happens?

My guess is that it should not make a huge difference, but we copied and pasted it from the Jean Zay documentation because we are not sure ...

@zaccharieramzi (Contributor, Author)


Yes, that is my feeling too.
I feel these examples are a bit too simple to test that, though; some bad behaviours might only show up on longer runs.

@lesteve (Member)


By the way, I asked by email and they told me that --distribution=block:block can be removed now. On the other hand, --hint=nomultithread is necessary ...
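So the job_extra list could be trimmed to something like this (a sketch only; the cores and memory values are placeholders, not taken from this PR):

```python
from dask_jobqueue import SLURMCluster

# Sketch: cores/memory are placeholder values, adapt them to your setup.
cluster = SLURMCluster(
    cores=1,
    memory='32GB',
    job_extra=[
        '--gres=gpu:1',
        '--qos=qos_gpu-dev',
        # '--distribution=block:block' can reportedly be dropped now
        '--hint=nomultithread',  # still necessary according to IDRIS support
    ],
)
```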

```python
    # *args
    None, save,
    # this function has potential side effects
    pure=not save,
```
@lesteve (Member)


Out of curiosity again, can you elaborate a bit more on why this function has potential side effects?

@zaccharieramzi (Contributor, Author)


So I took the definition of a pure function from here, and in their understanding a side effect could be something like writing a log (or, in this case, the weights of a neural network).

@lesteve (Member)


OK, fair enough, I think you read the doc too much ... pure only matters if you are using .submit with the same function and the same arguments. Dask has a caching mechanism where it may not recompute the results (see https://docs.dask.org/en/latest/caching.html).
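For illustration, here is a minimal sketch of what pure controls in dask.distributed (train below is a hypothetical stand-in, not the function from this PR):

```python
from dask.distributed import Client

client = Client()  # assumes a running (or local) cluster

def train(lr):
    # Stand-in for a real training function with side effects
    # (e.g. saving model weights to disk).
    return lr * 2

# With pure=True (the default), Dask hashes the function and its
# arguments into a deterministic key, so resubmitting the same call
# returns a future with the same key and the work may not run again:
f1 = client.submit(train, 0.1)
f2 = client.submit(train, 0.1)
assert f1.key == f2.key

# With pure=False, each submission gets a unique key and is always
# recomputed, which is the right choice when the call has side effects:
f3 = client.submit(train, 0.1, pure=False)
f4 = client.submit(train, 0.1, pure=False)
assert f3.key != f4.key
```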

But now I am a bit confused: I thought the function train_dense_model would be called with different parameters, e.g. the learning rate, and it seems like it is called with the same parameters ... is there a use case for this? Am I missing something?

@zaccharieramzi (Contributor, Author)


Oh yes, this is purely for the example's sake; in real life you would indeed call it with different arguments. I just didn't want to complicate things too much, but I can try having different arguments if you think it'd be clearer.

@lesteve (Member)


Different arguments would be better IMO, because it would look more like a real-life example and be easier to relate to as a DL practitioner. Something along the lines of the sketch below, for instance.
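(A sketch only; the lr parameter is an assumed extension of train_dense_model's signature, and None and save mirror the existing *args in this PR.)

```python
# Sketch: assumes train_dense_model is extended to accept a learning rate,
# so each future corresponds to a genuinely different computation.
learning_rates = [1e-4, 1e-3, 1e-2]
futures = [
    client.submit(
        train_dense_model,
        None, save, lr,
        pure=not save,
    )
    for lr in learning_rates
]
results = client.gather(futures)
```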

```diff
@@ -10,6 +10,7 @@
 #SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS)
 #SBATCH --output=tf_mnist%j.out # output file name
 #SBATCH --error=tf_mnist%j.out # error file name
+#SBATCH --qos=qos_gpu-t4 # it's a testy job
```
@lesteve (Member) commented Apr 9, 2020


What do you mean by testy (you know the different queues on Jean Zay better than I do)? Maybe not this?

[image]

So maybe this is a queue dedicated to testing (or something like this)?

@zaccharieramzi (Contributor, Author)


No, I didn't mean that haha; more "test-y", as in relating to tests in an informal fashion. I'll change it to something better suited indeed.

@zaccharieramzi (Contributor, Author)


How does that work for you?

examples/tf/mnist_submission_script_multi_gpus.slurm (outdated, resolved)
```python
        'module load tensorflow-gpu/py3/2.1.0',
    ],
)
cluster.scale(n_gpus)
```
@lesteve (Member)


A few remarks:

  • Use .scale(jobs=something); in general I find it a lot more natural to think in terms of jobs rather than in terms of Dask workers.
  • Also, I find the .scale(n_gpus) a bit misleading ... I would use a parameter like n_jobs to make it clear that it could be different from the total number of GPUs used, for example if we used SLURMCluster(cores=2, ...). See the sketch after this list.
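For example (n_jobs is just an illustrative name, not taken from this PR):

```python
n_jobs = 2  # number of SLURM jobs; not necessarily the number of GPUs
cluster.scale(jobs=n_jobs)
```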

@zaccharieramzi (Contributor, Author)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done!

@zaccharieramzi zaccharieramzi requested a review from lesteve April 10, 2020 17:22
@zaccharieramzi (Contributor, Author)

@lesteve I updated the PR with all your comments and pushed some final corrections.

@lesteve (Member) commented May 12, 2020

Thanks a lot for this, I am going to merge this one.

@lesteve lesteve merged commit 9d9d4ff into jean-zay-users:master May 12, 2020