add Tensorflow test #38
Conversation
…nted into functions. Also separated the training (which is timed now) and evaluation (which is not timed). Clearly print computational performance and accuracy at the end, to make it easy for a ReFrame test to pick up in sanity and performance functions.
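For illustration, the sanity and performance functions that pick up these printed values could look roughly like the sketch below. This is a hedged sketch only: the class name, script name, output format and threshold are assumptions, not the actual test in this PR.

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class TensorFlowTestSketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'python'
    executable_opts = ['tf_test.py']  # hypothetical script name

    @sanity_function
    def assert_network_learned(self):
        # Assumes the script ends by printing e.g. "Final accuracy: 0.97"
        acc = sn.extractsingle(r'Final accuracy:\s+(?P<acc>\S+)',
                               self.stdout, 'acc', float)
        return sn.assert_ge(acc, 0.8)

    @performance_function('img/s')
    def throughput(self):
        # Assumes the script ends by printing e.g. "Performance: 330000 img/s"
        return sn.extractsingle(r'Performance:\s+(?P<perf>\S+)\s+img/s',
                                self.stdout, 'perf', float)
```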
…sks etc is still hard-coded. Also still a todo: make sure that proper binding is used
…ograms like TensorFlow
…y machine that results in all threads being bound to the first core of the allocation, rather than one thread per core
…o process binding for TF
The main issue with this PR is currently twofold. Note that I've been testing this with our locally built TensorFlow module.

Issue 1: For details, see tensorflow/tensorflow#60843

Issue 2: Running
I see quite a lot of CPU usage. Note that I upped the batch size to 4096 to get the GPUs to >90% utilization. Although not very clearly visible from the screenshot, pretty much all of the CPU activity is from the first python process (the one corresponding to rank 0). From the screenshot, it is also clear that one of the GPUs seems to be underutilized, and which GPU is underutilized seems to change per epoch. Total throughput for this run is
Note that when using 1 GPU:
I don't see this high CPU utilization. I guess it is related to the communication with the other workers (maybe wait time?). For reference, this run resulted in
The high CPU utilization wouldn't be a huge issue if I didn't also see reduced performance when e.g. binding:
Utilization is clearly lower, and indeed performance goes down to
I've tried to reproduce this with
Makes the code run. Then I see a nicely balanced load over the GPUs (though occupancy isn't great), and no strange high CPU usage for the first rank: Performance is pretty similar to before. Let's see if I can replicate this with my TensorFlow module if I turn off NCCL: Ugh, I guess that's a no: I'm still seeing the high CPU load, and GPU utilization is terrible if I don't select the NCCL communicator. Unsurprisingly, performance is terrible too: Of course, this could still be a version difference: nightly is somewhere beyond the latest release, which is version
Ugh... TensorFlow is so complicated, and multinode TensorFlow doubly so.
From a pip-installed version
Performance is similar to the unbound case we had before: Note that this is without NCCL as communication_option. With NCCL, I see the same strange behaviour of unbalanced GPU usage (still no high CPU usage though): Performance is slightly higher than for the other cases: It makes me wonder: maybe the high GPU utilization on some GPUs simply indicates wait cycles on the GPU, i.e. the GPU with the lowest utilization is the bottleneck in that epoch, and the rest are waiting for it? I'm not sure, just a theory.
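For reference, selecting the NCCL communicator for a MultiWorkerMirroredStrategy typically looks like the snippet below; leaving the option at its default lets TensorFlow pick the collective implementation itself. This is a generic sketch, and whether the test script in this PR does exactly this is an assumption.

```python
import tensorflow as tf

# Explicitly request NCCL collectives for multi-worker training.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CollectiveCommunication.NCCL
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options
)

# Not selecting the NCCL communicator simply means constructing the strategy
# without communication_options, leaving the implementation on AUTO:
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```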
If you bypass NCCL, does it also bypass NVLink and do all the communication via the host? That could be the reason for the worse performance with TF nightly. Regarding the utilization from 1 to 4 GPUs, that is a bigger mystery; a deeper profile is required.
Tf-nightly (without NCCL) actually performs pretty decently, at around 330k img/s. It was our local TF module without NCCL that was terrible :P
The more I think about it, the more I believe the test is actually done. It currently does:
Maybe it's simply time to take the next step and have some others test it on their TensorFlow module. Unfortunately, there is no GPU-capable TF module in EESSI yet, so I can't test with that (and maybe worrying about the GPU part of this test is a bit premature anyway). @smoors @boegel any chance you could run this on your own clusters, with your own TensorFlow module, and see if you see similar things to what I describe above (i.e. high CPU utilization for rank 0, somewhat sub-par performance)?
Hm, I overlooked something in the hooks:
The jobscript looks like this:
The issue here is that it is still generating a fixed number of tasks per node, which is equal to the socket count. The cpus-per-task is then calculated by taking the default cpu count (1 for this scale) and dividing it by the task count, rounding down.
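In pseudo-ReFrame terms, the current behaviour seems to boil down to something like the sketch below (the function and variable names are assumptions, not the actual hook code):

```python
# One task per socket; cpus-per-task from integer division of the cpu count.
def assign_one_task_per_socket(test, default_num_cpus):
    processor = test.current_partition.processor
    test.num_tasks_per_node = processor.num_sockets
    # With default_num_cpus == 1 at this scale and 2 sockets, 1 // 2 == 0:
    # the division rounds down, which is the problem described above.
    test.num_cpus_per_task = default_num_cpus // test.num_tasks_per_node
```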
Yep, that looks better. For 1_core:
for
for 1_8 node:
for 1/2 node:
So, this seems to work for both core count and node part specification by
…etwork learned something
…rdinated between workers. This would result in lines being broken off and sanity patterns not matching. All printing is now done by rank 0
Ugh, ran it on Vega, and got:
I know where this is coming from: Vega has hyperthreading enabled. Thus, their
We should really make the hooks that set the
For context, the submitted job on Vega looks like this:
Oh, yet another challenge: I need to convince SLURM to ask for 128 'cpus' (
Ok, I think using cores as processing elements makes the most sense. That means changing the hook with:
By dividing by num_cpus_per_core, we count only cores instead of hardware threads. We'll still need to figure out how this will then work for e.g. pure MPI programs. On hybrid systems, you'd want those to set
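A minimal sketch of what that could look like in the hook; only the division by num_cpus_per_core comes from the comment above, the function name and surrounding logic are assumptions:

```python
def assign_one_task_per_socket(test, num_cpus_for_scale):
    processor = test.current_partition.processor
    # Count cores rather than hardware threads, e.g. with hyperthreading
    # (2 threads per core) 256 'cpus' become 128 cores.
    num_cores = num_cpus_for_scale // processor.num_cpus_per_core
    test.num_tasks_per_node = processor.num_sockets
    test.num_cpus_per_task = num_cores // test.num_tasks_per_node
```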
Fun fact: the EESSI TensorFlow version is too old for this test... Well, that's a temporary problem; I'm not going to change the test to fit our extremely old 2.3 version of TF :) One can test it on newer (local) modules.
Ok, I checked on my system now whether at least the binding and process spawning were done properly:
The generated job script looked like this:
Which is as expected: we indeed have 2 sockets, so we want 2 tasks. We have 64 cores per socket, so expect
Checking the binding during the run shows that the tasks were bound correctly to their own set of cores.
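As an aside, a generic way to check this kind of binding from inside each rank (not necessarily how it was checked here) is to print the CPU affinity mask of each process:

```python
import os
import socket

# Linux-only: report which cores this rank is allowed to run on.
rank = os.environ.get('SLURM_PROCID', '?')
cores = sorted(os.sched_getaffinity(0))
print(f'{socket.gethostname()} rank {rank}: bound to cores {cores}')
```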
I think this is ready to be reviewed :)
I tested this on CPU and GPU nodes with --exclusive set to make sure binding is correct.
It seems to work as intended, with tasks bound to sockets.
The number of active threads is slightly higher than requested with cpus-per-task, which is expected for TF (see the sketch after this comment).
For the GPU node, it actually ran faster with 1 GPU than with 2; is that due to the small size of the system?
I did not check whether the GPUs were actually used.
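As a possible follow-up, TensorFlow's thread pools could be aligned with the SLURM allocation along these lines (a sketch; the test in this PR may handle this differently, and TF still spawns a few extra service threads, which matches seeing slightly more active threads than requested):

```python
import os
import tensorflow as tf

# Limit TF's intra-/inter-op thread pools to the allocated cpus-per-task.
num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', '1'))
tf.config.threading.set_intra_op_parallelism_threads(num_cpus)
tf.config.threading.set_inter_op_parallelism_threads(2)
```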
Co-authored-by: Sam Moors <[email protected]>
More elegant way of retrieving local rank Co-authored-by: Sam Moors <[email protected]>
Enable verbosity for SLURM binding Co-authored-by: Sam Moors <[email protected]>
lgtm
still some formatting issues, but we can deal with those later
Yeah, we should do a PR fixing all the formatting issues in this repo and introducing a style check in the CI... It's the best way to ensure proper code style. Note: I tried to do better by installing black in my VS Code, but for some reason it doesn't work properly. I have a bit of a complicated setup with remote-ssh, so maybe that's why...