add Tensorflow test #38
Conversation
…nted into functions. Also separated the training (which is timed now) and evaluation (which is not timed). Clearly print computational performance and accuracy at the end, to make it easy for a ReFrame test to pick up in sanity and performance functions.
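For illustration, the sanity and performance functions that pick up these printed values could look roughly like the sketch below. This is a hedged sketch only: the class name, script name, output format and threshold are assumptions, not the actual test in this PR.

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class TensorFlowTestSketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'python'
    executable_opts = ['tf_test.py']  # hypothetical script name

    @sanity_function
    def assert_network_learned(self):
        # Assumes the script ends by printing e.g. "Final accuracy: 0.97"
        acc = sn.extractsingle(r'Final accuracy:\s+(?P<acc>\S+)',
                               self.stdout, 'acc', float)
        return sn.assert_ge(acc, 0.8)

    @performance_function('img/s')
    def throughput(self):
        # Assumes the script ends by printing e.g. "Performance: 330000 img/s"
        return sn.extractsingle(r'Performance:\s+(?P<perf>\S+)\s+img/s',
                                self.stdout, 'perf', float)
```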
…sks etc is still hard-coded. Also still a todo: make sure that proper binding is used
…ograms like TensorFlow
…y machine that results in all threads being bound to the first core of the allocation, rather than one thread per core
…o process binding for TF
The main issue with this PR is currently twofold. Note that I've been testing this with our locally built TensorFlow module.

Issue 1: For details, see tensorflow/tensorflow#60843

Issue 2: Running
I see quite a lot of CPU usage. Note that I upped the batch size to 4096 to get the GPUs to >90% utilization. Although not very clearly visible from the screenshot, pretty much all of the CPU activity is from the first python process (the one corresponding to rank 0). From the screenshot, it is also clear that one of the GPUs seems to be underutilized, and which GPU is underutilized seems to change per epoch. Total throughput for this run is
Note that when using 1 GPU:
I don't see this high CPU utilization. I guess it is related to the communication with the other workers (maybe wait time?). For reference, this run resulted in
The high CPU utilization wouldn't be a huge issue if I didn't also see reduced performance when e.g. binding:
Utilization is clearly lower, and indeed performance goes down to
I've tried to reproduce this with
Makes the code run. Then I see a nicely balanced load over the GPUs (though occupancy isn't great), and no strange high CPU usage for the first rank: Performance is pretty similar to before. Let's see if I can replicate this with my TensorFlow module if I turn off NCCL: Ugh, I guess that's a no: I'm still seeing the high CPU load, and GPU utilization is terrible if I don't select the NCCL communicator. Unsurprisingly, performance is terrible too: Of course, this could still be a version difference: nightly is somewhere beyond the latest release, which is version
Ugh... TensorFlow is so complicated, and multinode TensorFlow doubly so.
From a pip-installed version
Performance is similar to the unbound case we had before: Note that this is without NCCL as communication_option. With NCCL, I see the same strange behaviour of unbalanced GPU usage (still no high CPU usage though): Performance is slightly higher than for the other cases: It makes me wonder: maybe the high GPU utilization on some GPUs simply indicates wait cycles on the GPU, i.e. the GPU with the lowest utilization is the bottleneck in that epoch, and the rest are waiting for it? I'm not sure, just a theory.
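For reference, selecting the NCCL communicator for a MultiWorkerMirroredStrategy typically looks like the snippet below; leaving the option at its default lets TensorFlow pick the collective implementation itself. This is a generic sketch, and whether the test script in this PR does exactly this is an assumption.

```python
import tensorflow as tf

# Explicitly request NCCL collectives for multi-worker training.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CollectiveCommunication.NCCL
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options
)

# Not selecting the NCCL communicator simply means constructing the strategy
# without communication_options, leaving the implementation on AUTO:
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```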
If you bypass NCCL, does it also bypass NVLink and do all the communication via the host? That could be the reason for the worse performance with TF nightly. Regarding the utilization from 1 to 4 GPUs, that is a bigger mystery; a deeper profile is required.
Tf-nightly (without NCCL) actually performs pretty decently, at around 330k img/s. It was our local TF module without NCCL that was terrible :P
The more I think about it, the more I believe the test is actually done. It currently does:
Maybe it's simply time to take the next step and have some others test it on their TensorFlow module. Unfortunately, there is no GPU-capable TF module in EESSI yet, so I can't test with that (and maybe worrying about the GPU part of this test is a bit premature anyway). @smoors @boegel any chance you could run this on your own clusters, with your own TensorFlow module, and see if you see similar things to what I describe above (i.e. high CPU utilization for rank 0, somewhat sub-par performance)?
Hm, I overlooked something in the hooks:
The jobscript looks like this:
The issue here is that it is still generating a fixed number of tasks per node, which is equal to the socket count. The cpus-per-task is then calculated by taking the default cpu count (1 for this scale) and dividing it by the task count, rounding down.
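In pseudo-ReFrame terms, the current behaviour seems to boil down to something like the sketch below (the function and variable names are assumptions, not the actual hook code):

```python
# One task per socket; cpus-per-task from integer division of the cpu count.
def assign_one_task_per_socket(test, default_num_cpus):
    processor = test.current_partition.processor
    test.num_tasks_per_node = processor.num_sockets
    # With default_num_cpus == 1 at this scale and 2 sockets, 1 // 2 == 0:
    # the division rounds down, which is the problem described above.
    test.num_cpus_per_task = default_num_cpus // test.num_tasks_per_node
```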
Yep, that looks better. For 1_core:
for
for 1_8 node:
for 1/2 node:
So, this seems to work for both core count and node part specification by
…etwork learned something
…rdinated between workers. This would result in lines being broken off and sanity patterns not matching. All printing is now done by rank 0
Ugh, ran it on Vega, and got:
I know where this is coming from: Vega has hyperthreading enabled. Thus, their
We should really make the hooks that set the
For context, the submitted job on Vega looks like this:
Oh, yet another challenge: I need to convince SLURM to ask for 128 'cpus' (
Ok, I think using cores as processing elements makes the most sense. That means changing the hook with:
By dividing by num_cpus_per_core, we count only cores instead of hardware threads. We'll still need to figure out how this will then work for e.g. pure MPI programs. On hybrid systems, you'd want those to set
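A minimal sketch of what that could look like in the hook; only the division by num_cpus_per_core comes from the comment above, the function name and surrounding logic are assumptions:

```python
def assign_one_task_per_socket(test, num_cpus_for_scale):
    processor = test.current_partition.processor
    # Count cores rather than hardware threads, e.g. with hyperthreading
    # (2 threads per core) 256 'cpus' become 128 cores.
    num_cores = num_cpus_for_scale // processor.num_cpus_per_core
    test.num_tasks_per_node = processor.num_sockets
    test.num_cpus_per_task = num_cores // test.num_tasks_per_node
```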
Fun fact: the EESSI TensorFlow version is too old for this test... Well, that's a temporary problem; I'm not going to change the test to fit our extremely old 2.3 version of TF :) One can test it on newer (local) modules.
Ok, I checked on my system now whether at least the binding and process spawning were done properly:
The generated job script looked like this:
Which is as expected: we indeed have 2 sockets, so we want 2 tasks. We have 64 cores per socket, so expect
Checking the binding during the run shows that the tasks were bound correctly to their own set of cores.
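As an aside, a generic way to check this kind of binding from inside each rank (not necessarily how it was checked here) is to print the CPU affinity mask of each process:

```python
import os
import socket

# Linux-only: report which cores this rank is allowed to run on.
rank = os.environ.get('SLURM_PROCID', '?')
cores = sorted(os.sched_getaffinity(0))
print(f'{socket.gethostname()} rank {rank}: bound to cores {cores}')
```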
I think this is ready to be reviewed :)
I tested this on CPU and GPU nodes with --exclusive set to make sure binding is correct.
It seems to work as intended, with tasks bound to sockets.
The number of active threads is slightly higher than requested with cpus-per-task, which is expected for TF (see the sketch after this comment).
For the GPU node, it actually ran faster with 1 GPU than with 2; is that due to the small size of the system?
I did not check whether the GPUs were actually used.
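As a possible follow-up, TensorFlow's thread pools could be aligned with the SLURM allocation along these lines (a sketch; the test in this PR may handle this differently, and TF still spawns a few extra service threads, which matches seeing slightly more active threads than requested):

```python
import os
import tensorflow as tf

# Limit TF's intra-/inter-op thread pools to the allocated cpus-per-task.
num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', '1'))
tf.config.threading.set_intra_op_parallelism_threads(num_cpus)
tf.config.threading.set_inter_op_parallelism_threads(2)
```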
Co-authored-by: Sam Moors <[email protected]>
More elegant way of retrieving local rank Co-authored-by: Sam Moors <[email protected]>
Enable verbosity for SLURM binding Co-authored-by: Sam Moors <[email protected]>
lgtm
still some formatting issues, but we can deal with those later
Yeah, we should do a PR fixing all the formatting issues in this repo and introducing a style check in the CI... It's the best way to ensure proper code style. Note: I tried to do better by installing black in my VS Code, but for some reason it doesn't work properly. I have a bit of a complicated setup with remote-ssh, so maybe that's why...