Updates to OpenMP/MPI Materials Based on August Pilot Feedback #188

Open: wants to merge 40 commits into base: main

Commits (40 total; changes shown from 27 commits):
- `69e3a8a` Post-workshop (DiRAC) changes inclusion (mehtaparabaci, Aug 8, 2024)
- `fda4c34` Add challenge IDs (steve-crouch, Aug 9, 2024)
- `b76e013` Resolved merge conflict in 06_non_blocking_communication.md (mehtaparabaci, Dec 2, 2024)
- `e2d9e8f` Episode 2 updates: removed redundant 'c++' mention and added pedantic… (mehtaparabaci, Dec 2, 2024)
- `fcaff2a` Episode 3 updates: improved private variables explanation, added nest… (mehtaparabaci, Dec 2, 2024)
- `478fbae` Episode 4 updates: added progress counter example with race condition… (mehtaparabaci, Dec 2, 2024)
- `8edc325` Episode 4 code updates: corrected indentation in race condition lock … (mehtaparabaci, Dec 2, 2024)
- `a5638f9` Episode 4 code updates: added single region example full code (mehtaparabaci, Dec 2, 2024)
- `21f135d` Episode 4 code updates: added matrix update example full code (mehtaparabaci, Dec 2, 2024)
- `f4bfb83` Episode 4 code updates: added barrier example code (mehtaparabaci, Dec 2, 2024)
- `6bb3338` Episode 5 updates: Fixed some typos (mehtaparabaci, Dec 2, 2024)
- `bbd94bc` Episode 3 updates: Fixed some typos and revised the text discussing d… (mehtaparabaci, Dec 2, 2024)
- `e83a850` Episode 4 updates: Fixed code examples, clarified rank independence, … (mehtaparabaci, Dec 2, 2024)
- `22960ce` Episode 5 updates: Corrected typos, clarified table terminology, addr… (mehtaparabaci, Dec 2, 2024)
- `e8128e0` Fixed typos for Episode 2-11 (mehtaparabaci, Dec 2, 2024)
- `fa01357` Updated barriers code snippet and revised Reporting Progress challenge (mehtaparabaci, Dec 9, 2024)
- `3da2bf6` Updated barriers example (mehtaparabaci, Dec 9, 2024)
- `23c2ca0` Refactored code to use dynamic matrices (mehtaparabaci, Dec 9, 2024)
- `f69dc0b` Merge pull request #1 from mehtaparabaci/post-pilot-mpi (mehtaparabaci, Dec 9, 2024)
- `e26954e` Updated matrix-update example code (mehtaparabaci, Dec 9, 2024)
- `22267cd` Refactored barrier example to use omp_get_max_threads() and updated e… (mehtaparabaci, Dec 9, 2024)
- `d5d288e` Merge pull request #4 from mehtaparabaci/post-pilot-openmp (mehtaparabaci, Dec 9, 2024)
- `4e577d6` Fix markdown linting issues (mehtaparabaci, Dec 10, 2024)
- `6a31ae8` Merge branch 'post-pilot-openmp' with resolved markdown issues (mehtaparabaci, Dec 10, 2024)
- `cafae7d` Fix markdown linting issues for MPI (mehtaparabaci, Dec 10, 2024)
- `d05362a` Merge branch 'post-pilot-mpi' including markdown fixes (mehtaparabaci, Dec 10, 2024)
- `7347484` Fix markdown linting issues in hpc_parallel_intro (mehtaparabaci, Dec 10, 2024)
- `0d5c301` Update 04_synchronisation.md (mehtaparabaci, Jan 8, 2025)
- `dca45c6` Changed variable names for the data type example & updated answer for… (mehtaparabaci, Jan 20, 2025)
- `4d863dd` Fixed monospace issue and updated explanation for MPI_Send (mehtaparabaci, Jan 21, 2025)
- `83d49cf` Updated collective operations table (mehtaparabaci, Jan 21, 2025)
- `5dfb4d7` Fixed typo (mehtaparabaci, Jan 21, 2025)
- `9d5c40c` Fixed spelling issue (mehtaparabaci, Jan 21, 2025)
- `f32c3f7` Fixed tyo (mehtaparabaci, Jan 21, 2025)
- `0347cdb` Fixed typos (mehtaparabaci, Jan 21, 2025)
- `d577b2a` Updated phrasing (mehtaparabaci, Jan 21, 2025)
- `bf17230` Corrected spelling (mehtaparabaci, Jan 21, 2025)
- `f9151a6` Fixed linting issues (mehtaparabaci, Jan 21, 2025)
- `a8b05a8` Merge branch 'UNIVERSE-HPC:main' into main (mehtaparabaci, Jan 21, 2025)
- `9d0f4ac` Fixed linting issues (mehtaparabaci, Jan 21, 2025)
Files changed:

- `high_performance_computing/hpc_mpi/02_mpi_api.md`: 103 changes (60 additions, 43 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/03_communicating_data.md`: 412 changes (232 additions, 180 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/04_point_to_point_communication.md`: 371 changes (176 additions, 195 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/05_collective_communication.md`: 408 changes (235 additions, 173 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/06_non_blocking_communication.md`: 274 changes (139 additions, 135 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/07-derived-data-types.md`: 368 changes (368 additions, 0 deletions; large diff not rendered)
- `high_performance_computing/hpc_mpi/07_advanced_communication.md`: 933 changes (0 additions, 933 deletions; this file was deleted, large diff not rendered)

@@ -1,10 +1,7 @@
---
name: Optimising MPI Applications
dependsOn: [
high_performance_computing.hpc_mpi.09_porting_serial_to_mpi,
high_performance_computing.hpc_intro
]
tags: []
dependsOn: [high_performance_computing.hpc_intro, high_performance_computing.hpc_mpi.08_porting_serial_to_mpi]
tags: [mpi]
attribution:
- citation: >
"Introduction to the Message Passing Interface" course by the Southampton RSG
@@ -15,8 +12,8 @@ learningOutcomes:
- Describe and differentiate between strong and weak scaling.
- Test the strong and weak scaling performance of our MPI code.
- Use a profiler to identify performance characteristics of our MPI application.

---

Now that we have parallelised our code, we should determine how well it performs.
Given the various ways code can be parallelised, the underlying scientific implementation, and the type and amount of data the code is expected to process,
the performance of different parallelised code can vary hugely under different circumstances,
@@ -31,7 +28,8 @@ Therefore, it's really helpful to understand how well our code *scales* in perfo

## Prerequisite: [Intro to High Performance Computing](../hpc_intro/01_hpc_intro)

Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools that are only available on a HPC cluster.
Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools
that are only available on an HPC cluster.
::::

## Characterising the Scalability of Code
@@ -51,20 +49,20 @@ Ideally, we would like software to have a linear speedup that is equal to the nu
(speedup = N), as that would mean that every processor would be contributing 100% of its computational power.
Unfortunately, this is a very challenging goal for real applications to attain,
since there is always an overhead to making parallel use of greater resources.
In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation routines, I/O operations and inter-communication) which cannot be parallelised.
This limits how much a program can be speeded up,
as the program will always take at least the length of the serial portion.
In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation
routines, I/O operations and inter-communication) which cannot be parallelised.
This limits how much a program can be sped up, as the program will always take at least the length of the serial portion.

### Amdahl's Law and Strong Scaling

There is a theoretical limit in what parallelization can achieve, and it is encapsulated in "Amdahl's Law":
There is a theoretical limit in what parallelisation can achieve, and it is encapsulated in "Amdahl's Law":

$$ \mathrm{speedup} = 1 / (s + p / N) $$

Where:

- $$s$$ is the proportion of execution time spent on the serial part
- $$p$$ is the proportion of execution time spent on the part that can be parallelized
- $$p$$ is the proportion of execution time spent on the part that can be parallelised
- $$N$$ is the number of processors

Amdahl’s law states that, for a fixed problem, the upper limit of speedup is determined by the serial fraction of the code; most real-world applications have some serial portion or unintended delays (such as communication overheads) which will limit the code’s scalability.
@@ -77,7 +75,7 @@ Amdahl’s law states that, for a fixed problem, the upper limit of speedup is d
## Amdahl's Law in Practice

Consider a program that takes 20 hours to run using one core.
If a particular part of the rogram, which takes one hour to execute, cannot be parallelized (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelized (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour.
If a particular part of the program, which takes one hour to execute, cannot be parallelised (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelised (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelised execution of this program, the minimum execution time cannot be less than that critical one hour.
Hence, the theoretical speedup is limited to at most 20 times (when N = ∞, speedup = 1/s = 20). A short program evaluating this limit is sketched just after this callout.
::::
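
As a quick check of those numbers, here is a minimal sketch (in C, not part of the course code) that evaluates Amdahl's law for the worked example above, with s = 0.05:

```c
#include <stdio.h>

int main(void)
{
    const double s = 0.05;      /* serial fraction from the example above */
    const double p = 1.0 - s;   /* parallelisable fraction */
    const int ranks[] = {1, 2, 4, 8, 16, 32, 64, 1024};
    const int n = sizeof(ranks) / sizeof(ranks[0]);

    for (int i = 0; i < n; ++i) {
        double speedup = 1.0 / (s + p / ranks[i]);
        printf("N = %4d  speedup = %5.2f\n", ranks[i], speedup);
    }
    printf("Limit as N grows: 1/s = %.1f\n", 1.0 / s);
    return 0;
}
```

Even with 1024 processors the speedup stays just below the 1/s ceiling of 20.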

@@ -133,10 +131,11 @@ The figure below shows an example of the scaling with `GRIDSIZE=512` and `GRIDSI
![Figure showing the result described above for `GRIDSIZE=512` and `GRIDSIZE=2048`](fig/poisson_scaling_plot.png)

In the example, which runs on a machine with two 20-core Intel Xeon Scalable CPUs, using 32 ranks actually takes more time.
The 32 ranks don't fit on one CPU and communicating between the the two CPUs takes more time, even though they are in the same machine.
The 32 ranks don't fit on one CPU and communicating between the two CPUs takes more time, even though they are in the same machine.

The communication could be made more efficient.
We could use non-blocking communication and do some of the computation while communication is happening.

::::
:::::
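
The snippet below is a minimal sketch of that overlap idea, not the course's Poisson implementation: the row decomposition, buffer layout and helper functions are hypothetical placeholders.

```c
#include <mpi.h>

void update_interior(double *u, int nrows, int ncols);   /* hypothetical helpers */
void update_boundary(double *u, int nrows, int ncols);

/* One iteration of a row-decomposed stencil: start the halo exchange with
 * non-blocking calls, compute the rows that need no halo data while the
 * messages are in flight, then finish the boundary rows after MPI_Waitall(). */
void iterate_with_overlap(double *u, int nrows, int ncols,
                          int rank_up, int rank_down)
{
    MPI_Request reqs[4];

    MPI_Irecv(&u[0],                   ncols, MPI_DOUBLE, rank_up,   0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[(nrows - 1) * ncols], ncols, MPI_DOUBLE, rank_down, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[ncols],               ncols, MPI_DOUBLE, rank_up,   1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[(nrows - 2) * ncols], ncols, MPI_DOUBLE, rank_down, 0, MPI_COMM_WORLD, &reqs[3]);

    update_interior(u, nrows, ncols);   /* rows that do not touch the halos */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    update_boundary(u, nrows, ncols);   /* rows that need the received halos */
}
```

Ranks at the edges of the decomposition can pass `MPI_PROC_NULL` as the missing neighbour, so the same code works there without special cases.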

@@ -187,12 +186,11 @@ The increase in runtime is probably due to the memory bandwidth of the node bein

The other significant factors in the speed of a parallel program are communication speed and latency.

Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying hardware for the communication.
Latency consists of the software latency (how long the operating system needs in order to prepare for a communication), and the hardware latency (how long the hardware takes to
send/receive even a small bit of data).
Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying
hardware for the communication. Latency consists of the software latency (how long the operating system needs in order to prepare
for a communication), and the hardware latency (how long the hardware takes to send/receive even a small bit of data).

For a fixed-size problem, the time spent in communication is not significant when the number of ranks is small and the execution of parallel regions gets faster with the number of ranks.
But if we keep increasing the number of ranks, the time spent in communication grows when multiple cores are involved with communication.
For a fixed-size problem, the time spent in communication is not significant when the number of ranks is small and the execution of parallel regions gets faster with the number of ranks. But if we keep increasing the number of ranks, the time spent in communication grows when multiple cores are involved with communication.
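
A common first-order model (an illustrative assumption, not a formula from the course text) combines these two costs for a message of $$n$$ bytes:

$$ T_{\mathrm{comm}} \approx \lambda + \frac{n}{B} $$

where $$\lambda$$ is the per-message latency and $$B$$ is the bandwidth. The latency term is paid once for every message, which is why sending many small messages costs more than sending the same data in one large message.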

### Surface-to-Volume Ratio

@@ -202,7 +200,7 @@ The whole data which a CPU or a core computes is the sum of the two. The data un
The surface data requires communications.
The more surface there is, the more communication among CPUs/cores is needed, and the longer the program will take to finish.

Due to Amdahl's law, you want to minimize the number of communications for the same surface since each communication takes finite amount of time to prepare (latency).
Due to Amdahl's law, you want to minimise the number of communications for the same surface since each communication takes a finite amount of time to prepare (latency).
This suggests that the surface data be exchanged in one communication if possible, not small parts of the surface data exchanged in multiple communications.
Of course, sequential consistency should be obeyed when the surface data is exchanged.
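
As a hedged illustration of exchanging the surface data in one communication rather than many (the function, buffers and neighbour argument below are hypothetical, not taken from the course code):

```c
#include <mpi.h>

/* Exchange a contiguous strip of surface data with one neighbouring rank. */
void exchange_surface(const double *send_surface, double *recv_surface,
                      int count, int neighbour)
{
    /* Sending the elements one at a time would pay the message latency
     * `count` times:
     *
     *   for (int i = 0; i < count; ++i)
     *       MPI_Sendrecv(&send_surface[i], 1, MPI_DOUBLE, neighbour, 0,
     *                    &recv_surface[i], 1, MPI_DOUBLE, neighbour, 0,
     *                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     */

    /* Exchanging the whole surface in a single message pays it only once. */
    MPI_Sendrecv(send_surface, count, MPI_DOUBLE, neighbour, 0,
                 recv_surface, count, MPI_DOUBLE, neighbour, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```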

@@ -212,10 +210,10 @@ Now we have a better understanding of how our code scales with resources and pro
But we should be careful!

> "We should forget about small efficiencies, say about 97% of the time:
> premature optimization is the root of all evil." -- Donald Knuth
> premature optimisation is the root of all evil." -- Donald Knuth

Essentially, before attempting to optimize your own code, you should profile it.
Typically, most of the runtime is spent in a few functions/subroutines, so you should focus your optimization efforts on those parts of the code.
Essentially, before attempting to optimise your own code, you should profile it.
Typically, most of the runtime is spent in a few functions/subroutines, so you should focus your optimisation efforts on those parts of the code.
The good news is that there are helpful tools known as *profilers* that can help us.

Profilers help you find out where a program is spending its time and pinpoint places where optimising it makes sense.
@@ -247,7 +245,7 @@ For more information on ARM Forge see the [product website](https://www.arm.com/

## Software Availability

The ARM Forge suite of tools are licensed, and so may or may not be available on your HPC cluster (and certainly won't be on your laptop or desktop unless you buy a license and build them yourself!).
The ARM Forge suite of tools is licensed, and so may or may not be available on your HPC cluster (and certainly won't be on your laptop or desktop unless you buy a license and build them yourself!).

If you don't have access to the ARM Forge tools, your local HPC cluster should have an alternative installed with similar functionality.
::::
@@ -258,18 +256,14 @@ Ordinarily when profiling our code using such a tool, it is advisable to create
Fortunately that's something we can readily configure with our `poisson_mpi.c` code.
For now, set `MAX_ITERATIONS` to `25000` and `GRIDSIZE` to `512`.

We first load the module for Performance Reports.
How you do this will vary site-to-site, but for COSMA on DiRAC we can do:
We first load the module for Performance Reports. Remember that how you do this will vary site-to-site.

```bash
module load armforge/23.1.0
module load allinea/ddt/18.1.2
```
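
If you are unsure what the Performance Reports module is called on your cluster, a quick search of the available modules can help (the names grepped for here are assumptions and vary by site):

```bash
# `module avail` prints to stderr, so redirect it before filtering
module avail 2>&1 | grep -i -E "arm|forge|allinea"
```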

Next, we run the executable using Performance Reports
to analyse the program execution.
Create a new version of our SLURM submission script we used before,
which includes the following at the bottom of the script instead:
Next, we run the executable using Performance Reports to analyse the program execution. Create a new version of our SLURM submission script we used before, which includes the following at the bottom of the script instead:

```bash
module load armforge/23.1.0
@@ -328,15 +322,14 @@ spent in the actual compute sections of the code.
Compile, run and analyse your own MPI version of the poisson code.

How closely does it match the performance above? What are the main differences?
Try reducing the number of processes used, rerun and investigate the profile.
Is it still MPI-bound?
Try reducing the number of processes used, rerun and investigate the profile. Is it still MPI-bound?

Increase the problem size, recompile, rerun and investigate the profile.
What has changed now?
:::::

:::::challenge{id=iterative-improvement, title="Iterative Improvement"}
In the Poisson code, try changing the location of the calls to `MPI_Send`. How does this affect performance?
In the Poisson code, try changing the location of the calls to `MPI_Send()`. How does this affect performance?
:::::

::::callout{variant="tip"}