From 69e3a8ae76f381e952342a1307ca50270ff10d4f Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Thu, 8 Aug 2024 16:33:07 +0100 Subject: [PATCH 01/34] Post-workshop (DiRAC) changes inclusion The changes are covering OpenMP, MPI and Parallelisation sessions. --- .../hpc_mpi/02_mpi_api.md | 96 +- .../hpc_mpi/03_communicating_data.md | 404 ++++---- .../04_point_to_point_communication.md | 303 +++--- .../hpc_mpi/05_collective_communication.md | 363 +++---- .../hpc_mpi/06_non_blocking_communication.md | 261 +++-- .../hpc_mpi/07-derived-data-types.md | 366 +++++++ .../hpc_mpi/07_advanced_communication.md | 933 ------------------ ..._to_mpi.md => 08_porting_serial_to_mpi.md} | 143 +-- ...optimising_mpi.md => 09_optimising_mpi.md} | 45 +- ...tterns.md => 10_communication_patterns.md} | 202 ++-- .../hpc_mpi/11_advanced_communication.md | 586 +++++++++++ .../hpc_mpi/code/examples/02-count-primes.c | 12 +- high_performance_computing/hpc_mpi/index.md | 11 +- .../hpc_openmp/02_intro_openmp.md | 4 +- .../hpc_openmp/03_parallel_api.md | 37 +- .../hpc_openmp/04_synchronisation.md | 18 +- .../hpc_openmp/05_hybrid_parallelism.md | 43 +- .../hpc_openmp/index.md | 2 +- .../hpc_parallel_intro/01_introduction.md | 67 +- .../hpc_parallel_intro/index.md | 7 +- 20 files changed, 1955 insertions(+), 1948 deletions(-) create mode 100644 high_performance_computing/hpc_mpi/07-derived-data-types.md delete mode 100644 high_performance_computing/hpc_mpi/07_advanced_communication.md rename high_performance_computing/hpc_mpi/{09_porting_serial_to_mpi.md => 08_porting_serial_to_mpi.md} (77%) rename high_performance_computing/hpc_mpi/{10_optimising_mpi.md => 09_optimising_mpi.md} (91%) rename high_performance_computing/hpc_mpi/{08_communication_patterns.md => 10_communication_patterns.md} (75%) create mode 100644 high_performance_computing/hpc_mpi/11_advanced_communication.md diff --git a/high_performance_computing/hpc_mpi/02_mpi_api.md b/high_performance_computing/hpc_mpi/02_mpi_api.md index d8fdfd3a..561dde80 100644 --- a/high_performance_computing/hpc_mpi/02_mpi_api.md +++ b/high_performance_computing/hpc_mpi/02_mpi_api.md @@ -1,8 +1,7 @@ --- name: Introduction to the Message Passing Interface -dependsOn: [ -] -tags: [] +dependsOn: [] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -14,14 +13,13 @@ learningOutcomes: - Understand how to use the MPI API. - Learn how to compile and run MPI applications. - Use MPI to coordinate the use of multiple processes across CPUs. - --- ## What is MPI? MPI stands for ***Message Passing Interface*** and was developed in the early 1990s as a response to the need for a standardised approach to parallel programming. During this time, parallel computing systems were gaining popularity, featuring powerful machines with multiple processors working in tandem. However, the lack of a common interface for communication and coordination between these processors presented a challenge. -To address this challenge, researchers and computer scientists from leading vendors and organizations, including Intel, IBM, and Argonne National Laboratory, collaborated to develop MPI. Their collective efforts resulted in the release of the first version of the MPI standard, MPI-1, in 1994. This standardisation initiative aimed to provide a unified communication protocol and library for parallel computing. 
+To address this challenge, researchers and computer scientists from leading vendors and organisations, including Intel, IBM, and Argonne National Laboratory, collaborated to develop MPI. Their collective efforts resulted in the release of the first version of the MPI standard, MPI-1, in 1994. This standardisation initiative aimed to provide a unified communication protocol and library for parallel computing. ::::callout @@ -33,7 +31,7 @@ Since its inception, MPI has undergone several revisions, each introducing new f It formed the foundation for parallel programming using MPI. - **MPI-2 (1997):** This version expanded upon MPI-1 by introducing additional features such as dynamic process management, one-sided communication, paralell I/O, C++ and Fortran 90 bindings. MPI-2 improved the flexibility and capabilities of MPI programs. -- **MPI-3 (2012):** MPI-3 brought significant enhancements to the MPI standard, including support for non-blocking collectives, improved multithreading, and performance optimizations. +- **MPI-3 (2012):** MPI-3 brought significant enhancements to the MPI standard, including support for non-blocking collectives, improved multithreading, and performance optimisations. It also addressed limitations from previous versions and introduced fully compliant Fortran 2008 bindings. Moreover, MPI-3 completely removed the deprecated C++ bindings, which were initially marked as deprecated in MPI-2.2. - **MPI-4.0 (2021):** On June 9, 2021, the MPI Forum approved MPI-4.0, the latest major release of the MPI standard. @@ -42,15 +40,16 @@ Since its inception, MPI has undergone several revisions, each introducing new f These revisions, along with subsequent updates and errata, have refined the MPI standard, making it more robust, versatile, and efficient. :::: -Today, various MPI implementations are available, each tailored to specific hardware architectures and systems. Popular implementations like [MPICH](https://www.mpich.org/), [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.0tevpk), [IBM Spectrum MPI](https://www.ibm.com/products/spectrum-mpi), [MVAPICH](https://mvapich.cse.ohio-state.edu/) and [Open MPI](https://www.open-mpi.org/) offer optimized performance, portability, and flexibility. -For instance, MPICH is known for its efficient scalability on HPC systems, while Open MPI prioritizes extensive portability and adaptability, providing robust support for multiple operating systems, programming languages, and hardware platforms. +Today, various MPI implementations are available, each tailored to specific hardware architectures and systems. Popular implementations like [MPICH](https://www.mpich.org/), +[Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.0tevpk), +[IBM Spectrum MPI](https://www.ibm.com/products/spectrum-mpi), [MVAPICH](https://mvapich.cse.ohio-state.edu/) and +[Open MPI](https://www.open-mpi.org/) offer optimised performance, portability, and flexibility. +For instance, MPICH is known for its efficient scalability on HPC systems, while Open MPI prioritises extensive portability and adaptability, providing robust support for multiple operating systems, programming languages, and hardware platforms. The key concept in MPI is **message passing**, which involves the explicit exchange of data between processes. -Processes can send messages to specific destinations, broadcast messages to all processes, or perform collective operations where all processes participate. 
-This message passing and coordination among parallel processes are facilitated through a set of fundamental functions provided by the MPI standard. -Typically, their names start with `MPI_` and followed by a specific function or datatype identifier. Here are some examples: +Processes can send messages to specific destinations, broadcast messages to all processes, or perform collective operations where all processes participate. This message passing and coordination among parallel processes are facilitated through a set of fundamental functions provided by the MPI standard. Typically, their names start with `MPI_` and followed by a specific function or datatype identifier. Here are some examples: -- **MPI_Init:** Initializes the MPI execution environment. +- **MPI_Init:** Initialises the MPI execution environment. - **MPI_Finalize:** Finalises the MPI execution environment. - **MPI_Comm_rank:** Retrieves the rank of the calling process within a communicator. - **MPI_Comm_size:** Retrieves the size (number of processes) within a communicator. @@ -64,18 +63,14 @@ In the following episodes, we will explore these functions in more detail, expan In general, an MPI program follows a basic outline that includes the following steps: -1. ***Initialization:*** The MPI environment is initialized using the `MPI_Init` function. This step sets up the necessary communication infrastructure and prepares the program for message passing. -2. ***Communication:*** MPI provides functions for sending and receiving messages between processes. The `MPI_Send` function is used to send messages, while the `MPI_Recv` function is used to receive messages. -3. ***Termination:*** Once the necessary communication has taken place, the MPI environment is finalised using the `MPI_Finalize` function. This ensures the proper termination of the program and releases any allocated resources. +1. ***Initialisation:*** The MPI environment is initialised using the `MPI_Init()` function. This step sets up the necessary communication infrastructure and prepares the program for message passing. +2. ***Communication:*** MPI provides functions for sending and receiving messages between processes. The `MPI_Send()` function is used to send messages, while the `MPI_Recv()` function is used to receive messages. +3. ***Termination:*** Once the necessary communication has taken place, the MPI environment is finalised using the `MPI_Finalize()` function. This ensures the proper termination of the program and releases any allocated resources. ## Getting Started with MPI: MPI on HPC -As MPI codes allow you to run a code on multiple cores, we typically develop them to run on large systems like HPC clusters. -These are usually configured with versions of OpenMPI that have been optimised for the specific hardware involved, for maximum performance. - -For this episode, log into whichever HPC system you have access to - this could be a group server, or university- or national-level cluster (e.g. Iridis or DiRAC). +HPC clusters typically have **more than one version** of MPI available, so you may need to tell it which one you want to use before it will give you access to it. For this episode, log into whichever HPC system you have access to - this could be a group server, or university- or national-level cluster (e.g. Iridis, DiRAC or ARCHER2). -HPC clusters typically have **more than one version** of MPI available, so you may need to tell it which one you want to use before it will give you access to it. 
First check the available MPI implementations/modules on the cluster using the command below: ```bash @@ -83,13 +78,19 @@ module avail ``` This will display a list of available modules, including MPI implementations. -As for the next step, you should choose the appropriate MPI implementation/module from the list based on your requirements and load it using `module load `. -For example, if you want to load OpenMPI version 4.0.5, you can use: + +As for the next step, you should choose the appropriate MPI implementation/module from the list based on your requirements and load it using `module load `. For example, if you want to load OpenMPI version 4.0.5, you can use: ```bash module load openmpi/4.0.5 ``` +You may also need to load a compiler depending on your environment, and may get a warning as such. For example, you need to do something like this beforehand: + +```bash +module load gnu_comp/13.1.0 +``` + This sets up the necessary environment variables and paths for the MPI implementation and will give you access to the MPI library. If you are not sure which implementation/version of MPI you should use on a particular cluster, ask a helper or consult your HPC facility's documentation. @@ -126,9 +127,11 @@ HPC clusters don't usually have GUI-based IDEs installed on them. We can write code locally, and copy it across using `scp` or `rsync`, but most IDEs have the ability to open folders on a remote machine, or to automatically synchronise a local folder with a remote one. For **VSCode**, the [Remote-SSH](https://code.visualstudio.com/docs/remote/ssh) extension provides most of the functionality of a regular VSCode window, but on a remote machine. -Some older Linux systems don't support it - in that case, try the [SSH FS](https://marketplace.visualstudio.com/items?itemName=Kelvin.vscode-sshfs) extension instead. +Some older Linux systems don't support it - in that case, try the +[SSH FS](https://marketplace.visualstudio.com/items?itemName=Kelvin.vscode-sshfs) extension instead. -Other IDEs like **CLion** also support [a variety of remote development methods](https://www.jetbrains.com/help/clion/remote-development.html). +Other IDEs like **CLion** also support +[a variety of remote development methods](https://www.jetbrains.com/help/clion/remote-development.html). :::: ## Running a code with MPI @@ -192,7 +195,7 @@ Hello World! Just running a program with `mpiexec` creates several instances of our application. The number of instances is determined by the `-n` parameter, which specifies the desired number of processes. These instances are independent and execute different parts of the program simultaneously. Behind the scenes, `mpiexec` undertakes several critical tasks. -It sets up communication between the processes, enabling them to exchange data and synchronize their actions. +It sets up communication between the processes, enabling them to exchange data and synchronise their actions. Additionally, `mpiexec` takes responsibility for allocating the instances across the available hardware resources, deciding where to place each process for optimal performance. It handles the spawning, killing, and naming of processes, ensuring proper execution and termination. :::: @@ -205,24 +208,35 @@ As we've just learned, running a program with `mpiexec` or `mpirun` results in t mpirun -n 4 ./hello_world ``` -However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. 
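If you are curious where `mpiexec` actually placed each copy of the program, a quick way to find out is to ask MPI for the name of the host each process is running on. The sketch below is purely illustrative: it uses `MPI_Comm_rank()`, which is introduced properly just below, and `MPI_Get_processor_name()`, which is a standard MPI call but is not otherwise covered in this course.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  // which copy (rank) am I?

    char node_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(node_name, &name_len);  // which node am I running on?

    printf("Copy (rank) %d is running on node %s\n", my_rank, node_name);

    return MPI_Finalize();
}
```

Compile and run it exactly like the hello world example (`mpicc` then `mpirun -n 4 ...`); on a single machine every copy will usually report the same node name, whereas on a cluster you may see the processes spread across several nodes.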
-For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. -This usually also requires knowing the total number of tasks running at the same time. +However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. This usually also requires knowing the total number of tasks running at the same time. -- The program needs to call the `MPI_Init` function. -- `MPI_Init` sets up the environment for MPI, and assigns a number (called the *rank*) to each process. -- At the end, each process should also cleanup by calling `MPI_Finalize`. +- The program needs to call the `MPI_Init()` function. +- `MPI_Init()` sets up the environment for MPI, and assigns a number (called the *rank*) to each process. +- At the end, each process should also cleanup by calling `MPI_Finalize()`. ```c int MPI_Init(&argc, &argv); int MPI_Finalize(); ``` -Both `MPI_Init` and `MPI_Finalize` return an integer. -This describes errors that may happen in the function. -Usually we will return the value of `MPI_Finalize` from the main function. +Both `MPI_Init()` and `MPI_Finalize()` return an integer. This describes errors that may happen in the function. +Usually we will return the value of `MPI_Finalize()` from the main function. + +::::callout + +## I don't use command line arguments + +If your main function has no arguments because your program doesn't use any command line arguments, you can instead pass NULL to `MPI_Init()` instead. + +```c +int main(void) { + MPI_Init(NULL, NULL); + return MPI_Finalize(); +} +``` +:::: -After MPI is initialized, you can find out the total number of ranks and the specific rank of the copy: +After MPI is initialised, you can find out the total number of ranks and the specific rank of the copy: ```c int num_ranks, my_rank; @@ -251,7 +265,7 @@ int main(int argc, char** argv) { printf("My rank number is %d out of %d\n", my_rank, num_ranks); - // Call finalize at the end + // Call finalise at the end return MPI_Finalize(); } ``` @@ -270,10 +284,10 @@ mpirun -n 4 mpi_rank You should see something like (although the ordering may be different): ```text -My rank number is 1 -My rank number is 2 -My rank number is 0 -My rank number is 3 +My rank number is 1 out of 4 +My rank number is 2 out of 4 +My rank number is 3 out of 4 +My rank number is 0 out of 4 ``` The reason why the results are not returned in order is because the order in which the processes run is arbitrary. @@ -367,5 +381,5 @@ For this we would need ways for ranks to communicate - the primary benefit of MP ## What About Python? -In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), the initialization and finalization of MPI are handled by the library, and the user can perform MPI calls after ``from mpi4py import MPI``. +In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), the initialisation and finalisation of MPI are handled by the library, and the user can perform MPI calls after ``from mpi4py import MPI``. 
:::: diff --git a/high_performance_computing/hpc_mpi/03_communicating_data.md b/high_performance_computing/hpc_mpi/03_communicating_data.md index 3643f7a6..3cf757ad 100644 --- a/high_performance_computing/hpc_mpi/03_communicating_data.md +++ b/high_performance_computing/hpc_mpi/03_communicating_data.md @@ -1,9 +1,7 @@ --- name: Communicating Data in MPI -dependsOn: [ - high_performance_computing.hpc_mpi.02_mpi_api -] -tags: [] +dependsOn: [high_performance_computing.hpc_mpi.02_mpi_api] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -18,241 +16,287 @@ learningOutcomes: - List the basic MPI data types. --- -In previous episodes we've seen that when we run an MPI application, multiple *independent* processes are created which do their own work, on their own data, in their own private memory space. -At some point in our program, one rank will probably need to know about the data another rank has, such as when combining a problem back together which was split across ranks. -Since each rank's data is private to itself, we can't just access another rank's memory and get what we need from there. -We have to instead explicitly *communicate* data between ranks. Sending and receiving data between ranks form some of the most basic building blocks in any MPI application, and the success of your parallelisation often relies on how you communicate data. +In previous episodes we've seen that when we run an MPI application, multiple *independent* processes are created which do their own work, on their own data, in their own private memory space. At some point in our program, one rank will probably need to know about the data another rank has, such as when combining a problem back together which was split across ranks. Since each rank's data is private to itself, we can't just access another rank's memory and get what we +need from there. We have to instead explicitly *communicate* data between ranks. Sending and receiving data +between ranks form some of the most basic building blocks in any MPI application, and the success of your parallelisation often relies on how you communicate data. ## Communicating data using messages -MPI is a standardised framework for passing data and other messages between independently running processes. -If we want to share or access data from one rank to another, we use the MPI API to transfer data in a "message." -To put it simply, a message is merely a data structure which contains the data, and is usually expressed as a collection of data elements of a particular data type. - -Sending and receiving data happens in one of two ways. -We either want to send data from one specific rank to another, known as point-to-point communication, or to/from multiple ranks all at once to a single target, known as collective communication. -In both cases, we have to *explicitly* "send" something and to *explicitly* "receive" something. -We've emphasised *explicitly* here to make clear that data communication can't happen by itself. -That is a rank can't just "pluck" data from one rank, because a rank doesn't automatically send and receive the data it needs. -If we don't program in data communication, data can't be shared. -None of this communication happens for free, either. -With every message sent, there is an associated overhead which impacts the performance of your program. -Often we won't notice this overhead, as it is quite small. 
-But if we communicate large amounts data or too often, those small overheads can rapidly add up into a noticeable performance hit. +MPI is a framework for passing data and other messages between independently running processes. If we want to share or +access data from one rank to another, we use the MPI API to transfer data in a "message." A message is a data structure +which contains the data we want to send, and is usually expressed as a collection of data elements of a particular data +type. + +Sending and receiving data can happen happen in two patterns. We either want to send data from one specific rank to +another, known as point-to-point communication, or to/from multiple ranks all at once to a single or multiples targets, +known as collective communication. Whatever we do, we always have to *explicitly* "send" something and to *explicitly* +"receive" something. Data communication can't happen by itself. A rank can't just get data from one rank, and ranks +don't automatically send and receive data. If we don't program in data communication, data isn't exchanged. +Unfortunately, none of this communication happens for free, either. With every message sent, there is an overhead which +impacts the performance of your program. Often we won't notice this overhead, as it is usually quite small. But if we +communicate large data or small amounts too often, those (small) overheads add up into a noticeable performance hit. + +To get an idea of how communication typically happens, imagine we have two ranks: rank A and rank B. If rank A wants to +send data to rank B (e.g., point-to-point), it must first call the appropriate MPI send function which typically (but +not always, as we'll find out later) puts that data into an internal *buffer*; known as the **send buffer** or +the **envelope**. Once the data is in the buffer, MPI figures out how to route the message to rank B (usually over a +network) and then sends it to B. To receive the data, rank B must call a data receiving function which will listen for +messages being sent to it. In some cases, rank B will then send an acknowledgement to say that the transfer has +finished, similar to read receipts in e-mails and instant messages. + +:::::challenge{id=check-understanding, title="Check Your Understanding"} +Consider a simulation where each rank calculates the physical properties for a subset of cells on a very large grid of points. One step of the calculation needs to know the average temperature across the entire grid of points. How would you approach calculating the average temperature? + +::::solution +There are multiple ways to approach this situation, but the most efficient approach would be to use collective operations to send the average temperature to a main rank which performs the final calculation. You can, of course, also use a point-to-point pattern, but it would be less efficient, especially with a large number of ranks. + +If the simulation wasn't done in parallel, or was instead using shared-memory parallelism, such as OpenMP, we wouldn't need to do any communication to get the data required to calculate the average. +:::: +::::: + +## MPI data types + +When we send a message, MPI needs to know the size of the data being transferred. The size is not the number of bytes of +data being sent, as you may expect, but is instead the number of elements of a specific data type being sent. When we +send a message, we have to tell MPI how many elements of "something" we are sending and what type of data it is. 
If we +don't do this correctly, we'll either end up telling MPI to send only *some* of the data or try to send more data than +we want! For example, if we were sending an array and we specify too few elements, then only a subset of the array will +be sent or received. But if we specify too many elements, than we are likely to end up with either a segmentation fault +or undefined behaviour! And the same can happen if we don't specify the correct data type. + +There are two types of data type in MPI: "basic" data types and derived data types. The basic data types are in essence +the same data types we would use in C such as `int`, `float`, `char` and so on. However, MPI doesn't use the same +primitive C types in its API, using instead a set of constants which internally represent the data types. These data +types are in the table below: + +| MPI basic data type | C equivalent | +| ---------------------- | ---------------------- | +| MPI_SHORT | short int | +| MPI_INT | int | +| MPI_LONG | long int | +| MPI_LONG_LONG | long long int | +| MPI_UNSIGNED_CHAR | unsigned char | +| MPI_UNSIGNED_SHORT | unsigned short int | +| MPI_UNSIGNED | unsigned int | +| MPI_UNSIGNED_LONG | unsigned long int | +| MPI_UNSIGNED_LONG_LONG | unsigned long long int | +| MPI_FLOAT | float | +| MPI_DOUBLE | double | +| MPI_LONG_DOUBLE | long double | +| MPI_BYTE | char | + +Remember, these constants aren't the same as the primitive types in C, so we can't use them to create variables, e.g., + +```c +MPI_INT my_int = 1; +``` + +is not valid code because, under the hood, these constants are actually special data structures used by MPI. Therefore +we can only them as arguments in MPI functions. ::::callout -## Common mistakes +## Don't forget to update your types + +At some point during development, you might change an `int` to a `long` or a `float` to a `double`, or something to +something else. Once you've gone through your codebase and updated the types for, e.g., variable declarations and +function arguments, you must do the same for MPI functions. If you don't, you'll end up running into communication +errors. It could be helpful to define compile-time constants for the data types and use those instead. If you ever do +need to change the type, you would only have to do it in one place, e.g.: -A common mistake for new MPI users is to write code using point-to-point communication which emulates what the collective communication functions are designed to do. -This is an inefficient way to share data. -The collective routines in MPI have multiple tricks and optimizations up their sleeves, resulting in communication overheads much lower than the equivalent point-to-point approach. -One other advantage is that collective communication often requires less code to achieve the same thing, which is always a win. -It is there almost always better to use collective operations where you can. +```c +// define constants for your data types +#define MPI_INT_TYPE MPI_INT +#define INT_TYPE int +// use them as you would normally +INT_TYPE my_int = 1; +``` :::: -To get an idea of how communication typically happens, imagine we have two ranks: rank A and rank B. -If rank A wants to send data to rank B (e.g., point-to-point), it must first call the appropriate MPI send function which puts that data into an internal *buffer*; sometimes known as the send buffer or envelope. -Once the data is in the buffer, MPI figures out how to route the message to rank B (usually over a network) and sends it to B. 
-To receive the data, rank B must call a data receiving function which will listen for any messages being sent. -When the message has been successfully routed and the data transfer complete, rank B sends an acknowledgement back to rank A to say that the transfer has finished, similarly to how read receipts work in e-mails and instant messages. +Derived data types, on the other hand, are very similar to C structures which we define by using the basic MPI data types. They're often useful to group together similar data in communications, or when you need to send a structure from one rank to another. This is covered in more detail in the optional [Advanced Communication Techniques](../hpc_mpi/11_advanced_communication.md) episode. -:::::challenge{id=check-understanding, title="Check Your Understanding"} -In an imaginary simulation, each rank is responsible for calculating the physical properties for a subset of cells on a larger simulation grid. -Another calculation, however, needs to know the average of, for example, the temperature for the subset of cells for each rank. What approaches could you use to share this data? +:::::challenge{id=what-type, title="What Type Should You Use?"} +For the following pieces of data, what MPI data types should you use? + +1. `a[] = {1, 2, 3, 4, 5};` +2. `a[] = {1.0, -2.5, 3.1456, 4591.223, 1e-10};` +3. `a[] = "Hello, world!";` ::::solution -There are multiple ways to approach this situation, but the most efficient approach would be to use collective operations to send the average temperature to a root rank (or all ranks) to perform the final calculation. -You can, of course, also use a point-to-point pattern, but it would be less efficient. + +The fact that `a[]` is an array does not matter, because all of the elemnts in `a[]` will be the same data type. In MPI, as we'll see in the next episode, we can either send a single value or multiple values (in an array). + +1. `MPI_INT` +2. `MPI_DOUBLE` - `MPI_FLOAT` would not be correct as `float`'s contain 32 bits of data whereas `double`s are 64 bit. +3. `MPI_BYTE` or `MPI_CHAR` - you may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent. :::: ::::: -### Communication modes +## Communicators + +All communication in MPI is handled by something known as a **communicator**. We can think of a communicator as being a +collection of ranks which are able to exchange data with one another. What this means is that every communication +between two (or more) ranks is linked to a specific communicator. When we run an MPI application, every rank will belong +to the default communicator known as `MPI_COMM_WORLD`. We've seen this in earlier episodes when, for example, we've used +functions like `MPI_Comm_rank()` to get the rank number, + +```c +int my_rank; +MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); // MPI_COMM_WORLD is the communicator the rank belongs to +``` + +In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ranks into them. Messages can only be sent and received to and from the same communicator, effectively isolating messages to a communicator. For most applications, we usually don't need anything other than `MPI_COMM_WORLD`. 
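If you ever do need more than `MPI_COMM_WORLD`, new communicators are usually created by splitting an existing one. The sketch below is purely illustrative — it uses `MPI_Comm_split()` and `MPI_Comm_free()`, neither of which is covered further in this course — and simply groups ranks into "even" and "odd" sub-communicators.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Ranks which pass the same "colour" end up in the same sub-communicator.
    // Here even ranks get colour 0 and odd ranks get colour 1.
    int colour = my_rank % 2;

    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, colour, my_rank, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);  // rank numbers start again from 0 in the new communicator
    MPI_Comm_size(sub_comm, &sub_size);

    printf("World rank %d is rank %d of %d in sub-communicator %d\n",
           my_rank, sub_rank, sub_size, colour);

    MPI_Comm_free(&sub_comm);  // clean up the communicator we created

    return MPI_Finalize();
}
```

Messages sent using `sub_comm` are completely separate from messages sent using `MPI_COMM_WORLD`, which is exactly the isolation between communicators described above.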
But organising ranks into communicators can be helpful in some circumstances, as you can create small "work units" of multiple ranks to dynamically schedule the workload, or to help compartmentalise the problem into smaller chunks by using a +[virtual cartesian topology](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node192.htm#Node192). Throughout this course, we will stick to using `MPI_COMM_WORLD`. -There are multiple "modes" on how data is sent in MPI: standard, buffered, synchronous and ready. -When an MPI communication function is called, control/execution of the program is passed from the calling program to the MPI function. -Your program won't continue until MPI is happy that the communication happened successfully. -The difference between the communication modes is the criteria of a successful communication. +## Communication modes -To use the different modes, we don't pass a special flag. -Instead, MPI uses different functions to separate the different modes. -The table below lists the four modes with a description and their associated functions (which will be covered in detail in the following episodes). +When sending data between ranks, MPI will use one of four communication modes: synchronous, buffered, ready or standard. When a communication function is called, it takes control of program execution until the send-buffer is safe to be re-used again. What this means is that it's safe to re-use the memory/variable you passed without affecting the data that is still being sent. If MPI didn't have this concept of safety, then you could quite easily overwrite or destroy any data before it is transferred fully! This would lead to some very strange behaviour which would be hard to debug. The difference between the communication mode is when the buffer becomes safe to re-use. MPI won't guess at which mode *should* be used. That is up to the programmer. Therefore each mode has an associated communication function: -| Mode | Description | MPI Function | -| - | - | - | -| Synchronous | Returns control to the program when the message has been sent and received successfully | `MPI_Ssend()` | -| Buffered | Control is returned when the data has been copied into in the send buffer, regardless of the receive being completed or not | `MPI_Bsend()` | -| Standard | Either buffered or synchronous, depending on the size of the data being sent and what your specific MPI implementation chooses to use | `MPI_Send()` | -| Ready | Will only return control if the receiving rank is already listening for a message | `MPI_Rsend()` | + +| Mode | Blocking function | +| ----------- | ----------------- | +| Synchronous | `MPI_SSend()` | +| Buffered | `MPI_Bsend()` | +| Ready | `MPI_Rsend()` | +| Send | `MPI_Send()` | In contrast to the four modes for sending data, receiving data only has one mode and therefore only a single function. -| Mode | Description | MPI Function | -| - | - | - | -| Receive | Returns control when data has been received successfully | `MPI_Recv()` | +| Mode | MPI Function | +| ------- | ------------ | +| Receive | `MPI_Recv()` | -### Blocking vs. non-blocking communication +### Synchronous sends -Communication can also be done in two additional ways: blocking and non-blocking. -In blocking mode, communication functions will only return once the send buffer is ready to be re-used, meaning that the message has been both sent and received. 
-In terms of a blocking synchronous send, control will not be passed back to the program until the message sent by rank A has reached rank B, and rank B has sent an acknowledgement back. -If rank B is never listening for messages, rank A will become *deadlocked*. -A deadlock happens when your program hangs indefinitely because the send (or receive) is unable to complete. -Deadlocks occur for a countless number of reasons. -For example, we may forget to write the corresponding receive function when sending data. -Alternatively, a function may return earlier due to an error which isn't handled properly, or a while condition may never be met creating an infinite loop. -Furthermore, ranks can sometimes crash silently making communication with them impossible, but this doesn't stop any attempts to send data to crashed rank. +In synchronous communication, control is returned when the receiving rank has received the data and sent back, or +"posted", confirmation that the data has been received. It's like making a phone call. Data isn't exchanged until +you and the person have both picked up the phone, had your conversation and hung the phone up. -::::callout +Synchronous communication is typically used when you need to guarantee synchronisation, such as in iterative methods or +time dependent simulations where it is vital to ensure consistency. It's also the easiest communication mode to develop +and debug with because of its predictable behaviour. -## Avoiding communication deadlocks +### Buffered sends -A common piece of advice in C is that when allocating memory using `malloc()`, always write the accompanying call to `free()` to help avoid memory leaks by forgetting to deallocate the memory later. -You can apply the same mantra to communication in MPI. When you send data, always write the code to receive the data as you may forget to later and accidentally cause a deadlock. -:::: +In a buffered send, the data is written to an internal buffer before it is sent and returns control back as soon as the +data is copied. This means `MPI_Bsend()` returns before the data has been received by the receiving rank, making this an +asynchronous type of communication as the sending rank can move onto its next task whilst the data is transmitted. This +is just like sending a letter or an e-mail to someone. You write your message, put it in an envelope and drop it off in +the postbox. You are blocked from doing other tasks whilst you write and send the letter, but as soon as it's in the +postbox, you carry on with other tasks and don't wait for the letter to be delivered! -Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. -A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. -If the workload is well balanced, each rank will finish at roughly the same time and be ready to transfer data at the same time. -But, as shown in the diagram below, if the workload is unbalanced, some ranks will finish their calculations earlier and begin to send their data to the other ranks before they are ready to receive data. -This means some ranks will be sitting around doing nothing whilst they wait for the other ranks to become ready to receive data, wasting computation time. 
+Buffered sends are good for large messages and for improving the performance of your communication patterns by taking +advantage of the asynchronous nature of the data transfer. -![Blocking communication](fig/blocking-wait.png) +### Ready sends -If most of the ranks are waiting around, or one rank is very heavily loaded in comparison, this could massively impact the performance of your program. -Instead of doing calculations, a rank will be waiting for other ranks to complete their work. +Ready sends are different to synchronous and buffered sends in that they need a rank to already be listening to receive +a message, whereas the other two modes can send their data before a rank is ready. It's a specialised type of +communication used **only** when you can guarantee that a rank will be ready to receive data. If this is not the case, +the outcome is undefined and will likely result in errors being introduced into your program. The main advantage of this +mode is that you eliminate the overhead of having to check that the data is ready to be sent, and so is often used in +performance critical situations. -Non-blocking communication hands back control, immediately, before the communication has finished. -Instead of your program being *blocked* by communication, ranks will immediately go back to the heavy work and instead periodically check if there is data to receive (which you must remember to program) instead of waiting around. -The advantage of this communication pattern is illustrated in the diagram below, where less time is spent communicating. +You can imagine a ready send as like talking to someone in the same room, who you think is listening. If they are +listening, then the data is transferred. If it turns out they're absorbed in something else and not listening to you, +then you may have to repeat yourself to make sure your transmit the information you wanted to! -![Non-blocking communication](fig/non-blocking-wait.png) +### Standard sends -This is a common pattern where communication and calculations are interwoven with one another, decreasing the amount of "dead time" where ranks are waiting for other ranks to communicate data. -Unfortunately, non-blocking communication is often more difficult to successfully implement and isn't appropriate for every algorithm. -In most cases, blocking communication is usually easier to implement and to conceptually understand, and is somewhat "safer" in the sense that the program cannot continue if data is missing. -However, the potential performance improvements of overlapping communication and calculation is often worth the more difficult implementation and harder to read/more complex code. +The standard send mode is the most commonly used type of send, as it provides a balance between ease of use and performance. Under the hood, the standard send is either a buffered or a synchronous send, depending on the availability of system resources (e.g. the size of the internal buffer) and which mode MPI has determined to be the most efficient. ::::callout -## Should I use blocking or non-blocking communication? +## Which mode should I use? -When you are first implementing communication into your program, it's advisable to first use blocking synchronous sends to start with, as this is arguably the easiest to use pattern. -Once you are happy that the correct data is being communicated successfully, but you are unhappy with performance, then it would be time to start experimenting with the different communication modes and blocking vs. 
non-blocking patterns to balance performance with ease of use and code readability and maintainability. +Each communication mode has its own use cases where it excels. However, it is often easiest, at first, to use +the standard send, `MPI_Send()`, and optimise later. If the standard send doesn't meet your requirements, or if you need more control over communication, then pick which communication mode suits your requirements best. You'll probably need to experiment to find the best! :::: -:::::challenge{id=communication-in-everyday-life, title="MPI Communication in Everyday Life?"} -We communicate with people non-stop in everyday life, whether we want to or not! -Think of some examples/analogies of blocking and non-blocking communication we use to talk to other people. +::::callout{variant="note"} -::::solution -Probably the most common example of blocking communication in everyday life would be having a conversation or a phone call with someone. -The conversation can't happen and data can't be communicated until the other person responds or picks up the phone. -Until the other person responds, we are stuck waiting for the response. +## Communication mode summary: -Sending e-mails or letters in the post is a form of non-blocking communication we're all familiar with. -When we send an e-mail, or a letter, we don't wait around to hear back for a response. -We instead go back to our lives and start doing tasks instead. -We can periodically check our e-mail for the response, and either keep doing other tasks or continue our previous task once we've received a response back from our e-mail. +|Mode | Description | Analogy | MPI Function | +| --- | ----------- | ------- | ------------ | +| Synchronous | Returns control to the program when the message has been sent and received successfully. | Making a phone call | `MPI_Ssend()`| +| Buffered | Returns control immediately after copying the message to a buffer, regardless of whether the receive has happened or not. | Sending a letter or e-mail | `MPI_Bsend()` | +| Ready | Returns control immediately, assuming the matching receive has already been posted. Can lead to errors if the receive is not ready. | Talking to someone you think/hope is listening | `MPI_Rsend()` | +| Standard | Returns control when it's safe to reuse the send buffer. May or may not wait for the matching receive (synchronous mode), depending on MPI implementation and message size. | Phone call or letter | `MPI_Send()` | :::: -::::: -## Communicators +### Blocking vs. non-blocking communication -Communication in MPI happens in something known as a *communicator*. -We can think of a communicator as fundamentally being a collection of ranks which are able to exchange data with one another. -What this means is that every communication between two (or more) ranks is linked to a specific communicator in the program. -When we run an MPI application, the ranks will belong to the default communicator known as `MPI_COMM_WORLD`. -We've seen this in earlier episodes when, for example, we've used functions like `MPI_Comm_rank()` to get the rank number, +In addition to the communication modes, communication is done in two ways: either by blocking execution +until the communication is complete (like how a synchronous send blocks until an receive acknowledgment is sent back), +or by returning immediately before any part of the communication has finished, with non-blocking communication. 
Just +like with the different communication modes, MPI doesn't decide if it should use blocking or non-blocking communication +calls. That is, again, up to the programmer to decide. As we'll see in later episodes, there are different functions +for blocking and non-blocking communication. -```c -int my_rank; -MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); /* MPI_COMM_WORLD is the communicator the rank belongs to */ -``` +A blocking synchronous send is one where the message has to be sent from rank A, received by B and an acknowledgment +sent back to A before the communication is complete and the function returns. In the non-blocking version, the function +returns immediately even before rank A has sent the message or rank B has received it. It is still synchronous, so rank +B still has to tell A that it has received the data. But, all of this happens in the background so other work can +continue in the foreground which data is transferred. It is then up to the programmer to check periodically if the +communication is done -- and to not modify/use the data/variable/memory before the communication has been completed. -In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ranks into them. -Messages can only be sent and received to and from the same communicator, effectively isolating messages to a communicator. -For most applications, we usually don't need anything other than `MPI_COMM_WORLD`. -But organising ranks into communicators can be helpful in some circumstances, as you can create small "work units" of multiple ranks to dynamically schedule the workload, or to help compartmentalise the problem into smaller chunks by using a [virtual cartesian topology](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node192.htm#Node192). -Throughout this lesson, we will stick to using `MPI_COMM_WORLD`. - -## Basic MPI data types - -To send a message, we need to know the size of it. -The size is not the number of bytes of data that is being sent but is instead expressed as the number of elements of a particular data type you want to send. -So when we send a message, we have to tell MPI how many elements of "something" we are sending and what type of data it is. -If we don't do this correctly, we'll either end up telling MPI to send only *some* of the data or try to send more data than we want! -For example, if we were sending an array and we specify too few elements, then only a subset of the array will be sent or received. -But if we specify too many elements, than we are likely to end up with either a segmentation fault or some undefined behaviour! -If we don't specify the correct data type, then bad things will happen under the hood when it comes to communicating. - -There are two types of data type in MPI: basic MPI data types and derived data types. -The basic data types are in essence the same data types we would use in C (or Fortran), such as `int`, `float`, `bool` and so on. -When defining the data type of the elements being sent, we don't use the primitive C types. -MPI instead uses a set of compile-time constants which internally represents the data types. 
-These data types are in the table below: - -| MPI basic data type | C equivalent | -| - | - | -| MPI_SHORT | short int | -| MPI_INT | int | -| MPI_LONG | long int | -| MPI_LONG_LONG | long long int | -| MPI_UNSIGNED_CHAR | unsigned char | -| MPI_UNSIGNED_SHORT |unsigned short int | -| MPI_UNSIGNED | unsigned int | -| MPI_UNSIGNED_LONG | unsigned long int | -| MPI_UNSIGNED_LONG_LONG | unsigned long long int | -| MPI_FLOAT | float | -| MPI_DOUBLE | double | -| MPI_LONG_DOUBLE | long double | -| MPI_BYTE | char | +:::callout -These constants don't expand out to actual date types, so we can't use them in variable declarations, e.g., +## Is `MPI_Bsend()` non-blocking? -```c -MPI_INT my_int; -``` +The buffered communication mode is a type of asynchronous communication, because the function returns before the data has been received by another rank. But, it's not a non-blocking call **unless** you use the non-blocking version +`MPI_Ibsend()` (more on this later). Even though the data transfer happens in the background, allocating and copying data to the send buffer happens in the foreground, blocking execution of our program. On the other hand, `MPI_Ibsend()` is "fully" asynchronous because even allocating and copying data to the send buffer happens in the background. +::: -is not valid code because under the hood, these constants are actually special structs used internally. -Therefore we can only uses these expressions as arguments in MPI functions. +One downside to blocking communication is that if rank B is never listening for messages, rank A will become *deadlocked*. A deadlock happens when your program hangs indefinitely because the send (or receive) is unable to +complete. Deadlocks occur for a countless number of reasons. For example, we may forget to write the corresponding +receive function when sending data. Or a function may return earlier due to an error which isn't handled properly, or a +while condition may never be met creating an infinite loop. Ranks can also can silently, making communication with them +impossible, but this doesn't stop any attempts to send data to crashed rank. ::::callout -## Don't forget to update your types +## Avoiding communication deadlocks -At some point during development, you might change an `int` to a `long` or a `float` to a `double`, or something to something else. -Once you've gone through your codebase and updated the types for, e.g., variable declarations and function signatures, you must also do the same for MPI functions. -If you don't, you'll end up running into communication errors. -It may be helpful to define compile-time constants for the data types and use those instead. If you ever do need to change the type, you would only have to do it in one place. +A common piece of advice in C is that when allocating memory using `malloc()`, always write the accompanying call to +`free()` to help avoid memory leaks by forgetting to deallocate the memory later. +You can apply the same mantra to communication in MPI. When you send data, always write the code to receive the data as you may forget to later and accidentally cause a deadlock. +:::: -```c -/* define constants for your data types */ -#define AGE_MPI_TYPE MPI_INT -#define AGE_TYPE int -/* use them as you would normally */ -AGE_TYPE my_age = 25; -``` +Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. 
A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. If the workload is well balanced, each rank will finish at roughly the same time and be ready to transfer data at the same time. But, as shown in the diagram below, if the workload is unbalanced, some ranks will finish their calculations earlier and begin to send their data to the other ranks before they are ready to receive data. This means some ranks will be sitting around doing nothing whilst they wait for the other ranks to become ready to receive data, wasting computation time. -:::: +![Blocking communication](fig/blocking-wait.png) -Derived data types are data structures which you define, built using the basic MPI data types. -These derived types are analogous to defining structures or type definitions in C. -They're most often helpful to group together similar data to send/receive multiple things in a single communication, or when you need to communicate non-contiguous data such as "vectors" or sub-sets of an array. -This will be covered in the **Advanced Communication Techniques** episode. +If most of the ranks are waiting around, or one rank is very heavily loaded in comparison, this could massively impact the performance of your program. Instead of doing calculations, a rank will be waiting for other ranks to complete their work. -:::::challenge{id=what-type, title="What Type Should You Use?"} -For the following pieces of data, what MPI data types should you use? +Non-blocking communication hands back control, immediately, before the communication has finished. Instead of your +program being *blocked* by communication, ranks will immediately go back to the heavy work and instead periodically +check if there is data to receive (which is up to the programmer) instead of waiting around. The advantage of this +communication pattern is illustrated in the diagram below, where less time is spent communicating. -1. `a[] = {1, 2, 3, 4, 5};` -2. `a[] = {1.0, -2.5, 3.1456, 4591.223, 1e-10};` -3. `a[] = "Hello, world!";` +![Non-blocking communication](fig/non-blocking-wait.png) -::::solution +This is a common pattern where communication and calculations are interwoven with one another, decreasing the amount of "dead time" where ranks are waiting for other ranks to communicate data. +Unfortunately, non-blocking communication is often more difficult to successfully implement and isn't appropriate for every algorithm. In most cases, blocking communication is usually easier to implement and to conceptually understand, and is somewhat "safer" in the sense that the program cannot continue if data is missing. +However, the potential performance improvements of overlapping communication and calculation is often worth the more difficult implementation and harder to read/more complex code. -1. `MPI_INT` -2. `MPI_DOUBLE` - `MPI_FLOAT` would not be correct as `float`'s contain 32 bits of data whereas `double`s are 64 bit. -3. `MPI_BYTE` or `MPI_CHAR` - you may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent +::::callout + +## Should I use blocking or non-blocking communication? + +When you are first implementing communication into your program, it's advisable to first use blocking synchronous sends to start with, as this is arguably the easiest to use pattern. 
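As a concrete, purely illustrative sketch of this advice — using the `MPI_Ssend()` and `MPI_Recv()` calls that are only introduced properly in the next episode — the snippet below has pairs of ranks exchange a value with blocking synchronous sends. The comments point out the ordering mistake that would turn it into a deadlock; `MPI_STATUS_IGNORE` simply tells MPI we don't need the status information.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int my_rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    // Pair ranks up: 0 <-> 1, 2 <-> 3, and so on
    int partner = (my_rank % 2 == 0) ? my_rank + 1 : my_rank - 1;

    int send_value = my_rank, recv_value = -1;

    if (partner < num_ranks) {
        if (my_rank % 2 == 0) {
            // Even ranks send first, then receive ...
            MPI_Ssend(&send_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(&recv_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            // ... and odd ranks receive first, then send.
            // If *both* sides called MPI_Ssend() first, neither send could complete
            // and the program would deadlock.
            MPI_Recv(&recv_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Ssend(&send_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        }
        printf("Rank %d exchanged data with rank %d (received %d)\n", my_rank, partner, recv_value);
    }

    return MPI_Finalize();
}
```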
Once you are happy that the correct data is being communicated successfully, but you are unhappy with performance, then it would be time to start experimenting with the different communication modes and blocking vs. non-blocking patterns to balance performance with ease of use and code readability and maintainability. +:::: + +:::::challenge{id=communication-in-everyday-life, title="MPI Communication in Everyday Life?"} +We communicate with people non-stop in everyday life, whether we want to or not! +Think of some examples/analogies of blocking and non-blocking communication we use to talk to other people. + +::::solution +Probably the most common example of blocking communication in everyday life would be having a conversation or a phone call with someone. +The conversation can't happen and data can't be communicated until the other person responds or picks up the phone. +Until the other person responds, we are stuck waiting for the response. +Sending e-mails or letters in the post is a form of non-blocking communication we're all familiar with. When we send an e-mail, or a letter, we don't wait around to hear back for a response. We instead go back to our lives and start doing tasks instead. We can periodically check our e-mail for the response, and either keep doing other tasks or continue our previous task once we've received a response back from our e-mail. :::: ::::: diff --git a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md index 9177ee28..5e96b262 100644 --- a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md +++ b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md @@ -1,9 +1,7 @@ --- name: Point-to-Point Communication -dependsOn: [ - high_performance_computing.hpc_mpi.03_communicating_data -] -tags: [] +dependsOn: [high_performance_computing.hpc_mpi.03_communicating_data] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -13,32 +11,31 @@ attribution: learningOutcomes: - Describe what is meant by point-to-point communication. - Learn how to send and receive data between ranks. - --- In the previous episode we introduced the various types of communication in MPI. -In this section we will use the MPI library functions `MPI_Send` and `MPI_Recv`, which employ point-to-point communication, to send data from one rank to another. +In this section we will use the MPI library functions `MPI_Send()` and `MPI_Recv()`, which employ point-to-point communication, to send data from one rank to another. -![Sending data from one rank to another using MPI_SSend and MPI_Recv](fig/send-recv.png) +![Sending data from one rank to another using MPI_SSend and MPI_Recv()](fig/send-recv.png) -Let's look at how `MPI_Send` and `MPI_Recv`are typically used: +Let's look at how `MPI_Send()` and `MPI_Recv()`are typically used: - Rank A decides to send data to rank B. It first packs the data to send into a buffer, from which it will be taken. -- Rank A then calls `MPI_Send` to create a message for rank B. +- Rank A then calls `MPI_Send()` to create a message for rank B. The underlying MPI communication is then given the responsibility of routing the message to the correct destination. -- Rank B must know that it is about to receive a message and acknowledge this by calling `MPI_Recv`. +- Rank B must know that it is about to receive a message and acknowledge this by calling `MPI_Recv()`. 
This sets up a buffer for writing the incoming data when it arrives and instructs the communication device to listen for the message. -As mentioned in the previous episode, `MPI_Send` and `MPI_Recv` are *synchronous* operations, +As mentioned in the previous episode, `MPI_Send()` and `MPI_Recv()` are *synchronous* operations, and will not return until the communication on both sides is complete. -## Sending a Message: MPI_Send +## Sending a Message: MPI_Send() -The `MPI_Send` function is defined as follows: +The `MPI_Send()` function is defined as follows: ```c int MPI_Send( - const void* data, + const void *data, int count, MPI_Datatype datatype, int destination, @@ -46,15 +43,15 @@ int MPI_Send( MPI_Comm communicator ) ``` +| | | +| ------- | -------- | +| `*data`: | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | +| `count`: | Number of elements to send | +| `datatype`: | The type of the element data being sent, e.g. MPI_INTEGER, MPI_CHAR, MPI_FLOAT, MPI_DOUBLE, ... | +| `destination`: | The rank number of the rank the data will be sent to | +| `tag`: | An message tag (integer), which is used to differentiate types of messages. We can specify `0` if we don't need different types of messages | +| `communicator`: | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | -| Argument | Function | -| --- | -------- | -| `data` | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | -| `count` | Number of elements to send | -| `datatype` | The type of the element data being sent, e.g. MPI_INTEGER, MPI_CHAR, MPI_FLOAT, MPI_DOUBLE, ... | -| `destination` | The rank number of the rank the data will be sent to | -| `tag` | An optional message tag (integer), which is optionally used to differentiate types of messages. We can specify `0` if we don't need different types of messages | -| `communicator` | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | For example, if we wanted to send a message that contains `"Hello, world!\n"` from rank 0 to rank 1, we could state (assuming we were rank 0): @@ -64,14 +61,13 @@ char *message = "Hello, world!\n"; MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); ``` -So we are sending 14 elements of `MPI_CHAR` one time, and specified `0` for our message tag since we don't anticipate having to send more than one type of message. -This call is synchronous, and will block until the corresponding `MPI_Recv` operation receives and acknowledges receipt of the message. +So we are sending 14 elements of `MPI_CHAR()` one time, and specified `0` for our message tag since we don't anticipate having to send more than one type of message. This call is synchronous, and will block until the corresponding `MPI_Recv()` operation receives and acknowledges receipt of the message. ::::callout -## MPI_Ssend: an Alternative to MPI_Send +## MPI_Ssend(): an Alternative to MPI_Send() -`MPI_Send` represents the "standard mode" of sending messages to other ranks, but some aspects of its behaviour are dependent on both the implementation of MPI being used, and the circumstances of its use. There are three scenarios to consider: +`MPI_Send()` represents the "standard mode" of sending messages to other ranks, but some aspects of its behaviour are dependent on both the implementation of MPI being used, and the circumstances of its use. There are three scenarios to consider: 1. 
The message is directly passed to the receive buffer, in which case the communication has completed 2. The send message is buffered within some internal MPI buffer but hasn't yet been received @@ -80,33 +76,35 @@ This call is synchronous, and will block until the corresponding `MPI_Recv` oper In scenarios 1 & 2, the call is able to return immediately, but with 3 it may block until the recipient is ready to receive. It is dependent on the MPI implementation as to what scenario is selected, based on performance, memory, and other considerations. -A very similar alternative to `MPI_Send` is to use `MPI_Ssend` - synchronous send - which ensures the communication is both synchronous and blocking. +A very similar alternative to `MPI_Send()` is to use `MPI_Ssend()` - synchronous send - which ensures the communication is both synchronous and blocking. This function guarantees that when it returns, the destination has categorically started receiving the message. :::: -## Receiving a Message: MPI_Recv +## Receiving a Message: MPI_Recv() -Conversely, the `MPI_Recv` function looks like the following: +Conversely, the `MPI_Recv()` function looks like the following: ```c int MPI_Recv( - void* data, + void *data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm communicator, - MPI_Status* status + MPI_Status *status ) ``` -| `data`: | Pointer to where the received data should be written | +| | | +| --- | ---- | +| `*data`: | Pointer to where the received data should be written | | `count`: | Maximum number of elements to receive | | `datatype`: | The type of the data being received | | `source`: | The number of the rank sending the data | | `tag`: | A message tag (integer), which must either match the tag in the sent message, or if `MPI_ANY_TAG` is specified, a message with any tag will be accepted | | `communicator`: | The communicator (we have used `MPI_COMM_WORLD` in earlier examples) | -| `status`: | A pointer for writing the exit status of the MPI command, indicating | +| `status`: | A pointer for writing the exit status of the MPI command, indicating whether the operation succeeded or failed | Continuing our example, to receive our message we could write: @@ -117,11 +115,11 @@ MPI_Recv(message, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status); ``` Here, we create our buffer to receive the data, as well as a variable to hold the exit status of the receive operation. -We then call `MPI_Recv`, specifying our returned data buffer, the number of elements we will receive (14) which will be of type `MPI_CHAR` and sent by rank 0, with a message tag of 0. -As with `MPI_Send`, this call will block - in this case until the message is received and acknowledgement is sent to rank 0, at which case both ranks will proceed. +We then call `MPI_Recv()`, specifying our returned data buffer, the number of elements we will receive (14) which will be of type `MPI_CHAR` and sent by rank 0, with a message tag of 0. +As with `MPI_Send()`, this call will block - in this case until the message is received and acknowledgement is sent to rank 0, at which case both ranks will proceed. Let's put this together with what we've learned so far. 
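In outline, and assuming `rank` holds the result of `MPI_Comm_rank()`, the two halves of the exchange pair up as in this sketch, with the count, datatype and tag matching on both sides; the full example program follows.

```c
char *message = "Hello, world!\n";
char buffer[14];
MPI_Status status;

if (rank == 0) {
    // Rank 0 sends 14 characters to rank 1 with tag 0...
    MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    // ...and rank 1 posts the matching receive with the same count, datatype and tag
    MPI_Recv(buffer, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
```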
-Here's an example program that uses `MPI_Send` and `MPI_Recv` to send the string `"Hello World!"` from rank 0 to rank 1: +Here's an example program that uses `MPI_Send()` and `MPI_Recv()` to send the string `"Hello World!"` from rank 0 to rank 1: ```c #include @@ -138,7 +136,7 @@ int main(int argc, char** argv) { if( n_ranks != 2 ){ printf("This example requires exactly two ranks\n"); MPI_Finalize(); - return(1); + return 1; } // Get my rank @@ -156,11 +154,19 @@ int main(int argc, char** argv) { printf("%s",message); } - // Call finalize at the end + // Call finalise at the end return MPI_Finalize(); } ``` +::::callout + +## MPI Data Types in C + +In the above example we send a string of characters and therefore specify the type `MPI_CHAR`. For a complete list of types, +see [the MPICH documentation](https://www.mpich.org/static/docs/v3.3/www3/Constants.html). +:::: + :::::challenge{id=try-it-out, title="Try It Out"} Compile and run the above code. Does it behave as you expect? @@ -191,7 +197,7 @@ Try modifying, compiling, and re-running the code to see what happens if you... ::::solution 1. The program will hang since it's waiting for a message with a tag that will never be sent (press `Ctrl-C` to kill the hanging process). - To resolve this, make the tag in `MPI_Recv` match the tag you specified in `MPI_Send`. + To resolve this, make the tag in `MPI_Recv()` match the tag you specified in `MPI_Send()`. 2. You will likely see a message like the following: ```text @@ -252,7 +258,7 @@ int main(int argc, char** argv) { } } - // Call finalize at the end + // Call finalise at the end return MPI_Finalize(); } ``` @@ -261,6 +267,7 @@ int main(int argc, char** argv) { ::::: :::::challenge{id=hello-again, title="Hello Again, World!"} + Modify the Hello World code below so that each rank sends its message to rank 0. Have rank 0 print each message. @@ -268,19 +275,22 @@ Have rank 0 print each message. 
#include #include -int main(int argc, char** argv) { - int rank; +int main(int argc, char **argv) { + int rank; + int message[30]; - // First call MPI_Init - MPI_Init(&argc, &argv); + // First call MPI_Init + MPI_Init(&argc, &argv); - // Get my rank - MPI_Comm_rank(MPI_COMM_WORLD, &rank); + // Get my rank + MPI_Comm_rank(MPI_COMM_WORLD, &rank); - printf("Hello World, I'm rank %d\n", rank); + // Print a message using snprintf and then printf + snprintf(message, 30, "Hello World, I'm rank %d", rank); + printf("%s\n", message); - // Call finalize at the end - return MPI_Finalize(); + // Call finalise at the end + return MPI_Finalize(); } ``` @@ -290,167 +300,97 @@ int main(int argc, char** argv) { #include #include -int main(int argc, char** argv) { - int rank, n_ranks, numbers_per_rank; - - // First call MPI_Init - MPI_Init(&argc, &argv); - // Get my rank and the number of ranks - MPI_Comm_rank(MPI_COMM_WORLD, &rank); - MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); - - if( rank != 0 ) { - // All ranks other than 0 should send a message - - char message[30]; - sprintf(message, "Hello World, I'm rank %d\n", rank); - MPI_Send(message, 30, MPI_CHAR, 0, 0, MPI_COMM_WORLD); - - } else { - // Rank 0 will receive each message and print them - - for( int sender = 1; sender < n_ranks; sender++ ) { - char message[30]; - MPI_Status status; - - MPI_Recv(message, 30, MPI_CHAR, sender, 0, MPI_COMM_WORLD, &status); - printf("%s",message); +int main(int argc, char **argv) { + int rank, n_ranks, numbers_per_rank; + + // First call MPI_Init + MPI_Init(&argc, &argv); + // Get my rank and the number of ranks + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); + + if (rank != 0) { + // All ranks other than 0 should send a message + + char message[30]; + sprintf(message, "Hello World, I'm rank %d\n", rank); + MPI_Send(message, 30, MPI_CHAR, 0, 0, MPI_COMM_WORLD); + + } else { + // Rank 0 will receive each message and print them + + for( int sender = 1; sender < n_ranks; sender++ ) { + char message[30]; + MPI_Status status; + + MPI_Recv(message, 30, MPI_CHAR, sender, 0, MPI_COMM_WORLD, &status); + printf("%s",message); + } } - } - - // Call finalize at the end - return MPI_Finalize(); + + // Call finalise at the end + return MPI_Finalize(); } ``` - :::: ::::: :::::challenge{id=blocking, title="Blocking"} Try the code below and see what happens. How would you change the code to fix the problem? -*Note: If you are using the MPICH library, this example might automagically work. With OpenMPI it shouldn't!)* +Note: *If you are using the MPICH, this example might automagically work. 
With OpenMPI it shouldn't!* ```c -#include -#include #include -int main(int argc, char** argv) { - int rank, n_ranks, neighbour; - int n_numbers = 10000; - int *send_message; - int *recv_message; - MPI_Status status; - - send_message = malloc(n_numbers*sizeof(int)); - recv_message = malloc(n_numbers*sizeof(int)); +#define ARRAY_SIZE 3 - // First call MPI_Init - MPI_Init(&argc, &argv); +int main(int argc, char **argv) { + MPI_Init(&argc, &argv); - // Get my rank and the number of ranks - MPI_Comm_rank(MPI_COMM_WORLD, &rank); - MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); + int rank; + MPI_Comm_rank(MPI_COMM_WORLD, &rank); - // Check that there are exactly two ranks - if( n_ranks != 2 ){ - printf("This example requires exactly two ranks\n"); - MPI_Finalize(); - return(1); - } + const int comm_tag = 1; + int numbers[ARRAY_SIZE] = {1, 2, 3}; + MPI_Status recv_status; - // Call the other rank the neighbour - if( rank == 0 ) { - neighbour = 1; - } else { - neighbour = 0; - } - - // Generate numbers to send - for( int i=0; i -#include +Sometimes `MPI_Send()` will actually make a copy of the buffer and return immediately. This generally happens only for short messages. Even when this happens, the actual transfer will not start before the receive is posted. -int main(int argc, char** argv) { - int rank, n_ranks, neighbour; - int n_numbers = 524288; - int send_message[n_numbers]; - int recv_message[n_numbers]; - MPI_Status status; +For this example, let’s have rank 0 send first, and rank 1 receive first. So all we need to do to fix this is to swap the send and receive for rank 1: - // First call MPI_Init - MPI_Init(&argc, &argv); - - // Get my rank and the number of ranks - MPI_Comm_rank(MPI_COMM_WORLD, &rank); - MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); - - // Generate numbers to send - for( int i=0; i "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -13,50 +11,55 @@ attribution: learningOutcomes: - Understand the different types of collective communication and their advantages. - Learn how to use collective communication functions. - --- -The previous episode showed how to send data from one rank to another using point-to-point communication. -If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple ranks, we have to manually loop over each rank to communicatethe data. -This type of communication, where multiple ranks talk to one another known as called *collective communication*. -In the code example below, point-to-point communication is used to calculate the sum of the rank numbers, +The previous episode showed how to send data from one rank to another using point-to-point communication. If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple ranks, we have to manually loop over each rank to communicate the data. This type of communication, where multiple ranks talk to one another known as called collective communication. In the code example below, point-to-point communication is used to calculate the sum of the rank numbers - feel free to try it out! 
```c -MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); -MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); +#include +#include -int sum; -MPI_Status status; +int main(int argc, char **argv) { + int my_rank, num_ranks; -/* Rank 0 is the "root" rank, where we'll receive data and sum it up */ -if (my_rank == 0) { - sum = my_rank; - /* Start by receiving the rank number from every rank, other than itself */ - for (int i = 1; i < num_ranks; ++i) { - int recv_num; - MPI_Recv(&recv_num, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status); - sum += recv_num; /* Increment sum */ - } - /* Now sum has been calculated, send it back to every rank other than the root */ - for (int i = 1; i < num_ranks; ++i) { - MPI_Send(&sum, 1, MPI_INT, i, 0, MPI_COMM_WORLD); + // First call MPI_Init + MPI_Init(&argc, &argv); + + MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); + MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); + + int sum; + MPI_Status status; + + // Rank 0 is the "root" rank, where we'll receive data and sum it up + if (my_rank == 0) { + sum = my_rank; + + // Start by receiving the rank number from every rank, other than itself + for (int i = 1; i < num_ranks; ++i) { + int recv_num; + MPI_Recv(&recv_num, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status); + sum += recv_num; + } + // Now sum has been calculated, send it back to every rank other than the root + for (int i = 1; i < num_ranks; ++i) { + MPI_Send(&sum, 1, MPI_INT, i, 0, MPI_COMM_WORLD); + } + } else { // All other ranks will send their rank number and receive sum */ + MPI_Send(&my_rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); + MPI_Recv(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); } -} else { /* All other ranks will send their rank number and receive sum */ - MPI_Send(&my_rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); - MPI_Recv(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); -} -printf("Rank %d has a sum of %d\n", my_rank, sum); + printf("Rank %d has a sum of %d\n", my_rank, sum); + + // Call finalise at the end + return MPI_Finalize(); +} ``` -For it's use case, the code above works perfectly fine. -However, it isn't very efficient when you need to communicate large amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). -It's also a lot of code to do not much, which makes it easy to introduce mistakes in our code. -A common mistake in this example would be to start the loop over ranks from 0, which would cause a deadlock! +For it's use case, the code above works perfectly fine. However, it isn't very efficient when you need to communicate large amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). It's also a lot of code to do not much, which makes it easy to introduce mistakes in our code. A common mistake in this example would be to start the loop over ranks from 0, which would cause a deadlock! It's actually quite a common mistake for new MPI users to write something like the above. -We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has access to collective communication functions to abstract all of this code for us. -The above code can be replaced by a single collective communication function. -Collection operations are also implemented far more efficiently in the MPI library than we could ever write using point-to-point communications. 
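For instance, here is a sketch of that broken variant of the root rank's receive loop, using the same variables as the example above; starting the loop at 0 makes rank 0 wait for a message from itself that is never sent, so `MPI_Recv()` never returns.

```c
if (my_rank == 0) {
    sum = my_rank;
    for (int i = 0; i < num_ranks; ++i) {  // bug: the loop should start at i = 1
        int recv_num;
        // When i == 0, rank 0 waits for a message from itself which never arrives
        MPI_Recv(&recv_num, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        sum += recv_num;
    }
}
```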
+We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has access to collective communication functions to abstract all of this code for us. The above code can be replaced by a single collective communication function. Collection operations are also implemented far more efficiently in the MPI library than we could ever write using point-to-point communications. There are several collective operations that are implemented in the MPI standard. The most commonly-used are: @@ -71,26 +74,20 @@ There are several collective operations that are implemented in the MPI standard ### Barrier -The most simple form of collective communication is a barrier. -Barriers are used to synchronise ranks by adding a point in a program where ranks *must* wait until all ranks have reached the same point. -A barrier is a collective operation because all ranks need to communicate with one another to know when they can leave the barrier. -To create a barrier, we use the `MPI_Barrier()` function, +The most simple form of collective communication is a barrier. Barriers are used to synchronise ranks by adding a point in a program where ranks *must* wait until all ranks have reached the same point. A barrier is a collective operation because all ranks need to communicate with one another to know when they can leave the barrier. To create a barrier, we use the `MPI_Barrier()` function, ```c int MPI_Barrier( - MPI_Comm communicator /* The communicator we want to add a barrier for */ + MPI_Comm comm ); ``` +| | | +|------ | ------- | +| comm: | The communicator to add a barrier to | -When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. -As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. -We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. +When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. -In practise, there are not that many practical use cases for a barrier in an MPI application. -In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. -But in MPI, where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. -However, one usecase is when multiple ranks need to write *sequentially* to the same file. -The code example below shows how you may handle this by using a barrier. +In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. However, one usecase is when multiple ranks need to write *sequentially* to the same file. 
The code example below shows how you may handle this by using a barrier. ```c for (int i = 0; i < num_ranks; ++i) { @@ -105,20 +102,25 @@ for (int i = 0; i < num_ranks; ++i) { ### Broadcast -We'll often find that we need to data from one rank to all the other ranks. -One approach, which is not very efficient, is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. -A far more efficient approach is to use the collective function `MPI_Bcast()` to *broadcast* the data from a root rank to every other rank. +We'll often find that we need to data from one rank to all the other ranks. One approach, which is not very efficient, is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. A far more efficient approach is to use the collective function `MPI_Bcast()` to *broadcast* the data from a root rank to every other rank. The `MPI_Bcast()` function has the following arguments, ```c int MPI_Bcast( - void* data, /* The data to be sent to all ranks */ - int count, /* The number of elements of data */ - MPI_Datatype datatype, /* The data type of the data */ - int root, /* The rank which the data should be sent from */ - MPI_Comm comm /* The communicator containing the ranks to broadcast to */ + void *data, + int count, + MPI_Datatype datatype, + int root, + MPI_Comm comm ); ``` +| | | +| ---- | ---- | +| `*data`: | The data to be sent to all ranks | +| `count`: | The number of elements of data | +| `datatype`: | The datatype of the data | +| `root`: | The rank which data will be sent from | +| `comm:` | The communicator containing the ranks to broadcast to | `MPI_Bcast()` is similar to the `MPI_Send()` function. The main functional difference is that `MPI_Bcast()` sends the data to all ranks (other than itself, where the data already is) instead of a single rank, as shown in the diagram below. @@ -169,13 +171,11 @@ int main(int argc, char **argv) { } MPI_Bcast(message, NUM_CHARS, MPI_CHAR, 0, MPI_COMM_WORLD); - printf("I'm rank %d and I got the message '%s'\n", my_rank, message); return MPI_Finalize(); } ``` - :::: ::::: @@ -192,17 +192,28 @@ We can use `MPI_Scatter()` to split the data into *equal* sized chunks and commu ```c int MPI_Scatter( - void* sendbuf, /* The data to be split across ranks (only important for the root rank) */ - int sendcount, /* The number of elements of data to send to each rank (only important for the root rank) */ - MPI_Datatype sendtype, /* The data type of the data being sent (only important for the root rank) */ - void* recvbuffer, /* A buffer to receive the data, including the root rank */ - int recvcount, /* The number of elements of data to receive, usually the same as sendcount */ - MPI_Datatype recvtype, /* The data types of the data being received, usually the same as sendtype */ - int root, /* The ID of the rank where data is being "scattered" from */ - MPI_Comm comm /* The communicator involved */ + void *sendbuff, + int sendcount, + MPI_Datatype sendtype, + void *recvbuffer, + int recvcount, + MPI_Datatype recvtype, + int root, + MPI_Comm comm ); ``` +| | | +| --- | --- | +| `*sendbuff`: | The data to be scattered across ranks (only important for the root rank) | +| `sendcount`: | The number of elements of data to send to each root rank (only important for the root rank) | +| `sendtype`: | The data type of the data being sent (only important for the root rank) | +| `*recvbuffer`: | A buffer to receive data into, including the root rank | +| `recvcount`: | The number of elements of data to receive. 
Usually the same as `sendcount` |
+| `recvtype`: | The data type of the data being received. Usually the same as `sendtype` |
+| `root`: | The rank data is being scattered from |
+| `comm`: | The communicator |

The data to be *scattered* is split into even chunks of size `sendcount`. If `sendcount` is 2 and `sendtype` is `MPI_INT`, then each rank will receive two integers. The values for `recvcount` and `recvtype` are the same as `sendcount` and `sendtype`.

@@ -237,16 +248,26 @@ We can do this by using the collection function `MPI_Gather()`, which has these

```c
int MPI_Gather(
- void* sendbuf, /* The data to be sent to the root rank */
- int sendcount, /* The number of elements of data to be sent */
- MPI_Datatype sendtype, /* The data type of the data to be sent */
- void* recvbuffer, /* The buffer to put the gathered data into (only important for the root rank) */
- int recvcount, /* Same as sendcount (only important for the root rank) */
- MPI_Datatype recvtype, /* Same as sendtype (import important for the root rank) */
- int root, /* The ID of the root rank, where data is being gathered to */
- MPI_Comm comm /* The communicator involved */
+ void *sendbuff,
+ int sendcount,
+ MPI_Datatype sendtype,
+ void *recvbuff,
+ int recvcount,
+ MPI_Datatype recvtype,
+ int root,
+ MPI_Comm comm
);
```

+| | |
+| --- | --- |
+| `*sendbuff`: | The data to send to the root rank |
+| `sendcount`: | The number of elements of data to send |
+| `sendtype`: | The data type of the data being sent |
+| `recvbuff`: | The buffer to put gathered data into (only important for the root rank) |
+| `recvcount`: | The number of elements being received, usually the same as `sendcount` |
+| `recvtype`: | The data type of the data being received, usually the same as `sendtype` |
+| `root`: | The root rank, where data will be gathered to |
+| `comm`: | The communicator |

The receive buffer needs to be large enough to hold the data from all of the ranks. For example, if there are 4 ranks sending 10 integers, then `recvbuffer` needs to be able to store *at least* 40 integers. We can think of `MPI_Gather()` as being the inverse of `MPI_Scatter()`.
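As a rough sketch of how the two operations combine (assuming exactly 4 ranks, with `my_rank` already set by `MPI_Comm_rank()` and the counts chosen purely for illustration), data can be scattered from a root rank, processed locally, and gathered back again:

```c
int send_data[8];   // only meaningful on the root rank (2 elements per rank, 4 ranks)
int recv_data[2];   // every rank receives 2 elements
int gathered[8];    // must hold 2 elements from each of the 4 ranks (root rank only)

if (my_rank == 0) {
    for (int i = 0; i < 8; ++i) {
        send_data[i] = i;
    }
}

// Split send_data into chunks of 2 integers and give one chunk to each rank
MPI_Scatter(send_data, 2, MPI_INT, recv_data, 2, MPI_INT, 0, MPI_COMM_WORLD);

for (int i = 0; i < 2; ++i) {
    recv_data[i] *= 2;   // each rank works on its own chunk
}

// Collect every rank's chunk back into gathered on the root rank
MPI_Gather(recv_data, 2, MPI_INT, gathered, 2, MPI_INT, 0, MPI_COMM_WORLD);
```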
@@ -326,16 +347,26 @@ Reduction operations can be done using the collection function `MPI_Reduce()`, w ```c int MPI_Reduce( - void* sendbuf, /* The data to be reduced on the root rank */ - void* recvbuffer, /* The buffer which will contain the reduction output */ - int count, /* The number of elements of data to be reduced */ - MPI_Datatype datatype, /* The data type of the data */ - MPI_Op op, /* The reduction operation to perform */ - int root, /* The root rank, to perform the reduction on */ - MPI_Comm comm /* The communicator where the reduction will be performed */ + void *sendbuff, + void *recvbuffer, + int count, + MPI_Datatype datatype, + MPI_Op op, + int root, + MPI_Comm comm ); ``` +| | | +| --- | --- | +| `*sendbuff`: | The data to be reduced by the root rank | +| `*recvbuffer`: | A buffer to contain the reduction output | +| `count`: | The number of elements of data to be reduced | +| `datatype`: | The data type of the data | +| `op`: | The reduction operation to perform | +| `root`: | The root rank, which will perform the reduction | +| `comm`: | The communicator | + The `op` argument controls which reduction operation is carried out, from the following possible operations: | Operation | Description | @@ -347,17 +378,16 @@ The `op` argument controls which reduction operation is carried out, from the fo | `MPI_MAXLOC` | Return the maximum value and the number of the rank that sent the maximum value. | | `MPI_MINLOC` | Return the minimum value of the number of the rank that sent the minimum value. | -In a reduction operation, each ranks sends a piece of data to the root rank, which are combined, depending on the choice of operation, into a single value on the root rank, as shown in the diagram below. -Since the data is sent and operation done on the root rank, it means the reduced value is only available on the root rank. +In a reduction operation, each ranks sends a piece of data to the root rank, which are combined, depending on the choice of operation, into a single value on the root rank, as shown in the diagram below. Since the data is sent and operation done on the root rank, it means the reduced value is only available on the root rank. ![Each rank sending a piece of data to root rank](fig/reduction.png) By using `MPI_Reduce()` and `MPI_Bcast()`, we can refactor the first code example into two collective functions: ```c -int sum; MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); +int sum; MPI_Reduce(&my_rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD); /* Using MPI_Bcast to send the reduced value to every rank */ ``` @@ -366,26 +396,30 @@ MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD); /* Using MPI_Bcast to send the ### Allreduce -In the code example just above, after the reduction operation we used `MPI_Bcast()` to communicate the result to every rank in the communicator. -This is a common pattern, so much so that there is a collective operation which does both in a -single function call: +In the code example just above, after the reduction operation we used `MPI_Bcast()` to communicate the result to every rank in the communicator. 
This is a common pattern, so much so that there is a collective operation which does both in a single function call: ```c int MPI_Allreduce( - void* sendbuf, /* The data to be reduced */ - void* recvbuffer, /* The buffer which will contain the reduction output */ - int count, /* The number of elements of data to be reduced */ - MPI_Datatype datatype, /* The data type of the data */ - MPI_Op op, /* The reduction operation to use */ - MPI_Comm comm  /* The communicator where the reduction will be performed */ + void *sendbuff, + void *recvbuffer, + int count, + MPI_Datatype datatype, + MPI_Op op, + MPI_Comm comm  ); ``` +| | | +| --- | --- | +| `*sendbuff`: | The data to be reduced, on all ranks | +| `*recvbuffer`: | A buffer which will contain the reduction output | +| `count`: | The number of elements of data to be reduced | +| `datatype`: | The data type of the data | +| `op`: | The reduction operation to use | +| `comm`: | The communicator | ![Each rank sending a piece of data to root rank](fig/allreduce.png) -`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. -This means we can remove the `MPI_Bcast()` in the previous code example and remove almost all of the code in the reduction example using point-to-point communication at the beginning of the episode. -This is shown in the following code example: +`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. This means we can remove the `MPI_Bcast()` in the previous code example and remove almost all of the code in the reduction example using point-to-point communication at the beginning of the episode. This is shown in the following code example: ```c int sum; @@ -400,8 +434,7 @@ MPI_Allreduce(&my_rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); ## In-Place Operations -In MPI, we can use in-place operations to eliminate the need for separate send and receive buffers in some collective operations. -We typically do this by using the `MPI_IN_PLACE` constant in place of the send buffer, as in the example below using `MPI_Allreduce()`: +In MPI, we can use in-place operations to eliminate the need for separate send and receive buffers in some collective operations. 
We typically do this by using the `MPI_IN_PLACE` constant in place of the send buffer, as in the example below using `MPI_Allreduce()`: ```c sum = my_rank; @@ -424,57 +457,57 @@ Modify the `find_sum` and `find_max` functions to work correctly in parallel usi #include // Calculate the sum of numbers in a vector -double find_sum( double * vector, int N ){ - double sum = 0; - for( int i=0; i max ){ - max = vector[i]; - } - } - return max; +double find_maximum(double *vector, int N) { + double max = 0; + for (int i = 0; i < N; ++i){ + if (vector[i] > max){ + max = vector[i]; + } + } + return max; } -int main(int argc, char** argv) { - int n_numbers = 1024; - int rank; - double vector[n_numbers]; - double sum, max; - double my_first_number; +int main(int argc, char **argv) { + int n_numbers = 1024; + int rank; + double vector[n_numbers]; + double sum, max; + double my_first_number; - // First call MPI_Init - MPI_Init(&argc, &argv); + // First call MPI_Init + MPI_Init(&argc, &argv); - // Get my rank - MPI_Comm_rank(MPI_COMM_WORLD, &rank); + // Get my rank + MPI_Comm_rank(MPI_COMM_WORLD, &rank); - // Each rank will have n_numbers numbers, - // starting from where the previous left off - my_first_number = n_numbers*rank; + // Each rank will have n_numbers numbers, + // starting from where the previous left off + my_first_number = n_numbers*rank; - // Generate a vector - for( int i=0; i max ){ - max = vector[i]; - } - } +double find_maximum(double *vector, int N) { + double max = 0; + double global_max; - // Call MPI_Allreduce to find the maximum over all the ranks - MPI_Allreduce( &max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD ); + // Calculate the sum on this rank as before + for (int i = 0; i < N; ++i){ + if (vector[i] > max){ + max = vector[i]; + } + } - return global_max; + // Call MPI_Allreduce to find the maximum over all the ranks + MPI_Allreduce(&max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD); + + return global_max; } ``` - :::: ::::: @@ -522,6 +555,6 @@ double find_maximum( double * vector, int N ){ ## More collective operations are available -The collective functions introduced in this episode do not represent an exhaustive list of *all* collective operations in MPI. -There are a number which are not covered, as their usage is not as common. You can usually find a list of the collective functions available for the implementation of MPI you choose to use, e.g. [Microsoft MPI documentation](https://learn.microsoft.com/en-us/message-passing-interface/mpi-collective-functions). +The collective functions introduced in this episode do not represent an exhaustive list of *all* collective operations in MPI. There are a number which are not covered, as their usage is not as common. You can usually find a list of the collective functions available for the implementation of MPI you choose to use, e.g. +[Microsoft MPI documentation](https://learn.microsoft.com/en-us/message-passing-interface/mpi-collective-functions). 
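One example is `MPI_Allgather()`, which behaves like `MPI_Gather()` followed by a broadcast, so every rank ends up with the full gathered array. A small sketch (assuming exactly 4 ranks and that `my_rank` has already been set) might look like this:

```c
int my_value = my_rank;   // each rank contributes a single integer
int all_values[4];        // assumes the communicator contains exactly 4 ranks

// After this call, every rank (not just a root rank) holds {0, 1, 2, 3}
MPI_Allgather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, MPI_COMM_WORLD);
```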
:::: diff --git a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md index 307810d5..58bf2d90 100644 --- a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md +++ b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md @@ -1,9 +1,7 @@ --- name: Non-blocking Communication -dependsOn: [ - high_performance_computing.hpc_mpi.05_collective_communication -] -tags: [] +dependsOn: [high_performance_computing.hpc_mpi.05_collective_communication] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -14,46 +12,39 @@ learningOutcomes: - Understand the advantages and disadvantages of non-blocking communication. - Use non-blocking MPI communication functions in a program. - Describe deadlock in an MPI program. - --- -In the previous episodes, we learnt how to send messages between two ranks or collectively to multiple ranks. -In both cases, we used blocking communication functions which meant our program wouldn't progress until data had been sent and received successfully. -It takes time, and computing power, to transfer data into buffers, to send that data around (over the network) and to receive the data into another rank. -But for the most part, the CPU isn't actually doing anything. +In the previous episodes, we learnt how to send messages between two ranks or collectively to multiple ranks. In both cases, we used blocking communication functions which meant our program wouldn’t progress until the communication had completed. It takes time and computing power to transfer data into buffers, to send that data around (over the network) and to receive the data into another rank. But for the most part, the CPU isn’t actually doing much at all during communication, when it could still be number crunching. ## Why bother with non-blocking communication? -Non-blocking communication is a communication mode, which allows ranks to continue working on other tasks, whilst data is transferred in the background. -When we use blocking communication, like `MPI_Send()`, `MPI_Recv()`, `MPI_Reduce()` and etc, execution is passed from our program to MPI and is not passed back until the communication has finished. -With non-blocking communication, the communication beings and control is passed back immediately. -Whilst the data is transferred in the background, our application is free to do other work. -This ability to *overlap* computation and communication is absolutely critical for good performance for many HPC applications. -The CPU is used very little when communicating data, so we are effectively wasting resources by not using them when we can. -With good use of non-blocking communication, we can continue to use the CPU whilst communication happens and, at the same time, hide/reduce some of the communication overhead by overlapping communication and computation. - -Reducing the communication overhead is incredibly important for the scalability of HPC applications, especially when we use lots of ranks. -As the number of ranks increases, the communication overhead to talk to every rank, naturally, also increases. -Blocking communication limits the scalability of our MPI applications, as it can, relatively speaking, take a long time to talk to lots of ranks. -But since with non-blocking communication ranks don't sit around waiting for a communication operation to finish, the overhead of talking to lots of reduced. 
-The asynchronous nature of non-blocking communication makes it more flexible, allowing us to write more sophisticated and performance communication algorithms. - -All of this comes with a price. -Non-blocking communication is more difficult to use *effectively*, and oftens results in more complex code. -Not only does it result in more code, but we also have to think about the structure and flow of our code in such a way there there is *other* work to do whilst data is being communicated. -Additionally, whilst we usually expect non-blocking communication to improve the performance, and scalability, of our parallel algorithms, it's not always clear cut or predictable if it can help. -If we are not careful, we may end up replacing blocking communication overheads with synchronization overheads. -For example, if one rank depends on the data of another rank and there is no other work to do, that rank will have to wait around until the data it needs is ready, as illustrated in the diagram below. +Non-blocking communication is communication which happens in the background. So we don’t have to let any CPU cycles go to waste! If MPI is dealing with the data transfer in the background, we can continue to use the CPU in the foreground and keep doing tasks whilst the communication completes. By *overlapping* computation with communication, we hide the latency/overhead of communication. This is critical for lots of HPC applications, especially when using lots of CPUs, because, as the number of CPUs increases, the overhead of communicating with them all also increases. If you use blocking synchronous sends, the time spent communicating data may become longer than the time spent creating data to send! All non-blocking communications are asynchronous, even when using synchronous sends, because the communication happens in the background, even though the communication cannot complete until the data is received. -![Non-blocking communication with data dependency](fig/non-blocking-wait-data.png) +::::callout + +## So, how do I use non-blocking communication? + +Just as with buffered, synchronous, ready and standard sends, MPI has to be programmed to use either blocking or non-blocking communication. For almost every blocking function, there is a non-blocking equivalent. They have the same name as their blocking counterpart, but prefixed with "I". The "I" stands for "immediate", indicating that the function returns immediately and does not block the program. The table below shows some examples of blocking functions and their non-blocking counterparts. + +| Blocking | Non-blocking | +| --------------- | ---------------- | +| `MPI_Bsend()` | `MPI_Ibsend()` | +| `MPI_Barrier()` | `MPI_Ibarrier()` | +| `MPI_Reduce()` | `MPI_Ireduce()` | + +But, this isn't the complete picture. As we'll see later, we need to do some additional bookkeeping to be able to use +non-blocking communications. +:::: -:::::challenge{id=advantages-and-disadvantages, title="Advantages and Disadvantages"} +By effectively utilizing non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. 
So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, to ensure ranks are synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. Therefore unless our code is structured to take advantage of being able to overlap communication with computation, non-blocking communication adds complexity to our code for no gain. -## Advantages and disadvantages +![Non-blocking communication with data dependency](fig/non-blocking-wait-data.png) + +::::challenge{id=advantages-and-disadvantages, title="Advantages and Disadvantages"} What are the main advantages of using non-blocking communication, compared to blocking? What about any disadvantages? -::::solution +:::solution Some of the advantages of non-blocking communication over blocking communication include: - Non-blocking communication gives us the ability to interleave communication with computation. @@ -66,9 +57,8 @@ On the other hand, some disadvantages are: - It is more difficult to use non-blocking communication. Not only does it result in more, and more complex, lines of code, we also have to worry about rank synchronisation and data dependency. - Whilst typically using non-blocking communication, where appropriate, improves performance, it's not always clear cut or predictable if non-blocking will result in sufficient performance gains to justify the increased complexity. - +::: :::: -::::: ## Point-to-point communication @@ -77,72 +67,79 @@ For example, if we take `MPI_Send()`, the non-blocking variant is `MPI_Isend()` ```c int MPI_Isend( - const void *buf, /* The data to be sent */ - int count, /* The number of elements of data to be sent */ - MPI_Datatype datatype, /* The datatype of the data */ - int dest, /* The rank to send data to */ - int tag, /* The communication tag */ - MPI_Comm comm, /* The communicator to use */ - MPI_Request *request, /* The communication request handle */ + void *buf, + int count, + MPI_Datatype datatype, + int dest, + int tag, + MPI_Comm comm, + MPI_Request *request ); ``` +| | | +|-------------|-----------------------------------------------------| +| `*buf`: | The data to be sent | +| `count`: | The number of elements of data | +| `datatype`: | The data types of the data | +| `dest`: | The rank to send data to | +| `tag`: | The communication tag | +| `comm`: | The communicator | +| `*request`: | The request handle, used to track the communication | + The arguments are identical to `MPI_Send()`, other than the addition of the `*request` argument. This argument is known as an *handle* (because it "handles" a communication request) which is used to track the progress of a (non-blocking) communication. -::::callout - -## Naming conventions - -Non-blocking functions have the same name as their blocking counterpart, but prefixed with "I". -The "I" stands for "immediate", indicating that the function returns immediately and does not block the program whilst data is being communicated in the background. 
The table below shows some examples of blocking functions and their non-blocking counterparts. - -| Blocking | Non-blocking| -| -------- | ----------- | -| `MPI_Bsend()` | `MPI_Ibsend()` | -| `MPI_Barrier()` | `MPI_Ibarrier()` | -| `MPI_Reduce()` | `MPI_Ireduce()` | -:::: - -When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise the program and make sure `*buf` is ready to be re-used. -This is incredibly important to do. +When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise +the program and make sure `*buf` is ready to be re-used. This is incredibly important to do. Suppose we are sending an array of integers, ```c -MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); +MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); some_ints[1] = 5; /* !!! don't do this !!! */ ``` - -Modifying `some_ints` before the send has completed is undefined behaviour, and can result in breaking results! -For example, if `MPI_Isend` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to the send buffer will means the wrong data is sent. -Every non-blocking communication has to have a corresponding `MPI_Wait()`, to wait and synchronise the program to ensure that the data being sent is ready to be modified again. -`MPI_Wait()` is a blocking function which will only return when a communication has finished. +Modifying `some_ints` before the send has completed is undefined behaviour, and can result in breaking results! For +example, if `MPI_Isend()` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to the send buffer will means the wrong data is sent. Every non-blocking communication has to have a corresponding `MPI_Wait()`, to wait and synchronise the program to ensure that the data being sent is ready to be modified again. `MPI_Wait()` is a blocking function which will only return when a communication has finished. ```c int MPI_Wait( - MPI_Request *request, /* The request handle for the communication to wait for */ - MPI_Status *status, /* The status handle for the communication */ + MPI_Request *request, + MPI_Status *status ); ``` +| | | +|-------------|----------------------| +| `*request`: | The request handle for the communication | +| `*status`: | The status handle for the communication | -Once we have used `MPI_Wait()` and the communication has finished, we can safely modify `some_ints` again. -To receive the data send using a non-blocking send, we can use either the blocking `MPI_Recv()` or it's non-blocking variant: +Once we have used `MPI_Wait()` and the communication has finished, we can safely modify `some_ints` again. To receive +the data send using a non-blocking send, we can use either the blocking `MPI_Recv()` or it's non-blocking variant. 
```c int MPI_Irecv( - void *buf, /* The buffer to receive data into */ - int count, /* The number of elements of data to receive */ - MPI_Datatype datatype, /* The datatype of the data being received */ - int source, /* The rank to receive data from */ - int tag, /* The communication tag */ - MPI_Comm comm, /* The communicator to use */ - MPI_Request *request, /* The communication request handle */ + void *buf, + int count, + MPI_Datatype datatype, + int source, + int tag, + MPI_Comm comm, + MPI_Request *request, ); ``` +| | | +|-------------|----------------------| +| `*buf`: | A buffer to receive data into | +| `count`: | The number of elements of data to receive | +| `datatype`: | The data type of the data | +| `source`: | The rank to receive data from | +| `tag`: | The communication tag | +| `comm`: | The communicator | +| `*request`: | The request handle for the receive | + + :::::challenge{id=true-or-false, title="True or False?"} -Is the following statement true or false? -Non-blocking communication guarantees immediate completion of data transfer. +Is the following statement true or false? Non-blocking communication guarantees immediate completion of data transfer. ::::solution **False**. Just because the communication function has returned, does not mean the communication has finished and the communication buffer is ready to be re-used or read from. @@ -162,19 +159,15 @@ int some_ints[5] = { 1, 2, 3, 4, 5 }; if (my_rank == 0) { MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); MPI_Wait(&request, &status); - some_ints[1] = 42; /* After MPI_Wait(), some_ints has been sent and can be modified again */ + some_ints[1] = 42; // After MPI_Wait(), some_ints has been sent and can be modified again } else { MPI_Irecv(recv_ints, 5, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); MPI_Wait(&request, &status); - int data_i_wanted = recv_ints[2]; /* recv_ints isn't guaranteed to have the correct data until after MPI_Wait()*/ + int data_i_wanted = recv_ints[2]; // recv_ints isn't guaranteed to have the correct data until after MPI_Wait() } ``` -The code above is functionally identical to blocking communication, because of `MPI_Wait()` is blocking. -The program will not continue until `MPI_Wait()` returns. -Since there is no additional work between the communication call and blocking wait, this is a poor example of how non-blocking communication should be used. -It doesn't take advantage of the asynchronous nature of non-blocking communication at all. -To really make use of non-blocking communication, we need to interleave computation (or any busy work we need to do) with communication, such as as in the next example. +The code above is functionally identical to blocking communication, because of `MPI_Wait()` is blocking. The program will not continue until `MPI_Wait()` returns. Since there is no additional work between the communication call and blocking wait, this is a poor example of how non-blocking communication should be used. It doesn't take advantage of the asynchronous nature of non-blocking communication at all. To really make use of non-blocking communication, we need to interleave computation (or any busy work we need to do) with communication, such as as in the next example. 
```c MPI_Status status; @@ -224,6 +217,7 @@ The non-blocking version of the code snippet may look something like this: ```c MPI_Request send_req, recv_req; + if (my_rank == 0) { MPI_Isend(numbers, 8, MPI_INT, 1, 0, MPI_COMM_WORLD, &send_req); MPI_Irecv(numbers, 8, MPI_INT, 1, 0, MPI_COMM_WORLD, &recv_req); @@ -234,16 +228,14 @@ if (my_rank == 0) { MPI_Status statuses[2]; MPI_Request requests[2] = { send_req, recv_req }; -MPI_Waitall(2, requests, statuses); /* Wait for both requests in one call */ +MPI_Waitall(2, requests, statuses); // Wait for both requests in one function ``` -This version of the code will not deadlock, because the non-blocking functions return immediately. -So even though rank 0 and 1 one both send, meaning there is no corresponding receive, the immediate return from send means the receive function is still called. -Thus a deadlock cannot happen. +This version of the code will not deadlock, because the non-blocking functions return immediately. So even though rank +0 and 1 one both send, meaning there is no corresponding receive, the immediate return from send means the +receive function is still called. Thus a deadlock cannot happen. -However, it is still possible to create a deadlock using `MPI_Wait()`. -If `MPI_Wait()` is waiting to for `MPI_Irecv()` to get some data, but there is no matching send operation (so no data has been sent), then `MPI_Wait()` can never return resulting in a deadlock. -In the example code below, rank 0 becomes deadlocked. +However, it is still possible to create a deadlock using `MPI_Wait()`. If `MPI_Wait()` is waiting to for `MPI_Irecv()` to get some data, but there is no matching send operation (so no data has been sent), then `MPI_Wait()` can never return resulting in a deadlock. In the example code below, rank 0 becomes deadlocked. ```c MPI_Status status; @@ -256,56 +248,53 @@ if (my_rank == 0) { MPI_Irecv(numbers, 8, MPI_INT, 0, 0, MPI_COMM_WORLD, &recv_req); } -MPI_Wait(&send_req, &status); /* Wait for both requests in one call */ -MPI_Wait(&recv_req, &status); /* Wait for both requests in one call */ +MPI_Wait(&send_req, &status); +MPI_Wait(&recv_req, &status); // Wait for both requests in one function ``` - :::: ::::: ## To wait, or not to wait -In some sense, by using `MPI_Wait()` we aren't fully non-blocking because we still block execution whilst we wait for communications to complete. -To be "truly" asynchronous we can use another function called `MPI_Test()` which, at face value, is the non-blocking counterpart of `MPI_Wait()`. -When we use `MPI_Test()`, it checks if a communication is finished and sets the value of a flag to true if it is and returns. -If a communication hasn't finished, `MPI_Test()` still returns but the value of the flag is false instead. `MPI_Test()` has the following arguments: +In some sense, by using `MPI_Wait()` we aren't fully non-blocking because we still block execution whilst we wait for communications to complete. To be "truly" asynchronous we can use another function called `MPI_Test()` which, at face value, is the non-blocking counterpart of `MPI_Wait()`. When we use `MPI_Test()`, it checks if a communication is finished and sets the value of a flag to true if it is and returns. If a communication hasn't finished, `MPI_Test()` still returns but the value of the flag is false instead. 
`MPI_Test()` has the following arguments: ```c int MPI_Test( - MPI_Request *request, /* The request handle for the communication to test */ - int *flag, /* A flag to indicate if the communication has completed - returned by pointer */ - MPI_Status *status, /* The status handle for the communication to test */ + MPI_Request *request, + int *flag, + MPI_Status *status, ); ``` +| | | +|-------------|------------------------------------------| +| `*request`: | The request handle for the communication | +| `*flag`: | A flag to indicate if the communication has completed | +| `*status`: | The status handle for the communication | -`*request` and `*status` are the same you'd use for `MPI_Wait()`. `*flag` is the variable which is modified to indicate if the communication has finished or not. -Since it's an integer, if the communication hasn't finished then `flag == 0`. +`*request` and `*status` are the same you'd use for `MPI_Wait()`. `*flag` is the variable which is modified to indicate if the communication has finished or not. Since it's an integer, if the communication hasn't finished then `flag == 0`. -We use `MPI_Test()` is much the same way as we'd use `MPI_Wait()`. -We start a non-blocking communication, and keep doing other, independent, tasks whilst the communication finishes. -The key difference is that since `MPI_Test()` returns immediately, we may need to make multiple calls before the communication is finished. -In the code example below, `MPI_Test()` is used within a `while` loop which keeps going until either the communication has finished or until there is no other work left to do. +We use `MPI_Test()` is much the same way as we'd use `MPI_Wait()`. We start a non-blocking communication, and keep doing other, independent, tasks whilst the communication finishes. The key difference is that since `MPI_Test()` returns immediately, we may need to make multiple calls before the communication is finished. In the code example below, `MPI_Test()` is used within a `while` loop which keeps going until either the communication has finished or until there is no other work left to do. ```c MPI_Status status; MPI_Request request; MPI_Irecv(recv_buffer, 16, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); -/* We need to define a flag, to track when the communication has completed */ +// We need to define a flag, to track when the communication has completed int comm_completed = 0; -/* One example use case is keep checking if the communication has finished, and continuing - to do CPU work until it has */ +// One example use case is keep checking if the communication has finished, and continuing +// to do CPU work until it has while (!comm_completed && work_still_to_do()) { do_some_other_work(); - /* MPI_Test will return flag == true when the communication has finished */ + // MPI_Test will return flag == true when the communication has finished MPI_Test(&request, &comm_completed, &status); } -/* If there is no more work and the communication hasn't finished yet, then we should wait - for it to finish */ +// If there is no more work and the communication hasn't finished yet, then we should wait +// for it to finish if (!comm_completed) { - MPI_Wait(&request, &status) + MPI_Wait(&request, &status); } ``` @@ -324,22 +313,22 @@ The most efficient, and, really, only practical, implementations use non-blockin Non-blocking communication gives us a lot of flexibility, letting us write complex communication algorithms to experiment and find the right solution. 
One example of that flexibility is using `MPI_Test()` to create a communication timeout algorithm.
 
 ```c
-#define COMM_TIMEOUT 60 /* seconds */
+#define COMM_TIMEOUT 60 // seconds
 clock_t start_time = clock();
 double elapsed_time = 0.0;
 int comm_completed = 0
->
+
 while (!comm_completed && elapsed_time < COMM_TIMEOUT) {
-    /* Check if communication completed */
+    // Check if the communication has completed
     MPI_Test(&request, &comm_completed, &status);
-    /* Update elapsed time */
+    // Update elapsed time
    elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;
 }
 
 if (elapsed_time >= COMM_TIMEOUT) {
-    MPI_Cancel(&request); /* Cancel the request to stop the, e.g. receive operation */
-    handle_communication_errors(); /* Put the program into a predictable state */
+    MPI_Cancel(&request); // Cancel the request to stop, e.g., the receive operation
+    handle_communication_errors(); // Put the program into a predictable state
 }
 ```
 
@@ -348,13 +337,12 @@ In reality, however, it would be hard to find a useful and appropriate use case
 In any case, though, it demonstrate the power and flexibility offered by non-blocking communication.
 ::::
 
-:::::challenge{id=try-it-yourself, title="Try It Yourself"}
+:::::challenge{id=try-it-yourself, title="Try it yourself"}
 In the MPI program below, a chain of ranks has been set up so one rank will receive a message from the rank to its left and send a message to the one on its right, as shown in the diagram below:
 
-[A chain of ranks](fig/rank_chain.png)
+![A chain of ranks](fig/rank_chain.png)
 
-For for following skeleton below, use non-blocking communication to send `send_message` to the right right and receive a message from the left rank.
-Create two programs, one using `MPI_Wait()` and the other using `MPI_Test()`.
+For the following skeleton below, use non-blocking communication to send `send_message` to the right rank and receive a message from the left rank. Create two programs, one using `MPI_Wait()` and the other using `MPI_Test()`.
 
 ```c
 #include 
 
@@ -372,7 +360,7 @@ int main(int argc, char **argv)
 
     if (num_ranks < 2) {
         printf("This example requires at least two ranks\n");
-        MPI_Abort(1);
+        MPI_Abort(MPI_COMM_WORLD, 1);
     }
 
     char send_message[MESSAGE_SIZE];
 
@@ -383,9 +371,7 @@ int main(int argc, char **argv)
 
     sprintf(send_message, "Hello from rank %d!", my_rank);
 
-    /*
-     * Your code goes here
-     */
+    // Your code goes here
 
     return MPI_Finalize();
 }
 
@@ -477,6 +463,17 @@ int MPI_Ireduce(
 );
 ```
 
+| | |
+|-------------|------------------------------------------|
+| `*sendbuf`: | The data to be reduced by the root rank |
+| `*recvbuf`: | A buffer to contain the reduction output |
+| `count`: | The number of elements of data to be reduced |
+| `datatype`: | The data type of the data |
+| `op`: | The reduction operation to perform |
+| `root`: | The root rank, which will perform the reduction |
+| `comm`: | The communicator |
+| `*request`: | The request handle for the communication |
+
 As with `MPI_Send()` vs. `MPI_Isend()` the only change in using the non-blocking variant of `MPI_Reduce()` is the addition of the `*request` argument, which returns a request handle.
 This is the request handle we'll use with either `MPI_Wait()` or `MPI_Test()` to ensure that the communication has finished, and been successful.
 The below code examples shows a non-blocking reduction:
@@ -501,8 +498,8 @@ How do you think the non-blocking variant, `MPI_Ibarrier()`, is used and how mig
 You may want to read the relevant [documentation](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node127.htm) first.
::::solution
-When a rank reaches a non-blocking barrier, `MPI_Ibarrier()` will return immediately whether other ranks have reached the barrier or not. The behaviour of the barrier we would expect is enforced at the next `MPI_Wait()` (or `MPI_Test()`) operation.
-`MPI_Wait()` will return once all the ranks have reached the barrier.
+When a rank reaches a non-blocking barrier, `MPI_Ibarrier()` will return immediately whether other ranks have reached the barrier or not. The behaviour of the barrier we would expect is enforced at the next `MPI_Wait()` (or `MPI_Test()`) operation. `MPI_Wait()` will return once all the ranks have reached the barrier.
+
 Non-blocking barriers can be used to help hide/reduce synchronisation overhead.
 We may want to add a synchronisation point in our program so the ranks start some work all at the same time.
 With a blocking barrier, the ranks have to wait for every rank to reach the barrier, and can't do anything other than wait.
@@ -538,9 +535,7 @@ int main(int argc, char **argv)
 
     printf("Start : Rank %d: my_num = %d sum = %d\n", my_rank, my_num, sum);
 
-    /*
-     * Your code goes here
-     */
+    // Your code goes here
 
     printf("End : Rank %d: my_num = %d sum = %d\n", my_rank, my_num, sum);
 
diff --git a/high_performance_computing/hpc_mpi/07-derived-data-types.md b/high_performance_computing/hpc_mpi/07-derived-data-types.md
new file mode 100644
index 00000000..9ff111aa
--- /dev/null
+++ b/high_performance_computing/hpc_mpi/07-derived-data-types.md
@@ -0,0 +1,366 @@
+---
+name: Derived Data Types
+dependsOn: [high_performance_computing.hpc_mpi.06_non_blocking_communication]
+tags: [mpi]
+attribution:
+  - citation: >
+      "Introduction to the Message Passing Interface" course by the Southampton RSG
+    url: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/
+    image: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/assets/img/home-logo.png
+    license: CC-BY-4.0
+learningOutcomes:
+  - Understand the problems of non-contiguous memory in MPI.
+  - Learn how to define and use derived data types.
+---
+
+We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're now dangerous enough to write a simple and successful MPI application. We've worked, so far, with simple data structures, such as single variables or small 1D arrays. In reality, any useful software we write will use more complex data structures, such as n-dimensional arrays, structures and other complex types. Working with these in MPI requires a bit more work to communicate them correctly and efficiently.
+
+To help with this, MPI provides an interface to create new types known as *derived data types*. A derived type acts as a way to enable the translation of complex data structures into instructions which MPI uses for efficient data access and communication. In this episode, we will learn how to use derived data types to send array vectors and sub-arrays.
+
+::::callout
+
+## Size limitations for messages
+
+All throughout MPI, the argument which says how many elements of data are being communicated is an integer: `int count`. In most 64-bit Linux systems, `int`s are usually 32-bit and so the biggest number you can pass to `count` is `2^31 - 1 = 2,147,483,647`, which is about 2 billion. Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In older MPI versions, there are two workarounds to this limitation.
The first is to communicate large arrays in smaller, more manageable chunks. The other is to use derived types, to re-shape the data.
+::::
+
+Almost all scientific and computing problems nowadays require us to think in more than one dimension. Using
+multi-dimensional arrays, such as for matrices or tensors, or discretising something onto a 2D or 3D grid of points
+are fundamental parts of a lot of software. However, the additional dimensions come with additional complexity,
+not just in the code we write, but also in how data is communicated.
+
+To create a 2 x 3 matrix, in C, and initialise it with some values, we use the following syntax,
+
+```c
+int matrix[2][3] = { {1, 2, 3}, {4, 5, 6} }; // matrix[rows][cols]
+```
+
+This creates an array with two rows and three columns. The first row contains the values `{1, 2, 3}` and the second row contains `{4, 5, 6}`. The number of rows and columns can be any value, as long as there is enough memory available.
+
+## The importance of memory contiguity
+
+When a sequence of things is contiguous, it means there are multiple adjacent things without anything in between them.
+In the context of MPI, when we talk about something being contiguous we are almost always talking about how arrays, and
+other complex data structures, are stored in the computer's memory. The elements in an array are contiguous when the
+next and previous elements are stored in adjacent memory locations.
+
+The memory space of a computer is linear. When we create a multi-dimensional array, the compiler and operating system
+decide how to map and store the elements onto that linear space. There are two ways to do this:
+[row-major or column-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The difference
+is which elements of the array are contiguous in memory. Arrays are row-major in C and column-major in Fortran.
+In a row-major array, the elements in each column of a row are contiguous, so element `x[i][j]` is
+preceded by `x[i][j - 1]` and is followed by `x[i][j +1]`. In Fortran, arrays are column-major so `x(i, j)` is
+followed by `x(i + 1, j)` and so on.
+
+The diagram below shows how a 4 x 4 matrix is mapped onto a linear memory space, for a row-major array. At the top of
+the diagram is the representation of the linear memory space, where each number is the ID of the element in memory. Below
+that are two representations of the array in 2D: the left shows the coordinate of each element and the right shows the
+ID of the element.
+
+![Column memory layout in C](fig/c_column_memory_layout.png)
+
+The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous in linear memory. The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). Columns in row-major arrays are contiguous. The next diagram instead shows how elements in adjacent rows are mapped in memory.
+
+![Row memory layout in C](fig/c_row_memory_layout.png)
+
+Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, we can see that the elements are not contiguous. Element `[0][1]` maps to element 2 and element `[1][1]` maps to element 6 and so on. Elements in the same column but in a different row are separated by four other elements, in this example. In other words, elements in other rows are not contiguous.
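+
+As a quick way to see this for yourself, a minimal sketch like the one below prints the addresses of a few elements of a 4 x 4 `int` array. On most systems, neighbouring elements in a row are `sizeof(int)` bytes apart, whereas elements in the same column of adjacent rows are a whole row, `4 * sizeof(int)` bytes, apart; the exact addresses printed will vary between runs.
+
+```c
+#include <stdio.h>
+
+int main(void)
+{
+    int matrix[4][4];
+
+    // Adjacent elements in the same row sit next to each other in memory
+    printf("&matrix[1][0] = %p\n", (void *) &matrix[1][0]);
+    printf("&matrix[1][1] = %p\n", (void *) &matrix[1][1]);
+
+    // Elements in the same column but adjacent rows are a whole row apart
+    printf("&matrix[0][1] = %p\n", (void *) &matrix[0][1]);
+    printf("&matrix[1][1] = %p\n", (void *) &matrix[1][1]);
+
+    return 0;
+}
+```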
+
+:::::challenge{id=memory-contiguity, title="Does memory contiguity affect performance?"}
+
+Do you think memory contiguity could impact the performance of our software, in a negative way?
+
+::::solution
+
+Yes, memory contiguity can affect how fast our programs run. When data is stored in a neat and organised way, the computer can find and use it quickly. But if the data is scattered around randomly (fragmented), it takes more time to locate and use it, which decreases performance. Keeping our data and data access patterns organised can make our programs faster. But we probably won't notice the difference for small arrays and data structures.
+::::
+:::::
+
+::::callout
+
+## What about if I use `malloc()`?
+
+More often than not, we will see `malloc()` being used to allocate memory for arrays. Especially if the code is using an older standard, such as C90, which does not support
+[variable length arrays](https://en.wikipedia.org/wiki/Variable-length_array). When we use `malloc()`, we get a contiguous array of elements. To create a 2D array using `malloc()`, we have to first create an array of pointers (which are contiguous) and allocate memory for each pointer:
+
+```c
+int num_rows = 3, num_cols = 5;
+
+float **matrix = malloc(num_rows * sizeof(float*)); // Each pointer is the start of a row
+for (int i = 0; i < num_rows; ++i) {
+    matrix[i] = malloc(num_cols * sizeof(float)); // Here we allocate memory to store the column elements for row i
+}
+
+for (int i = 0; i < num_rows; ++i) {
+    for (int j = 0; j < num_cols; ++j) {
+        matrix[i][j] = 3.14159; // Indexing is done as matrix[rows][cols]
+    }
+}
+```
+
+There is one problem though. `malloc()` does not guarantee that subsequently allocated memory will be contiguous. When `malloc()` requests memory, the operating system will assign whatever memory is free. This is not always next to the block of memory from the previous allocation. This makes life tricky, since data *has* to be contiguous for MPI communication. But there are workarounds. One is to only use 1D arrays (with the same number of elements as the higher dimension array) and to map the n-dimensional coordinates into a linear coordinate system. For example, the element
+`[2][4]` in a 3 x 5 matrix would be accessed as,
+
+```c
+int index_for_2_4 = matrix1d[5 * 2 + 4]; // num_cols * row + col
+```
+
+Another solution is to move memory around so that it is contiguous, such as in [this example](code/examples/07-malloc-trick.c) or by using a more sophisticated function such as the [`arralloc()` function](code/arralloc.c) (not part of the standard library) which can allocate arbitrary n-dimensional arrays into a contiguous block.
+::::
+
+For a row-major array, we can send the elements of a single row (for a 4 x 4 matrix) easily,
+
+```c
+MPI_Send(&matrix[1][0], 4, MPI_INT ...);
+```
+
+The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. As the columns are four elements long, we have specified to only send four integers. Even though we're working here with a 2D array, sending a single row of the matrix is the same as sending a 1D array. Instead of using a pointer to the start of the array, an address to the first element of the row (`&matrix[1][0]`) is used instead. It's not possible to do the same for a column of the matrix, because the elements down the column are not contiguous.
+
+## Using vectors to send slices of an array
+
+To send a column of a matrix or array, we have to use a *vector*.
A vector is a derived data type that represents multiple (or one) contiguous sequences of elements, which have a regular spacing between them. By using vectors, we can create data types for column vectors, row vectors or sub-arrays, similar to how we can +[create slices for Numpy arrays in Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, communication. +To create a vector, we create a new data type using `MPI_Type_vector()`, + +```c +int MPI_Type_vector( + int count, + int blocklength, + int stride, + MPI_Datatype oldtype, + MPI_Datatype *newtype +); +``` +| | | +| --- | --- | +| `count`: | The number of "blocks" which make up the vector | +| `blocklength`: | The number of contiguous elements in a block | +| `stride`: | The number of elements between the start of each block | +| `oldtype`: | The data type of the elements of the vector, e.g. MPI_INT, MPI_FLOAT | +| `newtype`: | The newly created data type to represent the vector | + +To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix +with a row in between (rows 2 and 4), + +![How a vector is laid out in memory"](fig/vector_linear_memory.png) + + +A *block* refers to a sequence of contiguous elements. In the diagrams above, each sequence of contiguous purple or +orange elements represents a block. The *block length* is the number of elements within a block; in the above this is +four. The *stride* is the distance between the start of each block, which is eight in the example. The count is the +number of blocks we want. When we create a vector, we're creating a new derived data type which includes one or more +blocks of contiguous elements. + +::::callout + +## Why is this functionality useful? + +The advantage of using derived types to send vectors is to streamline and simplify communication of complex and non-contiguous data. They are most commonly used where there are boundary regions between MPI ranks, such as in simulations using domain decomposition (see the optional Common Communication Patterns episode for more detail), irregular meshes or composite data structures (covered in the optional Advanced Data Communication episode). +:::: + +Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. This finalises the creation of a derived type. Forgetting to do this step leads to unexpected behaviour, and potentially disastrous consequences! + +```c +int MPI_Type_commit( + MPI_Datatype *datatype // The data type to commit - note that this is a pointer +); +``` + +When a data type is committed, resources which store information on how to handle it are internally allocated. This contains data structures such as memory buffers as well as data used for bookkeeping. Failing to free those resources after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. To free up the resources, we use `MPI_Type_free()`, + +```c +int MPI_Type_free ( + MPI_Datatype *datatype // The data type to clean up -- note this is a pointer +); +``` + +The following example code uses a vector to send two rows from a 4 x 4 matrix, as in the example diagram above. 
+
+```c
+// The vector is a MPI_Datatype
+MPI_Datatype rows_type;
+
+// Create the vector type
+const int count = 2;
+const int blocklength = 4;
+const int stride = 8;
+MPI_Type_vector(count, blocklength, stride, MPI_INT, &rows_type);
+
+// Don't forget to commit it
+MPI_Type_commit(&rows_type);
+
+// Send the second and fourth rows of our 2d matrix array. Note that we are sending
+// &matrix[1][0] and not matrix. This is because we are using an offset
+// to change the starting point of where we begin sending memory
+int matrix[4][4] = {
+    { 1,  2,  3,  4},
+    { 5,  6,  7,  8},
+    { 9, 10, 11, 12},
+    {13, 14, 15, 16},
+};
+
+if (my_rank == 0) {
+    MPI_Send(&matrix[1][0], 1, rows_type, 1, 0, MPI_COMM_WORLD);
+} else {
+    // The receive function doesn't "work" with vector types, so we have to
+    // say that we are expecting 8 integers instead
+    const int num_elements = count * blocklength;
+    int recv_buffer[num_elements];
+    MPI_Recv(recv_buffer, num_elements, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+}
+
+// The final thing to do is to free the new data type when we no longer need it
+MPI_Type_free(&rows_type);
+```
+
+There are two things above, which look quite innocent, but are important to understand. First of all, the send buffer in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. In `MPI_Send()`, the send buffer is a pointer to the memory location where the start of the data is stored. In the above example, the intention is to only send the second and fourth rows, so the start location of the data to send is the address for element `[1][0]`. If we used `matrix`, the first and third rows would be sent instead.
+
+The other thing to notice, which is perhaps not immediately obvious, is that the receive data type is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This is because when a rank receives data, the data is a contiguous array. We don't need to use a vector to describe the layout of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers.
+
+::::challenge{id=sending-columns, title="Sending columns from an array"}
+
+Create a vector type to send a column in the following 2 x 3 array:
+
+```c
+int matrix[2][3] = {
+    {1, 2, 3},
+    {4, 5, 6},
+};
+```
+
+With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 to rank 1 and print the results. You may want to use [this code](code/solutions/skeleton-example.c) as your starting point.
+
+:::solution
+
+If your solution is correct you should see 2 and 5 printed to the screen. In the solution below, to send a 2 x 1 column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. To send the correct column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. To see why the stride is 3, take a look at the diagram below,
+
+![Stride example for question](fig/stride_example_2x3.png)
+
+You can see that there are *three* contiguous elements between the start of each block of 1.
+ +```c +#include +#include + +int main(int argc, char **argv) +{ + int my_rank; + int num_ranks; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); + MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); + + int matrix[2][3] = { + {1, 2, 3}, + {4, 5, 6}, + }; + + if (num_ranks != 2) { + if (my_rank == 0) { + printf("This example only works with 2 ranks\n"); + } + MPI_Abort(MPI_COMM_WORLD, 1); + } + + MPI_Datatype col_t; + MPI_Type_vector(2, 1, 3, MPI_INT, &col_t); + MPI_Type_commit(&col_t); + + if (my_rank == 0) { + MPI_Send(&matrix[0][1], 1, col_t, 1, 0, MPI_COMM_WORLD); + } else { + int buffer[2]; + MPI_Status status; + + MPI_Recv(buffer, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); + + printf("Rank %d received the following:", my_rank); + for (int i = 0; i < 2; ++i) { + printf(" %d", buffer[i]); + } + printf("\n"); + } + + MPI_Type_free(&col_t); + + return MPI_Finalize(); +} +``` +::: +:::: + +::::challenge{id=sending-subarrays, title="Sending sub-arrays of an array"} + +By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank +1, + +```c +int matrix[4][4] = { + { 1, 2, 3, 4}, + { 5, 6, 7, 8}, + { 9, 10, 11, 12}, + {13, 14, 15, 16} +}; +``` +You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with the 4 x 4 matrix above and modifying the vector type and communication functions as required. + +:::solution + +The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. In the solution below, we have created a vector with a count and block length of 2 and with a stride of 4. The first two arguments means two vectors of block length 2 will be sent. The stride of 4 results from that there are 4 elements between the start of each distinct block as shown in the image below, + +![Stride example for subarray question](fig/stride_example_4x4.png) + +You must always remember to send the address for the starting point of the *first* block as the send buffer, which +is why `&array[1][1]` is the first argument in `MPI_Send()`. 
+ +```c +#include +#include + +int main(int argc, char **argv) +{ + int matrix[4][4] = { + { 1, 2, 3, 4}, + { 5, 6, 7, 8}, + { 9, 10, 11, 12}, + {13, 14, 15, 16} + }; + + int my_rank; + int num_ranks; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); + MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); + + if (num_ranks != 2) { + if (my_rank == 0) { + printf("This example only works with 2 ranks\n"); + } + MPI_Abort(MPI_COMM_WORLD, 1); + } + + MPI_Datatype sub_array_t; + MPI_Type_vector(2, 2, 4, MPI_INT, &sub_array_t); + MPI_Type_commit(&sub_array_t); + + if (my_rank == 0) { + MPI_Send(&matrix[1][1], 1, sub_array_t, 1, 0, MPI_COMM_WORLD); + } else { + int buffer[4]; + MPI_Status status; + + MPI_Recv(buffer, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); + + printf("Rank %d received the following:", my_rank); + for (int i = 0; i < 4; ++i) { + printf(" %d", buffer[i]); + } + printf("\n"); + } + + MPI_Type_free(&sub_array_t); + + return MPI_Finalize(); +} +``` +::: +:::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/07_advanced_communication.md b/high_performance_computing/hpc_mpi/07_advanced_communication.md deleted file mode 100644 index 263da0a9..00000000 --- a/high_performance_computing/hpc_mpi/07_advanced_communication.md +++ /dev/null @@ -1,933 +0,0 @@ ---- -name: Advanced Communication Techniques -dependsOn: [ - high_performance_computing.hpc_mpi.06_non_blocking_communication -] -tags: [] -attribution: - - citation: > - "Introduction to the Message Passing Interface" course by the Southampton RSG - url: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/ - image: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/assets/img/home-logo.png - license: CC-BY-4.0 -learningOutcomes: - - Understand the problems of non-contiguous memory in MPI. - - Know how to define and use derived datatypes. - ---- - -We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're now dangerous enough to write a simple and successful MPI application. -We've worked, so far, with simple data structures, such as single variables or small 1D arrays. -In reality, any useful software we write will use more complex data structures, such as structures, n-dimensional arrays and other complex types. -Working with these in MPI require a bit more work to communicate them correctly and efficiently. - -To help with this, MPI provides an interface to create new types known as *derived datatypes*. -A derived type acts as a way to enable the translation of complex data structures into instructions which MPI uses for efficient data access -communication. - -::::callout - -## Size limitations for messages - -All throughout MPI, the argument which says how many elements of data are being communicated is an integer: `int count`. -In most 64-bit Linux systems, `int`'s are usually 32-bit and so the biggest number you can pass to `count` is `2^31 - 1 = 2,147,483,647`, which is about 2 billion. -Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. -In older MPI versions, there are two workarounds to this limitation. -The first is to communicate large arrays in smaller, more manageable chunks. -The other is to use derived types, to re-shape the data. -:::: - -## Multi-dimensional arrays - -Almost all scientific and computing problems nowadays require us to think in more than one dimension. 
-Using multi-dimensional arrays, such for matrices or tensors, or discretising something onto a 2D or 3D grid of points are fundamental parts for most scientific software. -However, the additional dimensions comes with additional complexity, not just in the code we write, but also in how data is communicated. - -To create a 2 x 3 matrix, in C, and initialize it with some values, we use the following syntax: - -```c -int matrix[2][3] = { {1, 2, 3}, {4, 5, 6} }; // matrix[rows][cols] -``` - -This creates an array with two rows and three columns. -The first row contains the values `{1, 2, 3}` and the second row contains `{4, 5, 6}`. The number of rows and columns can be any value, as long as there is enough memory available. - -### The importance of memory contiguity - -When a sequence of things is contiguous, it means there are multiple adjacent things without anything in between them. -In the context of MPI, when we talk about something being contiguous we are almost always talking about how arrays, and other complex data structures, are stored in the computer's memory. -The elements in an array are contiguous when the next, or previous, element are stored in the adjacent memory location. - -The memory space of a computer is linear. -When we create a multi-dimensional array, the compiler and operating system decide how to map and store the elements into that linear space. There are two ways to do this: [row-major or column-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order). -The difference is which elements of the array are contiguous in memory. -Arrays are row-major in C and column-major in Fortran. -In a row-major array, the elements in each column of a row are contiguous, so element `x[i][j]` is preceded by `x[i][j - 1]` and is followed by `x[i][j +1]`. -In Fortran, arrays are column-major so `x(i, j)` is followed by `x(i + 1, j)` and so on. - -The diagram below shows how a 4 x 4 matrix is mapped onto a linear memory space, for a row-major array. -At the top of the diagram is the representation of the linear memory space, where each number is ID of the element in memory. -Below that are two representations of the array in 2D: the left shows the coordinate of each element and the right shows the ID of the element. - -![Column memory layout in C](fig/c_column_memory_layout.png) - -The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous in linear memory. -The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). -Columns in row-major arrays are contiguous. -The next diagram instead shows how elements in adjacent rows are mapped in memory. - -![Row memory layout in C](fig/c_row_memory_layout.png) - -Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, we can see that the elements are not contiguous. -Element `[0][1]` maps to element 2 and element `[1][1]` maps to element 6 and so on. -Elements in the same column but in a different row are separated by four other elements, in this example. -In other words, elements in other rows are not contiguous. - -:::::challenge{id=memory-contiguity-performance, title="Does memory contiguity affect performance?"} -Do you think memory contiguity could impact the performance of our software, in a negative way? - -::::solution -Yes, memory contiguity can affect how fast our programs run. -When data is stored in a neat and organized way, the computer can find and use it quickly. 
-But if the data is scattered around randomly (fragmented), it takes more time to locate and use it, which decreases performance. -Keeping our data and data access patterns organized can make our programs faster. -But we probably won't notice the difference for small arrays and data structures. -:::: -::::: - -::::callout - -## What about if I use `malloc()`? - -More often than not, we will see `malloc()` being used to allocate memory for arrays. -Especially if the code is using an older standard, such as C90, which does not support [variable length arrays](https://en.wikipedia.org/wiki/Variable-length_array). -When we use `malloc()`, we get a contiguous array of elements. -To create a 2D array using `malloc()`, we have to first create an array of pointers (which are contiguous) and allocate memory for each pointer: - -```c -int num_rows = 3, num_cols = 5; - -float **matrix = malloc(num_rows * sizeof(float*)); /* Each pointer is the start of a row */ -for (int i = 0; i < num_rows; ++i) { - matrix[i] = malloc(num_cols * sizeof(float)); /* Here we allocate memory to store the column elements for row i */ -} - -for (int i = 0; i < num_rows; ++i) { - for (int j = 0; i < num_cols; ++j) { - matrix[i][j] = 3.14159; /* Indexing is done as matrix[rows][cols] */ - } -} -``` - -There is one problem though. `malloc()` *does not* guarantee that subsequently allocated memory will be contiguous. -When `malloc()` requests memory, the operating system will assign whatever memory is free. -This is not always next to the block of memory from the previous allocation. -This makes life tricky, since data *has* to be contiguous for MPI communication. -But there are workarounds. -One is to only use 1D arrays (with the same number of elements as the higher dimension array) and to map the n-dimensional coordinates into a linear coordinate system. -For example, the element `[2][4]` in a 3 x 5 matrix would be accessed as: - -```c -int index_for_2_4 = matrix1d[5 * 2 + 4]; // num_cols * row + col -``` - -Another solution is to move memory around so that it is contiguous, such as in [this example](code/examples/07-malloc-trick.c) or by using a more sophisticated function such as [`arralloc()` function](code/arralloc.c) (not part of the standard library) which can allocate arbitrary n-dimensional arrays into a contiguous block. -:::: - -For a row-major array, we can send the elements of a single row (for a 4 x 4 matrix) easily: - -```c -MPI_Send(&matrix[1][0], 4, MPI_INT ...); -``` - -The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. -As the columns are four elements long, we have specified to only send four integers. -Even though we're working here with a 2D array, sending a single row of the matrix is the same as sending a 1D array. -Instead of using a pointer to the start of the array, an address to the first element of the row (`&matrix[1][0]`) is used instead. -It's not possible to do the same for a column of the matrix, because the elements down the column are not contiguous. - -### Using vectors to send slices of an array - -To send a column of a matrix, we have to use a *vector*. -A vector is a derived datatype that represents multiple (or one) contiguous sequences of elements, which have a regular spacing between them. 
-By using vectors, we can create data types for column vectors, row vectors or sub-arrays, similar to how we can [create slices for Numpy arrays in Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, communication. -To create a vector, we create a new datatype using `MPI_Type_vector()`: - -```c -int MPI_Type_vector( - int count, /* The number of 'blocks' which makes up the vector */ - int blocklength, /* The number of contiguous elements in a block */ - int stride, /* The number of elements between the start of each block */ - MPI_Datatype oldtype, /* The datatype of the elements of the vector, e.g. MPI_INT, MPI_FLOAT */ - MPI_Datatype *newtype /* The new datatype which represents the vector - note that this is a pointer */ -); -``` - -To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix with a row in between (rows 2 and 4): - -![How a vector is laid out in memory](fig/vector_linear_memory.png) - -A *block* refers to a sequence of contiguous elements. -In the diagrams above, each sequence of contiguous purple or orange elements represents a block. -The *block length* is the number of elements within a block; in the above this is four. -The *stride* is the distance between the start of each block, which is eight in the example. -The count is the number of blocks we want. -When we create a vector, we're creating a new derived datatype which includes one or more blocks of contiguous elements. - -Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. -This finalises the creation of a derived type. -Forgetting to do this step leads to unexpected behaviour, and potentially disastrous consequences! - -```c -int MPI_Type_commit( - MPI_Datatype *datatype /* The datatype to commit - note that this is a pointer */ -); -``` - -When a datatype is committed, resources which store information on how to handle it are internally allocated. -This contains data structures such as memory buffers as well as data used for bookkeeping. -Failing to free those resources after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. -To free up the resources, we use `MPI_Type_free()`, - -```c -int MPI_Type_free ( - MPI_Datatype *datatype /* The datatype to clean up -- note this is a pointer */ -); -``` - -The following example code uses a vector to send two rows from a 4 x 4 matrix, as in the example diagram above. - -```c -/* The vector is a MPI_Datatype */ -MPI_Datatype rows_type; - -/* Create the vector type */ -const int count = 2; -const int blocklength = 4; -const int stride = 8; -MPI_Type_vector(count, blocklength, stride, MPI_INT, &rows_type); - -/* Don't forget to commit it */ -MPI_Type_commit(&rows_type); - -/* Send the middle row of our 2d send_buffer array. Note that we are sending - &send_buffer[1][0] and not send_buffer. 
This is because we are using an offset - to change the starting point of where we begin sending memory */ -int matrix[4][4] = { - { 1, 2, 3, 4}, - { 5, 6, 7, 8}, - { 9, 10, 11, 12}, - {13, 14, 15, 16}, -}; - -MPI_Send(&matrix[1][0], 1, rows_type, 1, 0, MPI_COMM_WORLD); - -/* The receive function doesn't "work" with vector types, so we have to - say that we are expecting 8 integers instead */ -const int num_elements = count * blocklength; -int recv_buffer[num_elements]; -MPI_Recv(recv_buffer, num_elements, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - -/* The final thing to do is to free the new datatype when we no longer need it */ -MPI_Type_free(&rows_type); -``` - -There are two things above, which look quite innocent, but are important to understand. -First of all, the send buffer in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. -In `MPI_Send()`, the send buffer is a pointer to the memory location where the start of the data is stored. -In the above example, the intention is to only send the second and forth rows, so the start location of the data to send is the address for element `[1][0]`. -If we used `matrix`, the first and third rows would be sent instead. - -The other thing to notice, which is not immediately clear why it's done this way, is that the receive datatype is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. -This is because when a rank receives data, the data is contiguous array. -We don't need to use a vector to describe the layout of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. - -:::::challenge{id=sending-columns, title="Sending columns from an array"} -Create a vector type to send a column in the following 2 x 3 array: - -```c -int matrix[2][3] = { - {1, 2, 3}, - {4, 5, 6}, -}; -``` - -With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 to rank 1 and print the results. -You may want to use [this code](code/solutions/skeleton-example.c) as your starting point. - -::::solution -If your solution is correct you should see 2 and 5 printed to the screen. -In the solution below, to send a 2 x 1 column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. -To send the correct column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. -To see why the stride is 3, take a look at the diagram below: - -![Stride example for question](fig/stride_example_2x3.png) - -You can see that there are *three* contiguous elements between the start of each block of 1. 
- -```c -#include -#include - -int main(int argc, char **argv) -{ - int my_rank; - int num_ranks; - MPI_Init(&argc, &argv); - MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); - MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); - - int matrix[2][3] = { - {1, 2, 3}, - {4, 5, 6}, - }; - - if (num_ranks != 2) { - if (my_rank == 0) { - printf("This example only works with 2 ranks\n"); - } - MPI_Abort(MPI_COMM_WORLD, 1); - } - - MPI_Datatype col_t; - MPI_Type_vector(2, 1, 3, MPI_INT, &col_t); - MPI_Type_commit(&col_t); - - if (my_rank == 0) { - MPI_Send(&matrix[0][1], 1, col_t, 1, 0, MPI_COMM_WORLD); - } else { - int buffer[2]; - MPI_Status status; - - MPI_Recv(buffer, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); - - printf("Rank %d received the following:", my_rank); - for (int i = 0; i < 2; ++i) { - printf(" %d", buffer[i]); - } - printf("\n"); - } - - MPI_Type_free(&col_t); - return MPI_Finalize(); -} -``` - -:::: -::::: - -:::::challenge{id=sending-sub-arrays, title="Sending Sub-Arrays of an Array"} -By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank 1: - -```c -int matrix[4][4] = { - { 1, 2, 3, 4}, - { 5, 6, 7, 8}, - { 9, 10, 11, 12}, - {13, 14, 15, 16} -}; -``` - -You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with the 4 x 4 matrix above and modifying the vector type and communication functions as required. - -::::solution -The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. -In the solution below, we have created a vector with a count and block length of 2 and with a stride of 4. -The first two arguments means two vectors of block length 2 will be sent. -The stride of 4 results from that there are 4 elements between the start of each distinct block as shown in the image below: - -![Stride example for question](fig/stride_example_4x4.png) - -You must always remember to send the address for the starting point of the *first* block as the send buffer, which is why `&array[1][1]` is the first argument in `MPI_Send()`. - -```c -#include -#include - -int main(int argc, char **argv) -{ - int matrix[4][4] = { - { 1, 2, 3, 4}, - { 5, 6, 7, 8}, - { 9, 10, 11, 12}, - {13, 14, 15, 16} - }; - - int my_rank; - int num_ranks; - MPI_Init(&argc, &argv); - MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); - MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); - - if (num_ranks != 2) { - if (my_rank == 0) { - printf("This example only works with 2 ranks\n"); - } - MPI_Abort(MPI_COMM_WORLD, 1); - } - - MPI_Datatype sub_array_t; - MPI_Type_vector(2, 2, 4, MPI_INT, &sub_array_t); - MPI_Type_commit(&sub_array_t); - - if (my_rank == 0) { - MPI_Send(&matrix[1][1], 1, sub_array_t, 1, 0, MPI_COMM_WORLD); - } else { - int buffer[4]; - MPI_Status status; - - MPI_Recv(buffer, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); - - printf("Rank %d received the following:", my_rank); - for (int i = 0; i < 4; ++i) { - printf(" %d", buffer[i]); - } - printf("\n"); - } - - MPI_Type_free(&sub_array_t); - - return MPI_Finalize(); -} -``` - -:::: -::::: - -## Structures in MPI - -Structures, commonly known as structs, are custom datatypes which contain multiple variables of (usually) different types. -Some common use cases of structs, in scientific code, include grouping together constants or global variables, or they are used to represent a physical thing, such as a particle, or something more abstract like a cell on a simulation grid. 
-When we use structs, we can write clearer, more concise and better structured code. - -To communicate a struct, we need to define a derived datatype which tells MPI about the layout of the struct in memory. -Instead of `MPI_Type_create_vector()`, for a struct, we use, `MPI_Type_create_struct()`: - -```c -int MPI_Type_create_struct( - int count, /* The number of members/fields in the struct */ - int *array_of_blocklengths, /* The length of the members/fields, as you would use in MPI_Send */ - MPI_Aint *array_of_displacements, /* The relative positions of each member/field in bytes */ - MPI_Datatype *array_of_types, /* The MPI type of each member/field */ - MPI_Datatype *newtype, /* The new derived datatype */ -); -``` - -The main difference between vector and struct derived types is that the arguments for structs expect arrays, since structs are made up of multiple variables. -Most of these arguments are straightforward, given what we've just seen for defining vectors. -But `array_of_displacements` is new and unique. - -When a struct is created, it occupies a single contiguous block of memory. But there is a catch. -For performance reasons, compilers insert arbitrary "padding" between each member. -This padding, known as [data structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment), optimises both the layout of the memory -and the access of it. -As a result, the memory layout of a struct may look like this instead: - -![Memory layout for a struct](fig/struct_memory_layout.png) - -Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about is not contiguous any more. -This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. -In practise, it serves a similar purpose of the stride in vectors. - -To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. 
-To do this, we can use the function `MPI_Get_address()`: - -```c -int MPI_Get_address{ - const void *location, /* A pointer to the variable we want the address for */ - MPI_Aint *address, /* The address of the variable, as an MPI Address Integer -- returned via pointer */ -}; -``` - -In the following example, we use `MPI_Type_create_struct()` and `MPI_Get_address()` to create a derived type for a struct with two members: - -```c -/* Define and initialize a struct, named foo, with an int and a double */ -struct MyStruct { - int id; - double value; -} foo = {.id = 0, .value = 3.1459}; - -/* Create arrays to describe the length of each member and their type */ -int count = 2; -int block_lengths[2] = {1, 1}; -MPI_Datatype block_types[2] = {MPI_INT, MPI_DOUBLE}; - -/* Now we calculate the displacement of each member, which are stored in an - MPI_Aint designed for storing memory addresses */ -MPI_Aint base_address; -MPI_Aint block_offsets[2]; - -MPI_Get_address(&foo, &base_address); /* First of all, we find the address of the start of the struct */ -MPI_Get_address(&foo.id, &block_offsets[0]); /* Now the address of the first member "id" */ -MPI_Get_address(&foo.value, &block_offsets[1]); /* And the second member "value" */ - -/* Calculate the offsets, by subtracting the address of each field from the - base address of the struct */ -for (int i = 0; i < 2; ++i) { - /* MPI_Aint_diff is a macro to calculate the difference between two - MPI_Aints and is a replacement for: - (MPI_Aint) ((char *) block_offsets[i] - (char *) base_address) */ - block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address); -} - -/* We finally can create out struct data type */ -MPI_Datatype struct_type; -MPI_Type_create_struct(count, block_lengths, block_offsets, block_types, &struct_type); -MPI_Type_commit(&struct_type); - -/* Another difference between vector and struct derived types is that in - MPI_Recv, we use the struct type. We have to do this because we aren't - receiving a contiguous block of a single type of date. By using the type, we - tell MPI_Recv how to understand the mix of data types and padding and how to - assign those back to recv_struct */ -if (my_rank == 0) { - MPI_Send(&foo, 1, struct_type, 1, 0, MPI_COMM_WORLD); -} else { - struct MyStruct recv_struct; - MPI_Recv(&recv_struct, 1, struct_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); -} - -/* Remember to free the derived type */ -MPI_Type_free(&struct_type); -``` - -:::::challenge{id=sending-a-struct, title="Sending a Struct"} -By using a derived data type, write a program to send the following struct `struct Node node` from one rank to another: - -```c -struct Node { - int id; - char name[16]; - double temperature; -}; - -struct Node node = { .id = 0, .name = "Dale Cooper", .temperature = 42}; -``` - -You may wish to use [this skeleton code](code/solutions/skeleton-example.c) as your stating point. - -::::solution -Your solution should look something like the code block below. When sending a *static* array (`name[16]`), we have to use a count of 16 in the `block_lengths` array for that member. 
- -```c -#include -#include - -struct Node { - int id; - char name[16]; - double temperature; -}; - -int main(int argc, char **argv) -{ - int my_rank; - int num_ranks; - MPI_Init(&argc, &argv); - MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); - MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); - - if (num_ranks != 2) { - if (my_rank == 0) { - printf("This example only works with 2 ranks\n"); - } - MPI_Abort(MPI_COMM_WORLD, 1); - } - - struct Node node = {.id = 0, .name = "Dale Cooper", .temperature = 42}; - - int block_lengths[3] = {1, 16, 1}; - MPI_Datatype block_types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE}; - - MPI_Aint base_address; - MPI_Aint block_offsets[3]; - MPI_Get_address(&node, &base_address); - MPI_Get_address(&node.id, &block_offsets[0]); - MPI_Get_address(&node.name, &block_offsets[1]); - MPI_Get_address(&node.temperature, &block_offsets[2]); - for (int i = 0; i < 3; ++i) { - block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address); - } - - MPI_Datatype node_struct; - MPI_Type_create_struct(3, block_lengths, block_offsets, block_types, &node_struct); - MPI_Type_commit(&node_struct); - - if (my_rank == 0) { - MPI_Send(&node, 1, node_struct, 1, 0, MPI_COMM_WORLD); - } else { - struct Node recv_node; - MPI_Recv(&recv_node, 1, node_struct, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - printf( - "Received node: id = %d name = %s temperature %f\n", - recv_node.id, recv_node.name, recv_node.temperature - ); - } - - MPI_Type_free(&node_struct); - - return MPI_Finalize(); -} -``` - -:::: -::::: - -:::::challenge{id=what-if-pointers, title="What If I Have a Pointer in My Struct?"} -Suppose we have the following struct with a pointer named `position` and some other fields: - -```c -struct Grid { - double *position; - int num_cells; -}; -grid.position = malloc(3 * sizeof(double)); -``` - -If we use `malloc()` to allocate memory for `position`, how would we send data in the struct and the memory we allocated one rank to another? -If you are unsure, try writing a short program to create a derived type for the struct. - -::::solution -The short answer is that we can't do it using a derived type, and will have to *manually* communicate the data separately. -The reason why can't use a derived type is because the address of `*position` is the address of the pointer. -The offset between `num_cells` and `*position` is the size of the pointer and whatever padding the compiler adds. -It is not the data which `position` points to. -The memory we allocated for `*position` is somewhere else in memory, as shown in the diagram below, and is non-contiguous with respect to the fields in the struct. - -![Memory layout for a struct with a pointer](fig/struct_with_pointer.png) -:::: -::::: - -::::callout - -## A different way to calculate displacements - -There are other ways to calculate the displacement, other than using what MPI provides for us. -Another common way is to use the `offsetof()` macro part of ``. `offsetof()` accepts two arguments, the first being the struct type and the second being the member to calculate the offset for. - -```c -#include -MPI_Aint displacements[2]; -displacements[0] = (MPI_Aint) offsetof(struct MyStruct, id); /* The cast to MPI_Aint is for extra safety */ -displacements[1] = (MPI_Aint) offsetof(struct MyStruct, value); -``` - -This method and the other shown in the previous examples both returns the same displacement values. -It's mostly a personal choice which you choose to use. 
-Some people prefer the "safety" of using `MPI_Get_address()` whilst others prefer to write more concise code with `offsetof()`. -Of course, if you're a Fortran programmer then you can't use the macro! -:::: - -## Dealing with other non-contiguous data - -The previous two sections covered how to communicate complex but structured data between ranks using derived datatypes. -However, there are *always* some edge cases which don't fit into a derived types. -For example, just in the last exercise we've seen that pointers and derived types don't mix well. -Furthermore, we can sometimes also reach performance bottlenecks when working with heterogeneous data which doesn't fit, or doesn't make sense to be, in a derived type, as each data type needs to be communicated in separate communication calls. -This can be especially bad if blocking communication is used! -For edge cases situations like this, we can use the `MPI_Pack()` and `MPI_Unpack()` functions to do things ourselves. - -Both `MPI_Pack()` and `MPI_Unpack()` are methods for manually arranging, packing and unpacking data into a contiguous buffer, for cases where regular communication methods and derived types don't work well or efficiently. -They can also be used to create self-documenting message, where the packed data contains additional elements which describe the size, structure and contents of the data. -But we have to be careful, as using packed buffers comes with additional overhead, in the form of increased memory usage and potentially more communication overhead as packing and unpacking data is not free. - -When we use `MPI_Pack()`, we take non-contiguous data (sometimes of different datatypes) and "pack" it into a contiguous memory buffer. -The diagram below shows how two (non-contiguous) chunks of data may be packed into a contiguous array using `MPI_Pack()`. - -![Layout of packed memory](fig/packed_buffer_layout.png) - -The coloured boxes in both memory representations (memory and pakced) are the same chunks of data. -The green boxes containing only a single number are used to document the number of elements in the block of elements they are adjacent to, in the contiguous buffer. -This is optional to do, but is generally good practise to include to create a self-documenting message. -From the diagram we can see that we have "packed" non-contiguous blocks of memory into a single contiguous block. -We can do this using `MPI_Pack()`. To reverse this action, and "unpack" the buffer, we use `MPI_Unpack()`. -As you might expect, `MPI_Unpack()` takes a buffer, created by `MPI_Pack()` and unpacks the data back into various memory address. - -To pack data into a contiguous buffer, we have to pack each block of data, one by one, into the contiguous buffer using the `MPI_Pack()` function: - -```c -int MPI_Pack( - const void *inbuf, /* The data we want to put into the buffer */ - int incount, /* The number of elements of the buffer */ - MPI_Datatype datatype, /* The datatype of the elements */ - void *outbuf, /* The contiguous buffer to pack the data into */ - int outsize, /* The size of the contiguous buffer, in bytes */ - int *position, /* A counter of how far into the contiguous buffer to write to */ - MPI_Comm comm /* The communicator the packed message will be sent using */ -); -``` - -In the above, `inbuf` is the data we want to pack into a contiguous buffer and `incount` and `datatype` define the number of elements in and the datatype of `inbuf`. 
-The parameter `outbuf` is the contiguous buffer the data is packed into, with `outsize` being the total size of the buffer in *bytes*. -The `position` argument is used to keep track of the current position, in bytes, where data is being packed into `outbuf`. - -Uniquely, `MPI_Pack()`, and `MPI_Unpack()` as well, measure the size of the contiguous buffer, `outbuf`, in bytes rather than in number of elements. -Given that `MPI_Pack()` is all about manually arranging data, we have to also manage the allocation of memory for `outbuf`. -But how do we allocate memory for it, and how much should we allocate? -Allocation is done by using `malloc()`. -Since `MPI_Pack()` works with `outbuf` in terms of bytes, the convention is to declare `outbuf` as a `char *`. -The amount of memory to allocate is simply the amount of space, in bytes, required to store all of the data we want to pack into it. -Just like how we would normally use `malloc()` to create an array. -If we had an integer array and a floating point array which we wanted to pack into the buffer, then the size required is easy to calculate: - -```c -/* The total buffer size is the sum of the bytes required for the int and float array */ -int size_int_array = num_int_elements * sizeof(int); -int size_float_array = num_float_elements * sizeof(float); -int buffer_size = size_int_array + size_float_array; -/* The buffer is a char *, but could also be cast as void * if you prefer */ -char *buffer = malloc(buffer_size * sizeof(char)); // a char is 1 byte, so sizeof(char) is optional -``` - -If we are also working with derived types, such as vectors or structs, then we need to find the size of those types. -By far the easiest way to handle these is to use `MPI_Pack_size()`, which supports derived datatypes through the `MPI_Datatype`: - -```c -int MPI_Pack_size( - int incount, /* The number of elements in the data */ - MPI_Datatype datatype, /* The datatype of the data*/ - MPI_Comm comm, /* The communicator the data will be sent over */ - int *size /* The calculated upper size limit for the buffer, in bytes */ -); -``` - -`MPI_Pack_size()` is a helper function to calculate the *upper bound* of memory required. -It is, in general, preferable to calculate the buffer size using this function, as it takes into account any implementation specific MPI detail and thus is more portable between implementations and systems. -If we wanted to calculate the memory required for three elements of some derived struct type and a `double` array, we would do the following: - -```c -int struct_array_size, float_array_size; -MPI_Pack_size(3, STRUCT_DERIVED_TYPE, MPI_COMM_WORLD, &struct_array_size); -MPI_Pack_size(50, MPI_DOUBLE. MPI_COMM_WORLD, &float_array_size); -int buffer_size = struct_array_size + float_array_size; -``` - -When a rank has received a contiguous buffer, it has to be unpacked into its constituent parts, one by one, using `MPI_Unpack()`: - -```c -int MPI_Unpack( - const void *inbuf, /* The contiguous buffer to unpack */ - int insize, /* The total size of the buffer, in bytes */ - int *position, /* The position, in bytes, for where to start unpacking from */ - void *outbuf, /* An array, or variable, to unpack data into -- this is the output */ - int outcount, /* The number of elements of data to unpack */ - MPI_Datatype datatype, /* The datatype of the elements to unpack */ - MPI_Comm comm, /* The communicator the message was sent using */ -); -``` - -The arguments for this function are essentially the reverse of `MPI_Pack()`. 
-Instead of being the buffer to pack into, `inbuf` is now the packed buffer and `position` is the position, in bytes, in the buffer where to unpacking from. -`outbuf` is then the variable we want to unpack into, and `outcount` is the number of elements of `datatype` to unpack. - -In the example below, `MPI_Pack()`, `MPI_Pack_size()` and `MPI_Unpack()` are used to communicate a (non-contiguous) 3 x 3 matrix. - -```c -/* Allocate and initialise a (non-contiguous) 2D matrix that we will pack into - a buffer */ -int num_rows = 3, num_cols = 3; -int **matrix = malloc(num_rows * sizeof(int *)); -for (int i = 0; i < num_rows; ++i) { - matrix[i] = malloc(num_cols * sizeof(int)); - for (int j = 0; i < num_cols; ++j) { - matrix[i][j] = num_cols * i + j; - } -} - -/* Determine the upper limit for the amount of memory the buffer requires. Since - this is a simple situation, we could probably have done this manually using - `num_rows * num_cols * sizeof(int)`. The size `max_buffer_size` is returned in - bytes */ -int max_buffer_size; -MPI_Pack_size(num_rows * num_cols, MPI_INT, MPI_COMM_WORLD, &max_buffer_size); - -/* The buffer we are packing into has to be allocated, note that it is a -char* array. That's because a char is 1 byte and packing and unpacking works in -bytes */ -char *packed_data = malloc(max_buffer_size); - -/* Pack each (non-contiguous) row into the packed buffer */ -int position = 0; -for (int i = 0; i < num_rows; ++i) { - MPI_Pack(matrix[i], num_cols, MPI_INT, packed_data, pack_buffer_size, &position,MPI_COMM_WORLD); -} - -/* Send the packed data to rank 1. To send a packed array, we use the MPI_PACKED - datatype with the count being the size of the buffer in bytes. To send and receive - the packed data, we can use any of the communication functions */ -MPI_Send(packed_data, max_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD); - -/* To receive packed data, we have to allocate another buffer and receive - MPI_PACKED elements into it */ -char *received_data = malloc(max_buffer_size); -MPI_Recv(received_data, max_buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - -/* Once we have the packed buffer, we can then unpack the data into the rows - of my_matrix */ -int position = 0; -int my_matrix[num_rows][num_cols]; -for (int i = 0; i < num_rows; ++i) { - MPI_Unpack(received_data, max_buffer_size, &position, my_matrix[i], num_cols, MPI_INT, MPI_COMM_WORLD); -} -``` - -::::callout - -## Blocking or non-blocking? - -The processes of packing data into a contiguous buffer does not happen asynchronously. -The same goes for unpacking data. But this doesn't restrict the packed data from being only sent synchronously. -The packed data can be communicated using any communication function, just like the previous derived types. -It works just as well to communicate the buffer using non-blocking methods, as it does using blocking methods. -:::: - -::::callout - -## What if the other rank doesn't know the size of the buffer? - -In some cases, the receiving rank may not know the size of the buffer used in `MPI_Pack()`. -This could happen if a message is sent and received in different functions, if some ranks have different branches through the program or if communication happens in a dynamic or non-sequential way. - -In these situations, we can use `MPI_Probe()` and `MPI_Get_count` to find the a message being sent and to get the number of elements in the message. 
- -```c -/* First probe for a message, to get the status of it */ -MPI_Status status; -MPI_Probe(0, 0, MPI_COMM_WORLD, &status); -/* Using MPI_Get_count we can get the number of elements of a particular data type */ -int message_size; -MPI_Get_count(&status, MPI_PACKED, &buffer_size); -/* MPI_PACKED represents an element of a "byte stream." So, buffer_size is the size of the buffer to allocate */ -char *buffer = malloc(buffer_size); -``` - -:::: - -:::::challenge{id=heterogeneous-data, title="Sending Heterogeneous Data in a Single Communication"} -Suppose we have two arrays below, where one contains integer data and the other floating point data. -Normally we would use multiple communication calls to send each type of data individually, for a known number of elements. -For this exercise, communicate both arrays using a packed memory buffer. - -```c -int int_data_count = 5; -int float_data_count = 10; - -int *int_data = malloc(int_data_count * sizeof(int)); -float *float_data = malloc(float_data_count * sizeof(float)); - -/* Initialize the arrays with some values */ -for (int i = 0; i < int_data_count; ++i) { - int_data[i] = i + 1; -} -for (int i = 0; i < float_data_count; ++i) { - float_data[i] = 3.14159 * (i + 1); -} -``` - -Since the arrays are dynamically allocated, in rank 0, you should also pack the number of elements in each array. -Rank 1 may also not know the size of the buffer. How would you deal with that? - -You can use this [skeleton code](code/solutions/08-pack-skeleton.c) to begin with. - -::::solution -The additional restrictions for rank 1 not knowing the size of the arrays or packed buffer add some complexity to receiving the packed buffer from rank 0. - -```c -#include -#include -#include - -int main(int argc, char **argv) -{ - int my_rank; - int num_ranks; - MPI_Init(&argc, &argv); - MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); - MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); - - if (num_ranks != 2) { - if (my_rank == 0) { - printf("This example only works with 2 ranks\n"); - } - MPI_Abort(MPI_COMM_WORLD, 1); - } - - if (my_rank == 0) { - int int_data_count = 5, float_data_count = 10; - int *int_data = malloc(int_data_count * sizeof(int)); - float *float_data = malloc(float_data_count * sizeof(float)); - for (int i = 0; i < int_data_count; ++i) { - int_data[i] = i + 1; - } - for (int i = 0; i < float_data_count; ++i) { - float_data[i] = 3.14159 * (i + 1); - } - - /* use MPI_Pack_size to determine how big the packed buffer needs to be */ - int buffer_size_count, buffer_size_int, buffer_size_float; - MPI_Pack_size(2, MPI_INT, MPI_COMM_WORLD, &buffer_size_count); /* 2 * INT because we will have 2 counts*/ - MPI_Pack_size(int_data_count, MPI_INT, MPI_COMM_WORLD, &buffer_size_int); - MPI_Pack_size(float_data_count, MPI_FLOAT, MPI_COMM_WORLD, &buffer_size_float); - int total_buffer_size = buffer_size_int + buffer_size_float + buffer_size_count; - - int position = 0; - char *buffer = malloc(total_buffer_size); - - /* Pack the data size, followed by the actually data */ - MPI_Pack(&int_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); - MPI_Pack(int_data, int_data_count, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); - MPI_Pack(&float_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); - MPI_Pack(float_data, float_data_count, MPI_FLOAT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); - - /* buffer is sent in one communication using MPI_PACKED */ - MPI_Send(buffer, total_buffer_size, MPI_PACKED, 1, 0, 
MPI_COMM_WORLD); - - free(buffer); - free(int_data); - free(float_data); - } else { - int buffer_size; - MPI_Status status; - MPI_Probe(0, 0, MPI_COMM_WORLD, &status); - MPI_Get_count(&status, MPI_PACKED, &buffer_size); - - char *buffer = malloc(buffer_size); - MPI_Recv(buffer, buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status); - - int position = 0; - int int_data_count, float_data_count; - - /* Unpack an integer why defines the size of the integer array, - then allocate space for an unpack the actual array */ - MPI_Unpack(buffer, buffer_size, &position, &int_data_count, 1, MPI_INT, MPI_COMM_WORLD); - int *int_data = malloc(int_data_count * sizeof(int)); - MPI_Unpack(buffer, buffer_size, &position, int_data, int_data_count, MPI_INT, MPI_COMM_WORLD); - - MPI_Unpack(buffer, buffer_size, &position, &float_data_count, 1, MPI_INT, MPI_COMM_WORLD); - float *float_data = malloc(float_data_count * sizeof(float)); - MPI_Unpack(buffer, buffer_size, &position, float_data, float_data_count, MPI_FLOAT, MPI_COMM_WORLD); - - printf("int data: ["); - for (int i = 0; i < int_data_count; ++i) { - printf(" %d", int_data[i]); - } - printf(" ]\n"); - - printf("float data: ["); - for (int i = 0; i < float_data_count; ++i) { - printf(" %f", float_data[i]); - } - printf(" ]\n"); - - free(int_data); - free(float_data); - free(buffer); - } - - return MPI_Finalize(); -} -``` - -:::: -::::: diff --git a/high_performance_computing/hpc_mpi/09_porting_serial_to_mpi.md b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md similarity index 77% rename from high_performance_computing/hpc_mpi/09_porting_serial_to_mpi.md rename to high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md index c64d3b94..42f924aa 100644 --- a/high_performance_computing/hpc_mpi/09_porting_serial_to_mpi.md +++ b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md @@ -1,9 +1,7 @@ --- name: Porting Serial Code to MPI -dependsOn: [ - high_performance_computing.hpc_mpi.08_communication_patterns -] -tags: [] +dependsOn: [high_performance_computing.hpc_mpi.07-derived-data-types] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -14,23 +12,19 @@ learningOutcomes: - Identify which parts of a codebase would benefit from parallelisation, and those that need to be done serially or only once. - Convert a serial scientific code into a parallel code. - Differentiate between choices of communication pattern and algorithm design. - --- In this section we will look at converting a complete code from serial to parallel in a number of steps. ## An Example Iterative Poisson Solver -This episode is based on a code that solves the Poisson's equation using an iterative method. -Poisson's equation appears in almost every field in physics, and is frequently used to model many physical phenomena such as heat conduction, and applications of this equation exist for both two and three dimensions. -In this case, the equation is used in a simplified form to describe how heat diffuses in a one dimensional metal stick. +TThis episode is based on a code that solves the Poisson's equation using an iterative method. Poisson's equation appears in almost every field in physics, and is frequently used to model many physical phenomena such as heat conduction, and applications of this equation exist for both two and three dimensions. In this case, the equation is used in a simplified form to describe how heat diffuses in a one dimensional metal stick. 
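
As a sketch of the numerical method (using the same notation as the code that follows, and glossing over the boundary handling), each slice of the stick obeys the one-dimensional Poisson equation

$$ \frac{d^2 u}{d x^2} = \rho(x) $$

and, for a slice width $$h$$, each iteration applies the finite-difference update

$$ u_i^{\mathrm{new}} = \frac{1}{2} \left( u_{i-1} + u_{i+1} - h^2 \rho_i \right) $$

which is the expression implemented in `poisson_step()` below.
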
In the simulation the stick is split into a given number of slices, each with a constant temperature. ![Stick divided into separate slices with touching boundaries at each end](fig/poisson_stick.png) -The temperature of the stick itself across each slice is initially set to zero, whilst at one boundary of the stick the amount of heat is set to 10. -The code applies steps that simulates heat transfer along it, bringing each slice of the stick closer to a solution until it reaches a desired equilibrium in temperature along the whole stick. +The temperature of the stick itself across each slice is initially set to zero, whilst at one boundary of the stick the amount of heat is set to 10. The code applies steps that simulates heat transfer along it, bringing each slice of the stick closer to a solution until it reaches a desired equilibrium in temperature along the whole stick. Let's download the code, which can be found [here](code/examples/poisson/poisson.c), and take a look at it now. @@ -44,7 +38,7 @@ We'll begin by looking at the `main()` function at a high level. ... -int main(int argc, char** argv) { +int main(int argc, char **argv) { // The heat energy in each block float *u, *unew, *rho; @@ -75,11 +69,11 @@ The next step is to initialise the initial conditions of the simulation: ```c // Set up parameters h = 0.1; - hsq = h*h; + hsq = h * h; residual = 1e-5; // Initialise the u and rho field to 0 - for (i = 0; i <= GRIDSIZE+1; i++) { + for (i = 0; i <= GRIDSIZE + 1; ++i) { u[i] = 0.0; rho[i] = 0.0; } @@ -100,9 +94,11 @@ Next, the code iteratively calls `poisson_step()` to calculate the next set of r ```c // Run iterations until the field reaches an equilibrium // and no longer changes - for (i = 0; i < NUM_ITERATIONS; i++) { + for (i = 0; i < NUM_ITERATIONS; ++i) { unorm = poisson_step(u, unew, rho, hsq, GRIDSIZE); - if (sqrt(unorm) < sqrt(residual)) break; + if (sqrt(unorm) < sqrt(residual)) { + break; + } } ``` @@ -110,7 +106,7 @@ Finally, just for show, the code outputs a representation of the result - the en ```c printf("Final result:\n"); - for (int j = 1; j <= GRIDSIZE; j++) { + for (int j = 1; j <= GRIDSIZE; ++j) { printf("%d-", (int) u[j]); } printf("\n"); @@ -124,9 +120,9 @@ The `poisson_step()` progresses the simulation by a single step. 
After it accepts its arguments, for each slice in the stick it calculates a new value based on the temperatures of its neighbours: ```c - for (int i = 1; i <= points; i++) { + for (int i = 1; i <= points; ++i) { float difference = u[i-1] + u[i+1]; - unew[i] = 0.5 * (difference - hsq*rho[i]); + unew[i] = 0.5 * (difference - hsq * rho[i]); } ``` @@ -134,10 +130,9 @@ Next, it calculates a value representing the overall cumulative change in temper ```c unorm = 0.0; - for (int i = 1; i <= points; i++) { - - float diff = unew[i]-u[i]; - unorm += diff*diff; + for (int i = 1; i <= points; ++i) { + float diff = unew[i] - u[i]; + unorm += diff * diff; } ``` @@ -145,8 +140,7 @@ And finally, the state of the stick is set to the newly calculated values, and ` ```c // Overwrite u with the new field - for (int i = 1; i <= points; i++) { - + for (int i = 1 ;i <= points; ++i) { u[i] = unew[i]; } @@ -159,7 +153,7 @@ And finally, the state of the stick is set to the newly calculated values, and ` You may compile and run the code as follows: ```bash -gcc poisson.c -o poisson +gcc poisson.c -o poisson -lm ./poisson ``` @@ -171,8 +165,7 @@ Final result: Run completed in 182 iterations with residue 9.60328e-06 ``` -Here, we can see a basic representation of the temperature of each slice of the stick at the end of the simulation, and how the initial `10.0` temperature applied at the beginning of the stick has transferred along it to this final state. -Ordinarily, we might output the full sequence to a file, but we've simplified it for convenience here. +Here, we can see a basic representation of the temperature of each slice of the stick at the end of the simulation, and how the initial `10.0` temperature applied at the beginning of the stick has transferred along it to this final state. Ordinarily, we might output the full sequence to a file, but we've simplified it for convenience here. ::::callout{variant="warning"} @@ -190,12 +183,10 @@ gcc -poisson.c -o poisson -lm ## Approaching Parallelism -So how should we make use of an MPI approach to parallelise this code? -A good place to start is to consider the nature of the data within this computation, and what we need to achieve. +So how should we make use of an MPI approach to parallelise this code? A good place to start is to consider the nature of the data within this computation, and what we need to achieve. For a number of iterative steps, currently the code computes the next set of values for the entire stick. -So at a high level one approach using MPI would be to split this computation by dividing the stick into sections each with a number of slices, and have a separate rank responsible for computing iterations for those slices within its given section. -Essentially then, for simplicity we may consider each section a stick on its own, with either two neighbours at touching boundaries (for middle sections of the stick), or one touching boundary neighbour (for sections at the beginning and end of the stick, which also have either a start or end stick boundary touching them). For example, considering a `GRIDSIZE` of 12 and three ranks: +So at a high level one approach using MPI would be to split this computation by dividing the stick into sections each with a number of slices, and have a separate rank responsible for computing iterations for those slices within its given section. 
Essentially then, for simplicity we may consider each section a stick on its own, with either two neighbours at touching boundaries (for middle sections of the stick), or one touching boundary neighbour (for sections at the beginning and end of the stick, which also have either a start or end stick boundary touching them). For example, considering a `GRIDSIZE` of 12 and three ranks: ![Stick divisions subdivided across ranks](fig/poisson_stick_subdivided.png) @@ -302,7 +293,7 @@ Since we're not initialising for the entire stick (`GRIDSIZE`) but just the sect ```c // Initialise the u and rho field to 0 - for (i = 0; i <= rank_gridsize+1; i++) { + for (i = 0; i <= rank_gridsize + 1; ++i) { u[i] = 0.0; rho[i] = 0.0; } @@ -317,8 +308,7 @@ As we found out in the *Serial Regions* exercise, we need to ensure that only a u[0] = 10.0; ``` -We also need to collect the overall results from all ranks and output that collected result, but again, only for rank zero. -To collect the results from all ranks (held in `u`) we can use `MPI_Gather()`, to send all `u` results to rank zero to hold in a results array. +We also need to collect the overall results from all ranks and output that collected result, but again, only for rank zero. To collect the results from all ranks (held in `u`) we can use `MPI_Gather()`, to send all `u` results to rank zero to hold in a results array. Note that this will also include the result from rank zero! Add the following to the list of declarations at the start of `main()`: @@ -334,7 +324,7 @@ Then before `MPI_Finalize()` let's amend the code to the following: // We need to send data starting from the second element of u, since u[0] is a boundary resultbuf = malloc(sizeof(*resultbuf) * GRIDSIZE); MPI_Gather(&u[1], rank_gridsize, MPI_FLOAT, resultbuf, rank_gridsize, MPI_FLOAT, 0, MPI_COMM_WORLD); - + if (rank == 0) { printf("Final result:\n"); for (int j = 0; j < GRIDSIZE; j++) { @@ -375,23 +365,17 @@ double poisson_step( ### `poisson_step()`: Calculating a Global `unorm` We know from `Parallelism and Data Exchange` that we need to calculate `unorm` across all ranks. -We already have it calculated for separate ranks, so need to *reduce* those values in an MPI sense to a single summed value. -For this, we can use `MPI_Allreduce()`. +We already have it calculated for separate ranks, so need to *reduce* those values in an MPI sense to a single summed value. For this, we can use `MPI_Allreduce()`. Insert the following into the `poisson_step()` function, putting the declarations at the top of the function: ```c double unorm, global_unorm; -``` - -Then add `MPI_Allreduce()` after the calculation of `unorm`: -```c MPI_Allreduce(&unorm, &global_unorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); ``` -So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. -We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: +So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. 
We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: ```c return global_unorm; @@ -421,38 +405,33 @@ So following the `MPI_Allreduce()` we've just added, let's deal with odd ranks f // The u field has been changed, communicate it to neighbours // With blocking communication, half the ranks should send first // and the other half should receive first - if ((rank%2) == 1) { + if ((rank % 2) == 1) { // Ranks with odd number send first - // Send data down from rank to rank-1 + // Send data down from rank to rank - 1 sendbuf = unew[1]; - MPI_Send(&sendbuf, 1, MPI_FLOAT, rank-1, 1, MPI_COMM_WORLD); - // Receive dat from rank-1 - MPI_Recv(&recvbuf, 1, MPI_FLOAT, rank-1, 2, MPI_COMM_WORLD, &mpi_status); + MPI_Send(&sendbuf, 1, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD); + // Receive data from rank - 1 + MPI_Recv(&recvbuf, 1, MPI_FLOAT, rank - 1, 2, MPI_COMM_WORLD, &mpi_status); u[0] = recvbuf; - if (rank != (n_ranks-1)) { - // Send data up to rank+1 (if I'm not the last rank) - MPI_Send(&u[points], 1, MPI_FLOAT, rank+1, 1, MPI_COMM_WORLD); - // Receive data from rank+1 - MPI_Recv(&u[points+1], 1, MPI_FLOAT, rank+1, 2, MPI_COMM_WORLD, &mpi_status); + if (rank != (n_ranks - 1)) { + // Send data up to rank + 1 (if I'm not the last rank) + MPI_Send(&u[points], 1, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD); + // Receive data from rank + 1 + MPI_Recv(&u[points + 1], 1, MPI_FLOAT, rank + 1, 2, MPI_COMM_WORLD, &mpi_status); } ``` -Here we use C's inbuilt modulus operator (`%`) to determine if the rank is odd. -If so, we exchange some data with the rank preceding us, and the one following. +Here we use C's inbuilt modulus operator (`%`) to determine if the rank is odd. If so, we exchange some data with the rank preceding us, and the one following. -We first send our newly computed leftmost value (at position `1` in our array) to the rank preceding us. -Since we're an odd rank, we can always assume a rank preceding us exists, -since the earliest odd rank 1 will exchange with rank 0. -Then, we receive the rightmost boundary value from that rank. +We first send our newly computed leftmost value (at position `1` in our array) to the rank preceding us. Since we're an odd rank, we can always assume a rank preceding us exists, since the earliest odd rank 1 will exchange with rank 0. Then, we receive the rightmost boundary value from that rank. -Then, if the rank following us exists, we do the same, but this time we send the rightmost value at the end of our stick section, -and receive the corresponding value from that rank. +Then, if the rank following us exists, we do the same, but this time we send the rightmost value at the end of our stick section, and receive the corresponding value from that rank. These exchanges mean that - as an even rank - we now have effectively exchanged the states of the start and end of our slices with our respective neighbours. -And now we need to do the same for those neighbours (the even ranks), mirroring the same communication pattern but in the opposite order of receive/send. -From the perspective of evens, it should look like the following (highlighting the two even ranks): + +And now we need to do the same for those neighbours (the even ranks), mirroring the same communication pattern but in the opposite order of receive/send. 
From the perspective of evens, it should look like the following (highlighting the two even ranks): ![Communication strategy - even ranks first receive from odd ranks, then send to them](fig/poisson_comm_strategy_2.png) @@ -480,7 +459,8 @@ Once complete across all ranks, every rank will then have the slice boundary dat ### Running our Parallel Code -Now we have the parallelised code in place, we can compile and run it, e.g.: +You can obtain a full version of the parallelised Poisson code from [here](code/examples/poisson/poisson_mpi.c). Now we have the parallelised code in place, we can compile and run it, e.g.: + ```bash mpicc poisson_mpi.c -o poisson_mpi @@ -505,45 +485,18 @@ So we should test once we have an initial MPI version, and as our code develops, :::::challenge{id=an-initial-test, title="An Initial Test"} Test the MPI version of your code against the serial version, using 1, 2, 3, and 4 ranks with the MPI version. Are the results as you would expect? -What happens if you test with 5 ranks, and why? Write a simple test into the code that would catch the error using the `assert(condition)` function from the `assert.h` library, which will terminate the program if `condition` evalutes to `false`. +What happens if you test with 5 ranks, and why? ::::solution Using these ranks, the MPI results should be the same as our serial version. Using 5 ranks, our MPI version yields `9-8-7-6-5-4-3-2-1-0-0-0-` which is incorrect. -This is because the `rank_gridsize = GRIDSIZE / n_ranks` calculation becomes `rank_gridsize = 12 / 5`, which produces 2.4. -This is then converted to the integer 2, which means each of the 5 ranks is only operating on 2 slices each, for a total of 10 slices. -This doesn't fill `resultbuf` with results representing an expected `GRIDSIZE` of 12, hence the incorrect answer. - -This highlights another aspect of complexity we need to take into account when writing such parallel implementations, where we must ensure a problem space is correctly subdivided. We especially want to prevent situations where the code *appears* to run without a crash or error, but still gives a completely wrong answer. - -We can catch the error using an assertion by importing the `assert.h` library at the top of the file: - -```c -#include -``` - -Then we can add the check itself just after calculating `rank_gridsize`, where we test to see if the gridsize calculation would have left a non-zero remainder. This means there are cells that can't be evenly distributed across the ranks. - -If we don't add any conditions, we'll get one error message per rank, so we want to condition it to only run on a single one: - -```c - // Test that the grid can be subdivided between the ranks properly - if (rank == 0) { - assert(GRIDSIZE % n_ranks == 0); - } -``` - -This should give us a helpful error when we try to run the code for an invalid number of ranks, instead of simply giving us the wrong answer at the end: - -```text -poisson_mpi: poisson_mpi.c:105: main: Assertion `GRIDSIZE % n_ranks == 0' failed. -``` +This is because the `rank_gridsize = GRIDSIZE / n_ranks` calculation becomes `rank_gridsize = 12 / 5`, which produces 2.4. This is then converted to the integer 2, which means each of the 5 ranks is only operating on 2 slices each, for a total of 10 slices. This doesn't fill `resultbuf` with results representing an expected `GRIDSIZE` of 12, hence the incorrect answer. 
-If we wanted our code to be more flexible, we could implement a more careful way to subdivide the slices across the ranks, with some ranks obtaining more slices to make up the shortfall correctly. +This highlights another aspect of complexity we need to take into account when writing such parallel implementations, where we must ensure a problem space is correctly subdivided. In this case, we could implement a more careful way to subdivide the slices across the ranks, with some ranks obtaining more slices to make up the shortfall correctly. :::: ::::: -:::::challenge{id=limitations, title="Limitations"} +:::::challenge{id=limitations, title="Limitations!"} You may remember that for the purposes of this episode we've assumed a homogeneous stick, by setting the `rho` coefficient to zero for every slice. As a thought experiment, if we wanted to address this limitation and model an inhomogeneous stick with different static coefficients for each slice, how could we amend our code to allow this correctly for each slice across all ranks? diff --git a/high_performance_computing/hpc_mpi/10_optimising_mpi.md b/high_performance_computing/hpc_mpi/09_optimising_mpi.md similarity index 91% rename from high_performance_computing/hpc_mpi/10_optimising_mpi.md rename to high_performance_computing/hpc_mpi/09_optimising_mpi.md index 521814ad..4802af2e 100644 --- a/high_performance_computing/hpc_mpi/10_optimising_mpi.md +++ b/high_performance_computing/hpc_mpi/09_optimising_mpi.md @@ -1,10 +1,7 @@ --- name: Optimising MPI Applications -dependsOn: [ - high_performance_computing.hpc_mpi.09_porting_serial_to_mpi, - high_performance_computing.hpc_intro -] -tags: [] +dependsOn: [high_performance_computing.hpc_intro, high_performance_computing.hpc_mpi.08_porting_serial_to_mpi] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -15,8 +12,8 @@ learningOutcomes: - Describe and differentiate between strong and weak scaling. - Test the strong and weak scaling performance of our MPI code. - Use a profiler to identify performance characteristics of our MPI application. - --- + Now we have parallelised our code, we should determine how well it performs. Given the various ways code can be parallellised, the underlying scientific implementation,and the type and amount of data the code is expected to process, the performance of different parallelised code can vary hugely under different circumstances, @@ -57,14 +54,14 @@ as the program will always take at least the length of the serial portion. ### Amdahl's Law and Strong Scaling -There is a theoretical limit in what parallelization can achieve, and it is encapsulated in "Amdahl's Law": +There is a theoretical limit in what parallelisation can achieve, and it is encapsulated in "Amdahl's Law": $$ \mathrm{speedup} = 1 / (s + p / N) $$ Where: - $$s$$ is the proportion of execution time spent on the serial part -- $$p$$ is the proportion of execution time spent on the part that can be parallelized +- $$p$$ is the proportion of execution time spent on the part that can be parallelised - $$N$$ is the number of processors Amdahl’s law states that, for a fixed problem, the upper limit of speedup is determined by the serial fraction of the code - most real world applications have some serial portion or unintended delays (such as communication overheads) which will limit the code’s scalability. 
@@ -77,7 +74,7 @@ Amdahl’s law states that, for a fixed problem, the upper limit of speedup is d ## Amdahl's Law in Practice Consider a program that takes 20 hours to run using one core. -If a particular part of the rogram, which takes one hour to execute, cannot be parallelized (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelized (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. +If a particular part of the rogram, which takes one hour to execute, cannot be parallelised (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelised (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelised execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (when N = ∞, speedup = 1/s = 20). :::: @@ -187,12 +184,9 @@ The increase in runtime is probably due to the memory bandwidth of the node bein The other significant factors in the speed of a parallel program are communication speed and latency. -Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying hardware for the communication. -Latency consists of the software latency (how long the operating system needs in order to prepare for a communication), and the hardware latency (how long the hardware takes to -send/receive even a small bit of data). +Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying hardware for the communication. Latency consists of the software latency (how long the operating system needs in order to prepare for a communication), and the hardware latency (how long the hardware takes to send/receive even a small bit of data). -For a fixed-size problem, the time spent in communication is not significant when the number of ranks is small and the execution of parallel regions gets faster with the number of ranks. -But if we keep increasing the number of ranks, the time spent in communication grows when multiple cores are involved with communication. +For a fixed-size problem, the time spent in communication is not significant when the number of ranks is small and the execution of parallel regions gets faster with the number of ranks. But if we keep increasing the number of ranks, the time spent in communication grows when multiple cores are involved with communication. ### Surface-to-Volume Ratio @@ -202,7 +196,7 @@ The whole data which a CPU or a core computes is the sum of the two. The data un The surface data requires communications. he more surface there is, the more communications among CPUs/cores is needed, and the longer the program will take to finish. -Due to Amdahl's law, you want to minimize the number of communications for the same surface since each communication takes finite amount of time to prepare (latency). +Due to Amdahl's law, you want to minimise the number of communications for the same surface since each communication takes finite amount of time to prepare (latency). This suggests that the surface data be exchanged in one communication if possible, not small parts of the surface data exchanged in multiple communications. Of course, sequential consistency should be obeyed when the surface data is exchanged. 
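
As a concrete sketch of this advice (the buffer and rank variables here are hypothetical placeholders, not taken from the lesson code), sending a whole boundary row in one message pays the latency cost once rather than once per element:

```c
// boundary_row, num_cols and neighbour_rank are placeholder names for this sketch

// Costly: one message per surface element, so the latency is paid num_cols times
for (int j = 0; j < num_cols; ++j) {
    MPI_Send(&boundary_row[j], 1, MPI_DOUBLE, neighbour_rank, 0, MPI_COMM_WORLD);
}

// Better: the whole boundary row as a single message, paying the latency once
MPI_Send(boundary_row, num_cols, MPI_DOUBLE, neighbour_rank, 0, MPI_COMM_WORLD);
```

The matching receives on the neighbouring rank follow the same pattern.
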
@@ -212,10 +206,10 @@ Now we have a better understanding of how our code scales with resources and pro But we should be careful! > "We should forget about small efficiencies, say about 97% of the time: -> premature optimization is the root of all evil." -- Donald Knuth +> premature optimisation is the root of all evil." -- Donald Knuth -Essentially, before attempting to optimize your own code, you should profile it. -Typically, most of the runtime is spent in a few functions/subroutines, so you should focus your optimization efforts on those parts of the code. +Essentially, before attempting to optimise your own code, you should profile it. +Typically, most of the runtime is spent in a few functions/subroutines, so you should focus your optimisation efforts on those parts of the code. The good news is that there are helpful tools known as *profilers* that can help us. Profilers help you find out where a program is spending its time and pinpoint places where optimising it makes sense. @@ -258,18 +252,14 @@ Ordinarily when profiling our code using such a tool, it is advisable to create Fortunately that's something we can readily configure with our `poisson_mpi.c` code. For now, set `MAX_ITERATIONS` to `25000` and `GRIDSIZE` to `512`. -We first load the module for Performance Reports. -How you do this will vary site-to-site, but for COSMA on DiRAC we can do: +We first load the module for Performance Reports. Remember that how you do this will vary site-to-site. ```bash module load armforge/23.1.0 module load allinea/ddt/18.1.2 ``` -Next, we run the executable using Performance Reports -to analyse the program execution. -Create a new version of our SLURM submission script we used before, -which includes the following at the bottom of the script instead: +Next, we run the executable using Performance Reports to analyse the program execution. Create a new version of our SLURM submission script we used before, which includes the following at the bottom of the script instead: ```bash module load armforge/23.1.0 @@ -327,16 +317,15 @@ spent in the actual compute sections of the code. :::::challenge{id=profile-poisson, title="Profile Your Poisson Code"} Compile, run and analyse your own MPI version of the poisson code. -How closely does it match the performance above? What are the main differences? -Try reducing the number of processes used, rerun and investigate the profile. -Is it still MPI-bound? +How closely does it match the performance above? What are the main differences? +Try reducing the number of processes used, rerun and investigate the profile. Is it still MPI-bound? Increase the problem size, recompile, rerun and investigate the profile. What has changed now? ::::: :::::challenge{id=iterative-improvement, title="Iterative Improvement"} -In the Poisson code, try changing the location of the calls to `MPI_Send`. How does this affect performance? +In the Poisson code, try changing the location of the calls to `MPI_Send()`. How does this affect performance? 
::::: ::::callout{variant="tip"} diff --git a/high_performance_computing/hpc_mpi/08_communication_patterns.md b/high_performance_computing/hpc_mpi/10_communication_patterns.md similarity index 75% rename from high_performance_computing/hpc_mpi/08_communication_patterns.md rename to high_performance_computing/hpc_mpi/10_communication_patterns.md index 38abef19..d35d4bcf 100644 --- a/high_performance_computing/hpc_mpi/08_communication_patterns.md +++ b/high_performance_computing/hpc_mpi/10_communication_patterns.md @@ -1,9 +1,7 @@ --- -name: A Common Communication Patterns -dependsOn: [ - high_performance_computing.hpc_mpi.07_advanced_communication -] -tags: [] +name: Common Communication Patterns +dependsOn: [high_performance_computing.hpc_mpi.09_optimising_mpi] +tags: [mpi] attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -13,7 +11,6 @@ attribution: learningOutcomes: - Learn and understand common communication patterns in MPI programs. - Determine what communication pattern you should use for your own MPI applications. - --- We have now come across the basic building blocks we need to create an MPI application. @@ -52,28 +49,28 @@ Each row in the resulting matrix depends on a single row in matrix A, and each c To split the calculation across ranks, one approach would be to *scatter* rows from matrix A and calculate the result for that scattered data and to combine the results from each rank to get the full result. ```c -/* Determine how many rows each matrix will compute and allocate space for a receive buffer - receive scattered subsets from root rank. We'll use 1D arrays to store the matrices, as it - makes life easier when using scatter and gather */ +// Determine how many rows each matrix will compute and allocate space for a receive buffer +// receive scattered subsets from root rank. We'll use 1D arrays to store the matrices, as it +// makes life easier when using scatter and gather int rows_per_rank = num_rows_a / num_ranks; double *rank_matrix_a = malloc(rows_per_rank * num_rows_a * sizeof(double)); double *rank_matrix_result = malloc(rows_per_rank * num_cols_b * sizeof(double)); -/* Scatter matrix_a across ranks into rank_matrix_a. Each rank will compute a subset of - the result for the rows in rank_matrix_a */ -MPI_Scatter(matrix_a, rows_per_ranks * num_cols_a, MPI_DOUBLE, rank_matrix_a, rows_per_ranks * num_cols_a, +// Scatter matrix_a across ranks into rank_matrix_a. Each rank will compute a subset of +// the result for the rows in rank_matrix_a +MPI_Scatter(matrix_a, rows_per_rank * num_cols_a, MPI_DOUBLE, rank_matrix_a, rows_per_rank * num_cols_a, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD); -/* Broadcast matrix_b to all ranks, because matrix_b was only created on the root rank - and each sub-calculation needs to know all elements in matrix_b */ +// Broadcast matrix_b to all ranks, because matrix_b was only created on the root rank +// and each sub-calculation needs to know all elements in matrix_b MPI_Bcast(matrix_b, num_rows_b * num_cols_b, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD); -/* Function to compute result for the subset of rows of matrix_a */ +// Function to compute result for the subset of rows of matrix_a multiply_matrices(rank_matrix_a, matrix_b, rank_matrix_result); -/* Use gather communication to get each rank's result for rank_matrix_a * matrix_b into receive - buffer `matrix_result`. 
Our life is made easier since rank_matrix and matrix_result are flat (and contiguous) - arrays, so we don't need to worry about memory layout*/ +// Use gather communication to get each rank's result for rank_matrix_a * matrix_b into receive +// buffer `matrix_result`. Our life is made easier since rank_matrix and matrix_result are flat (and contiguous) +// arrays, so we don't need to worry about memory layout MPI_Gather(rank_matrix_result, rows_per_rank * num_cols_b, MPI_DOUBLE, matrix_result, rows_per_rank * num_cols_b, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD); ``` @@ -96,36 +93,36 @@ Since each point generated and its position within the circle is completely inde To parallelise the problem, each rank generates a sub-set of the total number of points and a reduction is done at the end, to calculate the total number of points within the circle from the entire sample. ```c -/* 1 billion points is a lot, so we should parallelise this calculation */ +// 1 billion points is a lot, so we should parallelise this calculation int total_num_points = (int)1e9; -/* Each rank will check an equal number of points, with their own - counter to track the number of points falling within the circle */ +// Each rank will check an equal number of points, with their own +// counter to track the number of points falling within the circle int points_per_rank = total_num_points / num_ranks; int rank_points_in_circle = 0; -/* Seed each rank's RNG with a unique seed, otherwise each rank will have an - identical result and it would be the same as using `points_per_rank` in total - rather than `total_num_points` */ +// Seed each rank's RNG with a unique seed, otherwise each rank will have an +// identical result and it would be the same as using `points_per_rank` in total +// rather than `total_num_points` srand(time(NULL) + my_rank); -/* Generate a random x and y coordinate (between 0 - 1) and check to see if that - point lies within the unit circle */ +// Generate a random x and y coordinate (between 0 - 1) and check to see if that +// point lies within the unit circle for (int i = 0; i < points_per_rank; ++i) { double x = (double)rand() / RAND_MAX; double y = (double)rand() / RAND_MAX; - if (x * x + y * y <= 1.0) { - rank_points_in_circle++; /* It's in the circle, so increment the counter */ + if ((x * x) + (y * y) <= 1.0) { + rank_points_in_circle++; // It's in the circle, so increment the counter } } -/* Perform a reduction to sum up `rank_points_in_circle` across all ranks, this - will be the total number of points in a circle for `total_num_point` iterations */ +// Perform a reduction to sum up `rank_points_in_circle` across all ranks, this +// will be the total number of points in a circle for `total_num_point` iterations int total_points_in_circle; MPI_Reduce(&rank_points_in_circle, &total_points_in_circle, 1, MPI_INT, MPI_SUM, ROOT_RANK, MPI_COMM_WORLD); -/* The estimate for π is proportional to the ratio of the points in the circle and the number of - points generated */ +//The estimate for π is proportional to the ratio of the points in the circle and the number of +// points generated if (my_rank == ROOT_RANK) { double pi = 4.0 * total_points_in_circle / total_num_points; printf("Estimated value of π = %f\n", pi); @@ -180,17 +177,17 @@ This is sometimes a better approach, as it allows for more efficient and balance An example of 2D domain decomposition is shown in the next example, which uses a derived type (from the previous episode) to discretise the image into smaller rectangles and to scatter the 
smaller sub-domains to the other ranks. ```c -/* We have to first calculate the size of each rectangular region. In this example, we have - assumed that the dimensions are perfectly divisible. We can determine the dimensions for the - decomposition by using MPI_Dims_create() */ +// We have to first calculate the size of each rectangular region. In this example, we have +// assumed that the dimensions are perfectly divisible. We can determine the dimensions for the +// decomposition by using MPI_Dims_create() int rank_dims[2] = { 0, 0 }; MPI_Dims_create(num_ranks, 2, rank_dims); int num_rows_per_rank = num_rows / rank_dims[0]; int num_cols_per_rank = num_cols / rank_dims[1]; int num_elements_per_rank = num_rows_per_rank * num_cols_per_rank; -/* The rectangular blocks we create are not contiguous in memory, so we have to use a - derived data type for communication */ +// The rectangular blocks we create are not contiguous in memory, so we have to use a +// derived data type for communication MPI_Datatype sub_array_t; int count = num_rows_per_rank; int blocklength = num_cols_per_rank; @@ -198,8 +195,8 @@ int stride = num_cols; MPI_Type_vector(count, blocklength, stride, MPI_DOUBLE, &sub_array_t); MPI_Type_commit(&sub_array_t); -/* MPI_Scatter (and similar collective functions) do not work well with this sort of - topology, so we unfortunately have to scatter the array manually */ +// MPI_Scatter (and similar collective functions) do not work well with this sort of +// topology, so we unfortunately have to scatter the array manually double *rank_image = malloc(num_elements_per_rank * sizeof(double)); scatter_sub_arrays_to_other_ranks(image, rank_image, sub_array_t, rank_dims, my_rank, num_rows_per_rank, num_cols_per_rank, num_elements_per_rank, num_cols); @@ -213,65 +210,56 @@ As mentioned in the previous code example, distributing the 2D sub-domains acros Therefore, we have to transfer the data manually using point-to-point communication. An example of how can be done is shown below. 
```c -/* Function to convert row and col coordinates into an index for a 1d array */ +// Function to convert row and col coordinates into an index for a 1d array int index_into_2d(int row, int col, int num_cols) { return row * num_cols + col; } -/* Fairly complex function to send sub-arrays of `image` to the other ranks */ -void scatter_sub_arrays_to_other_ranks( - double *image, double *rank_image, MPI_Datatype sub_array_t, int rank_dims[2], - int my_rank, int num_cols_per_rank, int num_rows_per_rank, - int num_elements_per_rank, int num_cols -) +// Fairly complex function to send sub-arrays of `image` to the other ranks +void scatter_sub_arrays_to_other_ranks(double *image, double *rank_image, MPI_Datatype sub_array_t, int rank_dims[2], + int my_rank, int num_cols_per_rank, int num_rows_per_rank, + int num_elements_per_rank, int num_cols) { - if (my_rank == ROOT_RANK) { - int dest_rank = 0; - for (int i = 0; i < rank_dims[0]) { - for (int j = 0; j < rank_dims[1]) { - - /* Send sub array to a non-root rank */ - if(dest_rank != ROOT_RANK) { - MPI_Send( - &image[index_into_2d(num_rows_per_rank * i, num_cols_per_rank * j, num_cols)], 1, sub_array_t, - dest_rank, 0, MPI_COMM_WORLD - ); - - /* Copy into root rank's rank image buffer */ - } else { - for (int ii = 0; ii < num_rows_per_rank; ++ii) { - for (int jj = 0; jj < num_cols_per_rank; ++jj) { - rank_image[index_into_2d(ii, jj, num_cols_per_rank)] = image[index_into_2d(ii, jj, num_cols)]; - } - } + if (my_rank == ROOT_RANK) { + int dest_rank = 0; + for (int i = 0; i < rank_dims[0]) { + for (int j = 0; j < rank_dims[1]) { + // Send sub array to a non-root rank + if(dest_rank != ROOT_RANK) { + MPI_Send(&image[index_into_2d(num_rows_per_rank * i, num_cols_per_rank * j, num_cols)], 1, sub_array_t, + dest_rank, 0, MPI_COMM_WORLD); + // Copy into root rank's rank image buffer + } else { + for (int ii = 0; ii < num_rows_per_rank; ++ii) { + for (int jj = 0; jj < num_cols_per_rank; ++jj) { + rank_image[index_into_2d(ii, jj, num_cols_per_rank)] = image[index_into_2d(ii, jj, num_cols)]; + } + } + } + dest_rank += 1; + } } - dest_rank += 1; - } - } - } else { - MPI_Recv(rank_image, num_elements_per_rank, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - } + } else { + MPI_Recv(rank_image, num_elements_per_rank, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + } } ``` - :::: The function [`MPI_Dims_create()`](https://www.open-mpi.org/doc/v4.1/man3/MPI_Dims_create.3.php) is a useful utility function in MPI which is used to determine the dimensions of a Cartesian grid of ranks. In the above example, it's used to determine the number of rows and columns in each sub-array, given the number of ranks in the row and column directions of the grid of ranks from `MPI_Dims_create()`. -In addition to the code above, you may also want to create a [*virtual Cartesian communicator topology*](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node187.htm#Node187) to reflect the decomposed geometry in the communicator as well, as this give access to a number of other utility functions which makes communicating data easier. +In addition to the code above, you may also want to create a +[*virtual Cartesian communicator topology*](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node187.htm#Node187) to reflect the decomposed geometry in the communicator as well, as this give access to a number of other utility functions which makes communicating data easier. 
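
As a rough sketch of what that looks like (assuming the `my_rank` and `num_ranks` variables from the earlier examples; this is not code from the lesson itself), a 2D Cartesian communicator can be created and queried for neighbouring ranks like so:

```c
// Create a 2D Cartesian communicator matching the decomposition, then ask it
// where this rank sits in the grid and who its neighbours are
// (assumes my_rank and num_ranks have been set up as in the earlier examples)
int dims[2] = {0, 0};
int periods[2] = {0, 0};  // the image is not periodic in either direction
int coords[2];
int neighbour_up, neighbour_down, neighbour_left, neighbour_right;
MPI_Comm cart_comm;

MPI_Dims_create(num_ranks, 2, dims);
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
MPI_Cart_coords(cart_comm, my_rank, 2, coords);

// MPI_Cart_shift returns MPI_PROC_NULL when a rank has no neighbour in a
// direction, which can be passed straight to send/receive calls as a "no-op" partner
MPI_Cart_shift(cart_comm, 0, 1, &neighbour_up, &neighbour_down);
MPI_Cart_shift(cart_comm, 1, 1, &neighbour_left, &neighbour_right);
```

Utility functions like `MPI_Cart_shift()` also make the halo exchange in the next section easier to express, because ranks on the edge of the grid automatically receive `MPI_PROC_NULL` for their missing neighbours.
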

### Halo exchange

In domain decomposition methods, a "halo" refers to a region around the boundary of a sub-domain which contains a copy of the data from neighbouring sub-domains, and which is needed to perform computations that involve data from adjacent sub-domains. The halo region allows neighbouring sub-domains to share the required data efficiently, without more communication than is necessary.

In a grid-based domain decomposition, as in the image processing example, a halo is often one, or more, rows of pixels (or grid cells more generally) that surround a sub-domain's "internal" cells. This is shown in the diagram below. In the diagram, the image has been decomposed across two ranks in one direction (1D decomposition).
Each blue region represents the halo for that rank, which has come from the region the respective arrow is pointing from.

![Depiction of halo exchange for 1D decomposition](fig/halo_example_1d.png)

Halos, naturally, increase the memory overhead of the parallelisation, as you need to allocate additional space in the array or data structures to account for the halo pixels/cells.
For example, if the image in the above diagram was discretised into more sub-domains, there would be halos on both the left and right sides of each interior sub-domain. In image processing, a single strip of pixels is usually enough.
If `num_rows` and `num_cols` are the number of rows and columns of pixels in a sub-domain, then each sub-domain has dimensions `num_rows` and `num_cols + 2`.
If the decomposition was in two dimensions, then it would be `num_rows + 2` and `num_cols + 2`.
In other data structures like graphs or unstructured grids, the halo will be the elements or nodes surrounding the sub-domain.
@@ -294,44 +282,58 @@ In 1D domain decomposition, this is a helpful function to use as each rank will ```c int MPI_Sendrecv( - void *sendbuf, /* The data to be sent to `dest` */ - int sendcount, /* The number of elements of data to send to `dest` */ - MPI_Datatype sendtype, /* The data type of the data being sent */ - int dest, /* The rank where data is being sent to */ - int sendtag, /* The send tag */ - void *recvbuf, /* The buffer into which received data will be put into from `source` */ - int recvcount, /* The number of elements of data to receive from `source` */ - MPI_Datatype recvtype, /* The data type of the data being received */ - int source, /* The rank where data is coming from */ - int recvtag, /* The receive tag */ - MPI_Comm comm, /* The communicator containing the ranks */ - MPI_Status *status /* The status for the receive operation */ + void *sendbuf, + int sendcount, + MPI_Datatype sendtype, + int dest, + int sendtag, + void *recvbuf, + int recvcount, + MPI_Datatype recvtype, + int source, + int recvtag, + MPI_Comm comm, + MPI_Status *status ); ``` +| | | +| --- | --- | +| `*sendbuf`: | The data to be sent to `dest` | +| `sendcount`: | The number of elements of data to be sent to `dest` | +| `sendtype`: | The data type of the data to be sent to `dest` | +| `dest`: | The rank where data is being sent to | +| `sendtag`: | The communication tag for the send | +| `*recvbuf`: | A buffer for data being received | +| `recvcount`: | The number of elements of data to receive | +| `recvtype`: | The data type of the data being received | +| `source`: | The rank where data is coming from | +| `recvtag`: | The communication tag for the receive | +| `comm`: | The communicator | +| `*status`: | The status handle for the receive | :::: ```c -/* Function to convert row and col coordinates into an index for a 1d array */ +// Function to convert row and col coordinates into an index for a 1d array int index_into_2d(int row, int col, int num_cols) { return row * num_cols + col; } -/* `rank_image` is actually a little bigger, as we need two extra rows for a halo region for the top - and bottom of the row sub-domain */ +// `rank_image` is actually a little bigger, as we need two extra rows for a halo region for the top +// and bottom of the row sub-domain double *rank_image = malloc((num_rows + 2) * num_cols * sizeof(double)); -/* MPI_Sendrecv is designed for "chain" communications, so we need to figure out the next - and previous rank. We use `MPI_PROC_NULL` (a special constant) to tell MPI that we don't - have a partner to communicate to/receive from */ +// MPI_Sendrecv is designed for "chain" communications, so we need to figure out the next +// and previous rank. We use `MPI_PROC_NULL` (a special constant) to tell MPI that we don't +// have a partner to communicate to/receive from int prev_rank = my_rank - 1 < 0 ? MPI_PROC_NULL : my_rank - 1; int next_rank = my_rank + 1 > num_ranks - 1 ? 
MPI_PROC_NULL : my_rank + 1; -/* Send the top row of the image to the bottom row of the previous rank, and receive - the top row from the next rank */ +// Send the top row of the image to the bottom row of the previous rank, and receive +// the top row from the next rank MPI_Sendrecv(&rank_image[index_into_2d(0, 1, num_cols)], num_rows, MPI_DOUBLE, prev_rank, 0, &rank_image[index_into_2d(num_rows - 1, 1, num_cols)], num_rows, MPI_DOUBLE, next_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); -/* Send the bottom row into top row of the next rank, and the reverse from the previous rank */ +// Send the bottom row into top row of the next rank, and the reverse from the previous rank MPI_Sendrecv(&rank_image[index_into_2d(num_rows - 2, 1, num_cols)], num_rows, MPI_DOUBLE, next_rank, 0, &rank_image[index_into_2d(0, 1, num_cols)], num_rows, MPI_DOUBLE, prev_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); diff --git a/high_performance_computing/hpc_mpi/11_advanced_communication.md b/high_performance_computing/hpc_mpi/11_advanced_communication.md new file mode 100644 index 00000000..fb5533b4 --- /dev/null +++ b/high_performance_computing/hpc_mpi/11_advanced_communication.md @@ -0,0 +1,586 @@ +--- +name: Advanced Communication Techniques +dependsOn: [high_performance_computing.hpc_mpi.10_communication_patterns] +tags: [mpi] +attribution: + - citation: > + "Introduction to the Message Passing Interface" course by the Southampton RSG + url: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/ + image: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/assets/img/home-logo.png + license: CC-BY-4.0 +learningOutcomes: + - Understand the problems of non-contiguous memory in MPI. + - Know how to define and use derived datatypes. +--- + +In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types that allow us to handle other complex data structures efficiently. In this episode, we will see how to create structure derived types. Additionally, we will also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous into a single contiguous buffer, when other methods of communication are too complicated or inefficient. + + +## Structures in MPI + +Structures, commonly known as structs, are custom datatypes which contain multiple variables of (usually) different +types. Some common use cases of structs, in scientific code, include grouping together constants or global variables, or +they are used to represent a physical thing, such as a particle, or something more abstract like a cell on a simulation +grid. When we use structs, we can write clearer, more concise and better structured code. + +To communicate a struct, we need to define a derived datatype which tells MPI about the layout of the struct in memory. 
+Instead of `MPI_Type_create_vector()`, for a struct, we use
+`MPI_Type_create_struct()`,
+
+```c
+int MPI_Type_create_struct(
+    int count,
+    int *array_of_blocklengths,
+    MPI_Aint *array_of_displacements,
+    MPI_Datatype *array_of_types,
+    MPI_Datatype *newtype
+);
+```
+
+| | |
+| --- | --- |
+| `count`: | The number of fields in the struct |
+| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send()` |
+| `*array_of_displacements`: | The relative positions of each field in bytes |
+| `*array_of_types`: | The MPI type of each field |
+| `*newtype`: | The newly created data type for the struct |
+
+The main difference between vector and struct derived types is that the arguments for structs expect arrays, since structs are made up of multiple variables. Most of these arguments are straightforward, given what we've just seen for defining vectors. But `array_of_displacements` is new and unique.
+
+When a struct is created, it occupies a single contiguous block of memory. But there is a catch. For performance reasons, compilers insert arbitrary "padding" between each member. This padding, known as
+[data structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment), optimises both the layout of the memory and the access of it. As a result, the memory layout of a struct may look like this instead:
+
+![Memory layout for a struct](fig/struct_memory_layout.png)
+
+Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about is not contiguous any more. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practice, it serves a similar purpose to the stride in vectors.
+
+To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists.
+To do this, we can use the function `MPI_Get_address()`,
+
+```c
+int MPI_Get_address(
+    const void *location,
+    MPI_Aint *address
+);
+```
+
+| | |
+| --- | --- |
+| `*location`: | A pointer to the variable we want the address of |
+| `*address`: | The address of the variable, as an MPI_Aint (address integer) |
+
+In the following example, we use `MPI_Type_create_struct()` and `MPI_Get_address()` to create a derived type for a struct with two members,
+
+```c
+// Define and initialise a struct, named foo, with an int and a double
+struct MyStruct {
+    int id;
+    double value;
+} foo = {.id = 0, .value = 3.14159};
+
+// Create arrays to describe the length of each member and their type
+int count = 2;
+int block_lengths[2] = {1, 1};
+MPI_Datatype block_types[2] = {MPI_INT, MPI_DOUBLE};
+
+// Now we calculate the displacement of each member, which are stored in an
+// MPI_Aint designed for storing memory addresses
+MPI_Aint base_address;
+MPI_Aint block_offsets[2];
+
+MPI_Get_address(&foo, &base_address); // First of all, we find the address of the start of the struct
+MPI_Get_address(&foo.id, &block_offsets[0]); // Now the address of the first member "id"
+MPI_Get_address(&foo.value, &block_offsets[1]); // And the second member "value"
+
+// Calculate the offsets, by subtracting the address of each field from the
+// base address of the struct
+for (int i = 0; i < 2; ++i) {
+    // MPI_Aint_diff is a macro to calculate the difference between two
+    // MPI_Aints and is a replacement for:
+    // (MPI_Aint) ((char *) block_offsets[i] - (char *) base_address)
+    block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address);
+}
+
+// We can finally create our struct data type
+MPI_Datatype struct_type;
+MPI_Type_create_struct(count, block_lengths, block_offsets, block_types, &struct_type);
+MPI_Type_commit(&struct_type);
+
+// Another difference between vector and struct derived types is that in
+// MPI_Recv, we use the struct type. We have to do this because we aren't
+// receiving a contiguous block of a single type of data. By using the type, we
+// tell MPI_Recv how to understand the mix of data types and padding and how to
+// assign those back to recv_struct
+if (my_rank == 0) {
+    MPI_Send(&foo, 1, struct_type, 1, 0, MPI_COMM_WORLD);
+} else {
+    struct MyStruct recv_struct;
+    MPI_Recv(&recv_struct, 1, struct_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+}
+
+// Remember to free the derived type
+MPI_Type_free(&struct_type);
+```
+
+:::::challenge{id=sending-a-struct, title="Sending a Struct"}
+By using a derived data type, write a program to send the following struct `struct Node node` from one rank to another:
+
+```c
+struct Node {
+    int id;
+    char name[16];
+    double temperature;
+};
+
+struct Node node = { .id = 0, .name = "Dale Cooper", .temperature = 42};
+```
+
+You may wish to use [this skeleton code](code/solutions/skeleton-example.c) as your starting point.
+
+::::solution
+Your solution should look something like the code block below. When sending a *static* array (`name[16]`), we have to use a count of 16 in the `block_lengths` array for that member.
+
+```c
+#include <mpi.h>
+#include <stdio.h>
+
+struct Node {
+    int id;
+    char name[16];
+    double temperature;
+};
+
+int main(int argc, char **argv)
+{
+    int my_rank;
+    int num_ranks;
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
+
+    if (num_ranks != 2) {
+        if (my_rank == 0) {
+            printf("This example only works with 2 ranks\n");
+        }
+        MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+
+    struct Node node = {.id = 0, .name = "Dale Cooper", .temperature = 42};
+
+    int block_lengths[3] = {1, 16, 1};
+    MPI_Datatype block_types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE};
+
+    MPI_Aint base_address;
+    MPI_Aint block_offsets[3];
+    MPI_Get_address(&node, &base_address);
+    MPI_Get_address(&node.id, &block_offsets[0]);
+    MPI_Get_address(&node.name, &block_offsets[1]);
+    MPI_Get_address(&node.temperature, &block_offsets[2]);
+    for (int i = 0; i < 3; ++i) {
+        block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address);
+    }
+
+    MPI_Datatype node_struct;
+    MPI_Type_create_struct(3, block_lengths, block_offsets, block_types, &node_struct);
+    MPI_Type_commit(&node_struct);
+
+    if (my_rank == 0) {
+        MPI_Send(&node, 1, node_struct, 1, 0, MPI_COMM_WORLD);
+    } else {
+        struct Node recv_node;
+        MPI_Recv(&recv_node, 1, node_struct, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+        printf("Received node: id = %d name = %s temperature %f\n", recv_node.id, recv_node.name,
+               recv_node.temperature);
+    }
+
+    MPI_Type_free(&node_struct);
+
+    return MPI_Finalize();
+}
+```
+::::
+:::::
+
+:::::challenge{id=what-if-pointers, title="What If I Have a Pointer in My Struct?"}
+Suppose we have the following struct with a pointer named `position` and some other fields:
+
+```c
+struct Grid {
+    double *position;
+    int num_cells;
+};
+grid.position = malloc(3 * sizeof(double));
+```
+
+If we use `malloc()` to allocate memory for `position`, how would we send the data in the struct, and the memory we allocated, from one rank to another? If you are unsure, try writing a short program to create a derived type for the struct.
+
+::::solution
+The short answer is that we can't do it using a derived type, and will have to *manually* communicate the data separately. The reason why we can't use a derived type is because the address of `*position` is the address of the pointer. The offset between `num_cells` and `*position` is the size of the pointer and whatever padding the compiler adds. It is not the data which `position` points to. The memory we allocated for `*position` is somewhere else in memory, as shown in the diagram below, and is non-contiguous with respect to the fields in the struct.
+
+![Memory layout for a struct with a pointer](fig/struct_with_pointer.png)
+::::
+:::::
+
+## A different way to calculate displacements
+
+There are other ways to calculate the displacement besides using what MPI provides for us.
+Another common way is to use the `offsetof()` macro, which is part of `<stddef.h>`. `offsetof()` accepts two arguments, the first being the struct type and the second being the member to calculate the offset for.
+
+```c
+#include <stddef.h>
+MPI_Aint displacements[2];
+displacements[0] = (MPI_Aint) offsetof(struct MyStruct, id); // The cast to MPI_Aint is for extra safety
+displacements[1] = (MPI_Aint) offsetof(struct MyStruct, value);
+```
+
+This method and the one shown in the previous examples both return the same displacement values.
+It's mostly a personal choice which one you use.
+Some people prefer the "safety" of using `MPI_Get_address()` whilst others prefer to write more concise code with `offsetof()`.
+Of course, if you're a Fortran programmer then you can't use the macro!
+
+## Complex non-contiguous and heterogeneous data
+
+The previous two sections covered how to communicate complex but structured data between ranks using derived datatypes.
+However, there are *always* some edge cases which don't fit into a derived type. For example, in the last exercise
+we saw that pointers and derived types don't mix well. Furthermore, we can sometimes also reach performance
+bottlenecks when working with heterogeneous data which doesn't fit, or doesn't make sense to be, in a derived type, as
+each data type needs to be communicated in separate communication calls. This can be especially bad if blocking
+communication is used! For edge cases like this, we can use the `MPI_Pack()` and `MPI_Unpack()` functions to
+do things ourselves.
+
+Both `MPI_Pack()` and `MPI_Unpack()` are methods for manually arranging, packing and unpacking data into a contiguous
+buffer, for cases where regular communication methods and derived types don't work well or efficiently. They can also be
+used to create self-documenting messages, where the packed data contains additional elements which describe the size,
+structure and contents of the data. But we have to be careful, as using packed buffers comes with additional overhead,
+in the form of increased memory usage and potentially more communication overhead, as packing and unpacking data is not
+free.
+
+When we use `MPI_Pack()`, we take non-contiguous data (sometimes of different datatypes) and "pack" it into a
+contiguous memory buffer. The diagram below shows how two (non-contiguous) chunks of data may be packed into a contiguous
+array using `MPI_Pack()`.
+
+![Layout of packed memory](fig/packed_buffer_layout.png)
+
+The coloured boxes in both memory representations (memory and packed) are the same chunks of data. The green boxes
+containing only a single number are used to document the number of elements in the block of elements they are adjacent
+to, in the contiguous buffer. This is optional to do, but is generally good practice to include to create a
+self-documenting message. From the diagram we can see that we have "packed" non-contiguous blocks of memory into a
+single contiguous block. We can do this using `MPI_Pack()`. To reverse this action, and "unpack" the buffer, we use
+`MPI_Unpack()`. As you might expect, `MPI_Unpack()` takes a buffer, created by `MPI_Pack()`, and unpacks the data back
+into various memory addresses.
+
+To pack data into a contiguous buffer, we have to pack each block of data, one by one, into the contiguous buffer using
+the `MPI_Pack()` function,
+
+```c
+int MPI_Pack(
+    const void *inbuf,
+    int incount,
+    MPI_Datatype datatype,
+    void *outbuf,
+    int outsize,
+    int *position,
+    MPI_Comm comm
+);
+```
+
+| | |
+| --- | --- |
+| `*inbuf`: | The data to pack into the buffer |
+| `incount`: | The number of elements to pack |
+| `datatype`: | The data type of the data to pack |
+| `*outbuf`: | The output buffer of contiguous data |
+| `outsize`: | The size of the output buffer, in bytes |
+| `*position`: | A counter for how far into the contiguous buffer to write (records the position, in bytes) |
+| `comm`: | The communicator |
+
+In the above, `inbuf` is the data we want to pack into a contiguous buffer and `incount` and `datatype` define the
+number of elements in and the datatype of `inbuf`.
+The parameter `outbuf` is the contiguous buffer the data is packed into, with `outsize` being the total size of the
+buffer in *bytes*. The `position` argument is used to keep track of the current position, in bytes, where data is being
+packed into `outbuf`.
+
+Uniquely, `MPI_Pack()` and `MPI_Unpack()` measure the size of the contiguous buffer, `outbuf`, in bytes rather than
+in number of elements. Given that `MPI_Pack()` is all about manually arranging data, we also have to manage the
+allocation of memory for `outbuf`. But how do we allocate memory for it, and how much should we allocate? Allocation is
+done by using `malloc()`. Since `MPI_Pack()` works with `outbuf` in terms of bytes, the convention is to declare
+`outbuf` as a `char *`. The amount of memory to allocate is simply the amount of space, in bytes, required to store all
+of the data we want to pack into it, just like how we would normally use `malloc()` to create an array. If we had
+an integer array and a floating point array which we wanted to pack into the buffer, then the size required is easy to
+calculate,
+
+```c
+// The total buffer size is the sum of the bytes required for the int and float array
+int size_int_array = num_int_elements * sizeof(int);
+int size_float_array = num_float_elements * sizeof(float);
+int buffer_size = size_int_array + size_float_array;
+// The buffer is a char *, but could also be cast as void * if you prefer
+char *buffer = malloc(buffer_size * sizeof(char)); // a char is 1 byte, so sizeof(char) is optional
+```
+
+If we are also working with derived types, such as vectors or structs, then we need to find the size of those types. By
+far the easiest way to handle these is to use `MPI_Pack_size()`, which supports derived datatypes through the
+`MPI_Datatype`,
+
+```c
+int MPI_Pack_size(
+    int incount,
+    MPI_Datatype datatype,
+    MPI_Comm comm,
+    int *size
+);
+```
+
+| | |
+| --- | --- |
+| `incount`: | The number of data elements |
+| `datatype`: | The data type of the data |
+| `comm`: | The communicator |
+| `*size`: | The calculated upper size limit for the buffer, in bytes |
+
+`MPI_Pack_size()` is a helper function to calculate the *upper bound* of memory required. It is, in general, preferable
+to calculate the buffer size using this function, as it takes into account any implementation-specific MPI details and
+thus is more portable between implementations and systems. If we wanted to calculate the memory required for three
+elements of some derived struct type and a `double` array, we would do the following,
+
+```c
+int struct_array_size, double_array_size;
+MPI_Pack_size(3, STRUCT_DERIVED_TYPE, MPI_COMM_WORLD, &struct_array_size);
+MPI_Pack_size(50, MPI_DOUBLE, MPI_COMM_WORLD, &double_array_size);
+int buffer_size = struct_array_size + double_array_size;
+```
+
+When a rank has received a contiguous buffer, it has to be unpacked into its constituent parts, one by one, using
+`MPI_Unpack()`,
+
+```c
+int MPI_Unpack(
+    void *inbuf,
+    int insize,
+    int *position,
+    void *outbuf,
+    int outcount,
+    MPI_Datatype datatype,
+    MPI_Comm comm
+);
+```
+
+| | |
+| --- | --- |
+| `*inbuf`: | The contiguous buffer to unpack |
+| `insize`: | The total size of the buffer, in bytes |
+| `*position`: | The position, in bytes, from which to start unpacking |
+| `*outbuf`: | An array, or variable, to unpack data into -- this is the output |
+| `outcount`: | The number of elements of data to unpack |
+| `datatype`: | The data type of elements to unpack |
+| `comm`: | The communicator |
+
+The arguments for this function are essentially the reverse of `MPI_Pack()`. Instead of being the buffer to pack into,
+`inbuf` is now the packed buffer, and `position` is the position, in bytes, in the buffer to start unpacking from.
+`outbuf` is then the variable we want to unpack into, and `outcount` is the number of elements of `datatype` to unpack.
+
+In the example below, `MPI_Pack()`, `MPI_Pack_size()` and `MPI_Unpack()` are used to communicate a (non-contiguous)
+3 x 3 matrix.
+
+```c
+// Allocate and initialise a (non-contiguous) 2D matrix that we will pack into
+// a buffer
+int num_rows = 3, num_cols = 3;
+int **matrix = malloc(num_rows * sizeof(int *));
+for (int i = 0; i < num_rows; ++i) {
+    matrix[i] = malloc(num_cols * sizeof(int));
+    for (int j = 0; j < num_cols; ++j) {
+        matrix[i][j] = num_cols * i + j;
+    }
+}
+
+// Determine the upper limit for the amount of memory the buffer requires. Since
+// this is a simple situation, we could probably have done this manually using
+// `num_rows * num_cols * sizeof(int)`. The size `pack_buffer_size` is returned in
+// bytes
+int pack_buffer_size;
+MPI_Pack_size(num_rows * num_cols, MPI_INT, MPI_COMM_WORLD, &pack_buffer_size);
+
+if (my_rank == 0) {
+    // Create the pack buffer and pack each row of data into the buffer
+    // one by one
+    int position = 0;
+    char *packed_data = malloc(pack_buffer_size);
+    for (int i = 0; i < num_rows; ++i) {
+        MPI_Pack(matrix[i], num_cols, MPI_INT, packed_data, pack_buffer_size, &position, MPI_COMM_WORLD);
+    }
+
+    // Send the packed data to rank 1
+    MPI_Send(packed_data, pack_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
+} else {
+    // Create a receive buffer and get the packed buffer from rank 0
+    char *received_data = malloc(pack_buffer_size);
+    MPI_Recv(received_data, pack_buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+
+    // Allocate a matrix to put the receive buffer into -- this is for demonstration purposes
+    int **my_matrix = malloc(num_rows * sizeof(int *));
+    for (int i = 0; i < num_rows; ++i) {
+        my_matrix[i] = malloc(num_cols * sizeof(int));
+    }
+
+    // Unpack the received data row by row into my_matrix
+    int position = 0;
+    for (int i = 0; i < num_rows; ++i) {
+        MPI_Unpack(received_data, pack_buffer_size, &position, my_matrix[i], num_cols, MPI_INT, MPI_COMM_WORLD);
+    }
+}
+```
+
+::::callout
+
+## Blocking or non-blocking?
+
+The process of packing data into a contiguous buffer does not happen asynchronously.
+The same goes for unpacking data. But this doesn't restrict the packed data to being sent synchronously.
+The packed data can be communicated using any communication function, just like the previous derived types.
+It works just as well to communicate the buffer using non-blocking methods as it does using blocking methods.
+::::
+
+::::callout
+
+## What if the other rank doesn't know the size of the buffer?
+
+In some cases, the receiving rank may not know the size of the buffer used in `MPI_Pack()`.
+This could happen if a message is sent and received in different functions, if some ranks have different branches through the program or if communication happens in a dynamic or non-sequential way.
+
+In these situations, we can use `MPI_Probe()` to detect a message being sent and `MPI_Get_count()` to get the number of elements in the message.
+
+```c
+// First probe for a message, to get the status of it
+MPI_Status status;
+MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
+// Using MPI_Get_count we can get the number of elements of a particular data type
+int buffer_size;
+MPI_Get_count(&status, MPI_PACKED, &buffer_size);
+// MPI_PACKED represents an element of a "byte stream." So, buffer_size is the size of the buffer to allocate
+char *buffer = malloc(buffer_size);
+```
+::::
+
+:::::challenge{id=heterogeneous-data, title="Sending Heterogeneous Data in a Single Communication"}
+Suppose we have the two arrays below, where one contains integer data and the other floating point data.
+Normally we would use multiple communication calls to send each type of data individually, for a known number of elements.
+For this exercise, communicate both arrays using a packed memory buffer.
+
+```c
+int int_data_count = 5;
+int float_data_count = 10;
+
+int *int_data = malloc(int_data_count * sizeof(int));
+float *float_data = malloc(float_data_count * sizeof(float));
+
+// Initialise the arrays with some values
+for (int i = 0; i < int_data_count; ++i) {
+    int_data[i] = i + 1;
+}
+for (int i = 0; i < float_data_count; ++i) {
+    float_data[i] = 3.14159 * (i + 1);
+}
+```
+
+Since the arrays are dynamically allocated, in rank 0, you should also pack the number of elements in each array.
+Rank 1 may also not know the size of the buffer. How would you deal with that?
+
+You can use this [skeleton code](code/solutions/08-pack-skeleton.c) to begin with.
+
+::::solution
+The additional restrictions for rank 1 not knowing the size of the arrays or packed buffer add some complexity to receiving the packed buffer from rank 0.
+
+```c
+#include <mpi.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv)
+{
+    int my_rank;
+    int num_ranks;
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
+
+    if (num_ranks != 2) {
+        if (my_rank == 0) {
+            printf("This example only works with 2 ranks\n");
+        }
+        MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+
+    if (my_rank == 0) {
+        int int_data_count = 5, float_data_count = 10;
+        int *int_data = malloc(int_data_count * sizeof(int));
+        float *float_data = malloc(float_data_count * sizeof(float));
+        for (int i = 0; i < int_data_count; ++i) {
+            int_data[i] = i + 1;
+        }
+        for (int i = 0; i < float_data_count; ++i) {
+            float_data[i] = 3.14159 * (i + 1);
+        }
+
+        // use MPI_Pack_size to determine how big the packed buffer needs to be
+        int buffer_size_count, buffer_size_int, buffer_size_float;
+        MPI_Pack_size(2, MPI_INT, MPI_COMM_WORLD, &buffer_size_count); // 2 * INT because we will have 2 counts
+        MPI_Pack_size(int_data_count, MPI_INT, MPI_COMM_WORLD, &buffer_size_int);
+        MPI_Pack_size(float_data_count, MPI_FLOAT, MPI_COMM_WORLD, &buffer_size_float);
+        int total_buffer_size = buffer_size_int + buffer_size_float + buffer_size_count;
+
+        int position = 0;
+        char *buffer = malloc(total_buffer_size);
+
+        // Pack the data size, followed by the actual data
+        MPI_Pack(&int_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+        MPI_Pack(int_data, int_data_count, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+        MPI_Pack(&float_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+        MPI_Pack(float_data, float_data_count, MPI_FLOAT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+
+        // buffer is sent in one communication using MPI_PACKED
+        MPI_Send(buffer, total_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
+
+        free(buffer);
+        free(int_data);
+        free(float_data);
+    } else {
+        int buffer_size;
+        MPI_Status status;
+        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
+        MPI_Get_count(&status, MPI_PACKED, &buffer_size);
+
+        char *buffer = malloc(buffer_size);
+        MPI_Recv(buffer, buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
+
+        int position = 0;
+        int int_data_count, float_data_count;
+
+        // Unpack an integer which defines the size of the integer array,
+        // then allocate space for and unpack the actual array
+        MPI_Unpack(buffer, buffer_size, &position, &int_data_count, 1, MPI_INT, MPI_COMM_WORLD);
+        int *int_data = malloc(int_data_count * sizeof(int));
+        MPI_Unpack(buffer, buffer_size, &position, int_data, int_data_count, MPI_INT, MPI_COMM_WORLD);
+
+        MPI_Unpack(buffer, buffer_size, &position, &float_data_count, 1, MPI_INT, MPI_COMM_WORLD);
+        float *float_data = malloc(float_data_count * sizeof(float));
+        MPI_Unpack(buffer, buffer_size, &position, float_data, float_data_count, MPI_FLOAT, MPI_COMM_WORLD);
+
+        printf("int data: [");
+        for (int i = 0; i < int_data_count; ++i) {
+            printf(" %d", int_data[i]);
+        }
+        printf(" ]\n");
+
+        printf("float data: [");
+        for (int i = 0; i < float_data_count; ++i) {
+            printf(" %f", float_data[i]);
+        }
+        printf(" ]\n");
+
+        free(int_data);
+        free(float_data);
+        free(buffer);
+    }
+
+    return MPI_Finalize();
+}
+```
+::::
+:::::
\ No newline at end of file
diff --git a/high_performance_computing/hpc_mpi/code/examples/02-count-primes.c b/high_performance_computing/hpc_mpi/code/examples/02-count-primes.c
index 2266ec7e..ba16634a 100644
--- a/high_performance_computing/hpc_mpi/code/examples/02-count-primes.c
+++ 
b/high_performance_computing/hpc_mpi/code/examples/02-count-primes.c @@ -4,7 +4,8 @@ #define NUM_ITERATIONS 100000 -int main (int argc, char *argv[]) { +int main(int argc, char **argv) +{ int my_rank; int num_ranks; @@ -19,8 +20,9 @@ int main (int argc, char *argv[]) { int rank_end = (my_rank + 1) * iterations_per_rank; // catch cases where the work can't be split evenly - if (rank_end > NUM_ITERATIONS || (my_rank == (num_ranks-1) && rank_end < NUM_ITERATIONS)) + if (rank_end > NUM_ITERATIONS || (my_rank == (num_ranks - 1) && rank_end < NUM_ITERATIONS)) { rank_end = NUM_ITERATIONS; + } // each rank is dealing with a subset of the problem int prime_count = 0; @@ -28,8 +30,9 @@ int main (int argc, char *argv[]) { bool is_prime = true; // 0 and 1 are not prime numbers - if (n == 0 || n == 1) + if (n == 0 || n == 1) { is_prime = false; + } // if we can only divide n by i, then n is not prime for (int i = 2; i <= n / 2; ++i) { @@ -39,8 +42,9 @@ int main (int argc, char *argv[]) { } } - if (is_prime) + if (is_prime) { prime_count++; + } } printf("Rank %d - count of primes between %d-%d: %d\n", my_rank, rank_start, rank_end, prime_count); diff --git a/high_performance_computing/hpc_mpi/index.md b/high_performance_computing/hpc_mpi/index.md index 4623b443..089ce466 100644 --- a/high_performance_computing/hpc_mpi/index.md +++ b/high_performance_computing/hpc_mpi/index.md @@ -10,11 +10,12 @@ files: [ 04_point_to_point_communication.md, 05_collective_communication.md, 06_non_blocking_communication.md, - 07_advanced_communication.md, - 08_communication_patterns.md, - 09_porting_serial_to_mpi.md, - 10_optimising_mpi.md, + 07-derived-data-types.md, + 08_porting_serial_to_mpi.md, + 09_optimising_mpi.md, + 10_communication_patterns.md, + 11_advanced_communication.md, ] summary: | - This session introduces the Message Passing Interface, and shows how to use it to parallelise your code. + This session introduces the Message Passing Interface (MPI), teaching trainees how to use the MPI API for compiling and running applications across multiple processes. It covers point-to-point and collective communication, non-blocking functions, and handling non-contiguous memory with derived data types. Attendees will practice converting serial code to parallel, understanding scaling performance, and using profilers to optimise MPI applications. --- diff --git a/high_performance_computing/hpc_openmp/02_intro_openmp.md b/high_performance_computing/hpc_openmp/02_intro_openmp.md index b7a2eb0f..b1ceb739 100644 --- a/high_performance_computing/hpc_openmp/02_intro_openmp.md +++ b/high_performance_computing/hpc_openmp/02_intro_openmp.md @@ -34,7 +34,7 @@ In simpler terms, when your program finds a special "parallel" section, it's lik OpenMP consists of three key components that enable parallel programming using threads: - **Compiler Directives:** OpenMP makes use of special code markers known as *compiler directives* to indicate to the compiler when and how to parallelise various sections of code. These directives are prefixed with `#pragma omp`, and mark sections of code to be executed concurrently by multiple threads. -- **Runtime Library Routines:** These are predefined functions provided by the OpenMP runtime library. They allow you to control the behavior of threads, manage synchronization, and handle parallel execution. For example, we can use the function `omp_get_thread_num()` to obtain the unique identifier of the calling thread. 
+- **Runtime Library Routines:** These are predefined functions provided by the OpenMP runtime library. They allow you to control the behavior of threads, manage synchronisation, and handle parallel execution. For example, we can use the function `omp_get_thread_num()` to obtain the unique identifier of the calling thread. - **Environment Variables:** These are settings that can be adjusted to influence the behavior of the OpenMP runtime. They provide a way to fine-tune the parallel execution of your program. Setting OpenMP environment variables is typically done similarly to other environment variables for your system. For instance, you can adjust the number of threads to use for a program you are about to execute by specifying the value in the `OMP_NUM_THREADS` environment variable. Since parallelisation using OpenMP is accomplished by adding compiler directives to existing code structures, it's relatively easy to get started using it. @@ -44,7 +44,7 @@ However, it's worth noting that other options exist in different languages (e.g. ## Running a Code with OpenMP -Before we delve into specifics of writing code that uses OpenMP, let's first look at how we compile and run an example "Hello World!" OpenMP program that prints this to the console. +Before we get into into specifics of writing code that uses OpenMP, let's first look at how we compile and run an example "Hello World!" OpenMP program that prints this to the console. Wherever you may eventually run your OpenMP code - locally, on another machine, or on an HPC infrastructure - it's a good practice to develop OpenMP programs on your local machine first. This has the advantage of allowing you to more easily configure your development environment to suit your needs, particularly for making use of tools like Integrated Development Environments (IDEs), such as Microsoft VSCode. diff --git a/high_performance_computing/hpc_openmp/03_parallel_api.md b/high_performance_computing/hpc_openmp/03_parallel_api.md index 937f1ed0..b5af65f5 100644 --- a/high_performance_computing/hpc_openmp/03_parallel_api.md +++ b/high_performance_computing/hpc_openmp/03_parallel_api.md @@ -6,7 +6,7 @@ learningOutcomes: - Describe the functionality of OpenMP pragma directives. - Explain the concept of a parallel region and its significance in OpenMP. - Understand the scope of variables in OpenMP parallel regions. - - Implement parallelization in a program using OpenMP directives. + - Implement parallelisation in a program using OpenMP directives. - Use OpenMP library functions to manage threads and thread-specific information. - Evaluate different schedulers available in OpenMP for loop iterations. - Assess the impact of scheduling behaviors on program execution. @@ -36,7 +36,7 @@ OpenMP offers a number of directives for parallelisation, although the two we'll - The `#pragma omp parallel` directive specifies a block of code for concurrent execution. - The `#pragma omp for` directive parallelizes loops by distributing loop iterations among threads. -### Our First Parallelisation +## Our First Parallelisation For example, amending our previous example, in the following we specify a specific block of code to run parallel threads, @@ -68,7 +68,7 @@ since the order and manner in which these threads (and their `printf` statements So in summary, simply by adding this directive we have accomplished a basic form of parallelisation. -### What about Variables? +## What about Variables? So how do we make use of variables across, and within, our parallel threads? 
Of particular importance in parallel programs is how memory is managed and how and where variables can be manipulated, @@ -191,7 +191,7 @@ and whether they're private or shared: So here, we ensure that each thread has its own private copy of these variables, which is now thread safe. -### Parallel `for` Loops +## Parallel `for` Loops A typical program uses `for` loops to perform many iterations of the same task, and fortunately OpenMP gives us a straightforward way to parallelise them, @@ -210,8 +210,6 @@ which builds on the use of directives we've learned so far. thread_id = omp_get_thread_num(); printf("Hello from iteration %i from thread %d out of %d\n", i, thread_id, num_threads); } - - printf("%d",i); } ``` @@ -249,7 +247,7 @@ for (int 1 = 1; 1 <=10; i++) ``` In the first case, `#pragma omp parallel` spawns a group of threads, whilst `#pragma omp for` divides the loop iterations between them. -But if you only need to do parallelisation within a single loop, the second case has you covered for convenience. +But if you only need to do parallelisation within a single loop, the second case is more convenient. ::: Note we also explicitly set the number of desired threads to 4, using the OpenMP `omp_set_num_threads()` function, @@ -289,9 +287,9 @@ and prints the values received. What happens? ::: :::: -### Using Schedulers +## Using Schedulers -Whenever we use a parallel for, the iterations have to be split into smaller chunks so each thread has something to do. +Whenever we use a `parallel for`, the iterations have to be split into smaller chunks so each thread has something to do. In most OpenMP implementations, the default behaviour is to split the iterations into equal sized chunks, ```c @@ -326,6 +324,27 @@ for (int i = 0; i < NUM_ITERATIONS; ++i) { | **auto** | The best choice of scheduling is chosen at run time. | - | Useful in all cases, but can introduce additional overheads whilst it decides which scheduler to use. | | **runtime** | Determined at runtime by the `OMP_SCHEDULE` environment variable or `omp_schedule` pragma. | - | - | +:::callout + +## How the `auto` Scheduler Works + +The `auto` scheduler lets the compiler or runtime system automatically decide the best way to distribute work among threads. This is really convenient because +you don’t have to manually pick a scheduling method—the system handles it for you. It’s especially handy if your workload distribution is uncertain or changes a +lot. But keep in mind that how well `auto` works can depend a lot on the compiler. Not all compilers optimize equally well, and there might be a bit of overhead +as the runtime figures out the best scheduling method, which could affect performance in highly optimized code. + +The [OpenMP documentation](https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf) states that with `schedule(auto)`, the scheduling decision is left to the compiler or runtime system. So, how does the compiler make this decision? When using GCC, which is common in many environments including HPC, the `auto` scheduler often maps to `static` scheduling. This means it splits the work into equal chunks ahead of time for simplicity and performance. `static` scheduling is straightforward and has low overhead, which often leads to efficient execution for many applications. + +However, specialised HPC compilers, like those from Intel or IBM, might handle `auto` differently. 
These advanced compilers can dynamically adjust the scheduling +method during runtime, considering things like workload variability and specific hardware characteristics to optimize performance. + +So, when should you use `auto`? It’s great during development for quick performance testing without having to manually adjust scheduling methods. It’s also +useful in environments where the workload changes a lot, letting the runtime adapt the scheduling as needed. While `auto` can make your code simpler, it’s +important to test different schedulers to see which one works best for your specific application. + + +::: + ::::challenge{title="Try Out Different Schedulers"} Try each of the static and dynamic schedulers on the code below, diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index fa13d7ca..7263fde0 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -1,11 +1,11 @@ --- name: Synchronisation and Race Conditions dependsOn: [high_performance_computing.hpc_openmp.03_parallel_api] -tags: [parallelisation,synchronisation] +tags: [parallelisation, synchronisation] learningOutcomes: - - Define thread synchronization and its importance in parallel programming. + - Define thread synchronisation and its importance in parallel programming. - Explain what race conditions are and how they occur in parallel programs. - - Implement thread synchronization mechanisms to prevent race conditions. + - Implement thread synchronisation mechanisms to prevent race conditions. - Modify code to avoid race condition errors. --- @@ -153,7 +153,9 @@ other threads to catch up. There is no way around this synchronisation overhead, barriers or have an uneven amount of work between threads. This overhead increases with the number of threads in use, and becomes even worse when the workload is uneven killing the parallel scalability. -:::callout{title="Blocking thread execution and `nowait`"} +::::callout + +### Blocking thread execution and `nowait` Most parallel constructs in OpenMP will synchronise threads before they exit the parallel region. For example, consider a parallel for loop. If one thread finishes its work before the others, it doesn't leave the parallel region @@ -179,7 +181,7 @@ a `nowait` clause is used with a parallel for. } ``` -::: +:::: ### Synchronisation regions @@ -242,6 +244,12 @@ Create a program that updates a shared counter to track the progress of a parall synchronisation region you can use. Can you think of any potential problems with your implementation, what happens when you use different loop schedulers? You can use the code example below as your starting point. +NB: To compile this you’ll need to add `-lm` to inform the linker to link to the math C library, e.g. + +```bash +gcc counter.c -o counter -fopenmp -lm +``` + ```c #include #include diff --git a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md index b592a873..0befc0f8 100644 --- a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md +++ b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md @@ -9,29 +9,29 @@ learningOutcomes: - Describe hybrid parallelism and its relevance in modern computing. - Identify the advantages and disadvantages of combining OpenMP and MPI for parallel computing tasks. 
- Assess the suitability of hybrid parallelism for specific software applications. - - Compare and contrast the performance of hybrid parallelism with other parallelization approaches. + - Compare and contrast the performance of hybrid parallelism with other parallelisation approaches. --- At this point in the lesson, we've introduced the basics you need to get out there and start writing parallel code using -OpenMP. There is one thing still worth being brought to your attention, and that is *hybrid parallelism*. +OpenMP. There is one thing still worth being brought to your attention, and that is **hybrid parallelism**. :::callout ## The Message Passing Interface (MPI) In this episode, we will assume you have some knowledge about the Message Passing Interface (MPI) and that you have a -basic understand of how to paralleise code using MPI. If you're not sure, you can think of MPI as being like an OpenMP +basic understanding of how to parallelise code using MPI. If you're not sure, you can think of MPI as being like an OpenMP program where everything is in a `pragma omp parallel` directive. ::: ## What is hybrid parallelism? -When we talk about hybrid paralleism, what we're really talking about is writing parallel code using more than one +When we talk about hybrid parallelism, what we're really talking about is writing parallel code using more than one parallelisation paradigm. The reason we want to do this is to take advantage of the strengths of each paradigm to improve the performance, scaling and efficiency of our parallel core. The most common form of hybrid parallelism in research is *MPI+X*. What this means is that an application is *mostly* parallelised using the Message Passing Interface -(MPI), which has been extended using some +X other paradigm. A common +X is OpenMP, creating MPI+OpenMP. +(MPI), which has been extended using some +X other paradigm. A common example of +X is OpenMP, creating MPI+OpenMP. :::callout @@ -113,7 +113,7 @@ Most of this can, however, be mitigated with good documentation and a robust bui ## When do I need to use hybrid parallelism? So, when should we use a hybrid scheme? A hybrid scheme is particularly beneficial in scenarios where you need to -leverage the strength of both the shared- and distributed-memory parallelism paradigms. MPI is used to exploit lots of +leverage the strength of both the shared and distributed-memory parallelism paradigms. MPI is used to exploit lots of resources across nodes on a HPC cluster, whilst OpenMP is used to efficiently (and somewhat easily) parallelise the work each MPI task is required to do. @@ -129,13 +129,13 @@ requirements, or to take a different approach to improve the work balance. To demonstrate how to use MPI+OpenMP, we are going to write a program which computes an approximation for $\pi$ using a [Riemann sum](https://en.wikipedia.org/wiki/Riemann_sum). This is not a great example to extol the virtues of hybrid parallelism, as it is only a small problem. However, it is a simple problem which can be easily extended and -parallelised. Specifically, we will write a program to solve to integral to compute the value of $\pi$, +parallelised. Specifically, we will write a program to solve the integral to compute the value of $\pi$, $$ \int_{0}^{1} \frac{4}{1 + x^{2}} ~ \mathrm{d}x = 4 \tan^{-1}(x) = \pi $$ There are a plethora of methods available to numerically evaluate this integral. To keep the problem simple, we will re-cast the integral into a easier-to-code summation. 
How we got here isn't that important for our purposes, but what we -will be implementing in code is the follow summation, +will be implementing in code is the following summation, $$ \pi = \lim_{n \to \infty} \sum_{i = 0}^{n} \frac{1}{n} ~ \frac{4}{1 + x_{i}^{2}} $$ @@ -185,7 +185,7 @@ int main(void) } ``` -In the above, we are using $N = 10^{10}$ rectangles (using this number of rectangles is overkill, but is used to +In the above, we are using $N = 10^{10}$ rectangles. Although this number of rectangles is overkill, it is used to demonstrate the performance increases from parallelisation. If we save this (as `pi.c`), compile and run we should get output as below, @@ -198,10 +198,10 @@ Total time = 34.826832 seconds You should see that we've compute an accurate approximation of $\pi$, but it also took a very long time at 35 seconds! To speed this up, let's first parallelise this using OpenMP. All we need to do, for this simple application, is to use a -parallel for to split the loop between OpenMP threads as shown below. +`parallel for` to split the loop between OpenMP threads as shown below. ```c -/* Parallelise the loop using a parallel for directive. We will set the sum +/* Parallelise the loop using a `parallel for` directive. We will set the sum variable to be a reduction variable. As it is marked explicitly as a reduction variable, we don't need to worry about any race conditions corrupting the final value of sum */ @@ -235,13 +235,12 @@ implementing MPI. In this example, we can porting an OpenMP code to a hybrid MPI also done this the other way around by porting an MPI code into a hybrid application. Neither *"evolution"* is more common or better than the other, the route each code takes toward becoming hybrid is different. -So, how do we split work using a hybrid approach? One approach for an embarrassingly parallel problem, such as the one -we're working on is to can split the problem size into smaller chunks *across* MPI ranks, and to use OpenMP to -parallelise the work. For example, consider a problem where we have to do a calculation for 1,000,000 input parameters. -If we have four MPI ranks each of which will spawn 10 threads, we could split the work evenly between MPI ranks so each -rank will deal with 250,000 input parameters. We will then use OpenMP threads to do the calculations in parallel. If we -use a sequential scheduler, then each thread will do 25,000 calculations. Or we could use OpenMP's dynamic scheduler to -automatically balance the workload. We have implemented this situation in the code example below. +So, how do we split work using a hybrid approach? For an embarrassingly parallel problem, such as the one we're working on, +we can split the problem size into smaller chunks across MPI ranks and use OpenMP to parallelise the work. For example, consider +a problem where we have to do a calculation for 1,000,000 input parameters. If we have four MPI ranks each of which will spawn 10 threads, +we could split the work evenly between MPI ranks so each rank will deal with 250,000 input parameters. We will then use OpenMP +threads to do the calculations in parallel. If we use a sequential scheduler, then each thread will do 25,000 calculations. Or we +could use OpenMP's dynamic scheduler to automatically balance the workload. We have implemented this situation in the code example below. ```c /* We have an array of input parameters. 
The calculation which uses these parameters @@ -264,9 +263,9 @@ for (int i = rank_lower_limit; i < rank_upper_limit; ++i) { } ``` -:::callout +::::callout -## Still not sure about MPI? +### Still not sure about MPI? If you're still a bit unsure of how MPI is working, you can basically think of it as wrapping large parts of your code in a `pragma omp parallel` region as we saw in an earlier episode. We can re-write the code example above in the @@ -290,7 +289,7 @@ struct input_par_t input_parameters[total_work]; } ``` -::: +:::: In the above example, we have only included the parallel region of code. It is unfortunately not as simple as this, because we have to deal with the additional complexity from using MPI. We need to initialise MPI, as well as communicate @@ -366,7 +365,7 @@ int main(void) So you can see that it's much longer and more complicated; although not much more than a [pure MPI implementation](code/examples/05-pi-mpi.c). To compile our hybrid program, we use the MPI compiler command `mpicc` with -the argument `-fopenmp`. We can then either run our compiled program using `mpirun`. +the argument `-fopenmp`. We can then run our compiled program using `mpirun`. ```bash mpicc -fopenmp 05-pi-omp-mpi.c -o pi.exe diff --git a/high_performance_computing/hpc_openmp/index.md b/high_performance_computing/hpc_openmp/index.md index d76ae4d5..300a409a 100644 --- a/high_performance_computing/hpc_openmp/index.md +++ b/high_performance_computing/hpc_openmp/index.md @@ -11,5 +11,5 @@ files: [ 05_hybrid_parallelism.md ] summary: | - TBA. + This session covers the fundamentals of OpenMP, including its API, compilation, and execution. Participants will learn to parallelise their programs using OpenMP's pragma directives, manage threads, and control variable scoping and loop scheduling. The session also addresses thread synchronization, race conditions, and the use of hybrid parallelism with OpenMP and MPI. --- diff --git a/high_performance_computing/hpc_parallel_intro/01_introduction.md b/high_performance_computing/hpc_parallel_intro/01_introduction.md index ffd4ab4d..09ad5cf0 100644 --- a/high_performance_computing/hpc_parallel_intro/01_introduction.md +++ b/high_performance_computing/hpc_parallel_intro/01_introduction.md @@ -1,7 +1,13 @@ --- name: Introduction to Parallelism dependsOn: [] -tags: [] +tags: [parallelisation, OMP, MPI] +learningOutcomes: + - Understand the basic concepts of parallelization and parallel programming. + - Compare shared memory and distributed memory models. + - Describe different parallel paradigms, including data parallelism and message passing. + - Differentiate between sequential and parallel computing. + - Explain the roles of processes and threads in parallel programming. attribution: - citation: > "Introduction to the Message Passing Interface" course by the Southampton RSG @@ -9,10 +15,10 @@ attribution: image: https://southampton-rsg-training.github.io/dirac-intro-to-mpi/assets/img/home-logo.png license: CC-BY-4.0 --- + Parallel programming has been important to scientific computing for decades as a way to decrease program run times, making more complex analyses possible (e.g. climate modeling, gene sequencing, pharmaceutical development, aircraft design). -During this course you will learn to design parallel algorithms and write parallel programs using the **MPI** library. MPI stands for **Message Passing Interface**, and is a low level, minimal and extremely flexible set of commands for communicating between copies of a program. 
-Before we dive into the details of MPI, let's first familiarize ourselves with key concepts that lay the groundwork for parallel programming. +In this episode, we will cover the foundational concepts of parallelisation. Before we get into the details of parallel programming libraries and techniques, let's first familiarise ourselves with the key ideas that underpin parallel computing. ## What is Parallelisation? @@ -38,6 +44,7 @@ This can allow us to do much more at once, and therefore get results more quickl | --- | --- | | ![Serial Computing](fig/serial2_prog.png) | ![Parallel Computing](fig/parallel_prog.png) | + ::::callout ## Analogy @@ -56,7 +63,7 @@ If we have 2 or more painters for the job, then the tasks can be performed in ** ::::callout -## Key idea +## Key Idea In our analogy, the painters represent CPU cores in the computers. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. @@ -84,12 +91,10 @@ To efficiently utilize multiple CPU cores, we need to understand the concepts of These concepts form the foundation of parallel computing and play a crucial role in achieving optimal parallel execution. To address the challenges that arise when parallelising programs across multiple cores and achieve efficient use of available resources, parallel programming frameworks like MPI and OpenMP (Open Multi-Processing) come into play. -These frameworks provide tools, libraries, and methodologies to handle memory management, workload distribution, communication, and synchronization in parallel environments. +These frameworks provide tools, libraries, and methodologies to handle memory management, workload distribution, communication, and synchronisation in parallel environments. Now, let's take a brief look at these fundamental concepts and explore the differences between MPI and OpenMP, setting the stage for a deeper understanding of MPI in the upcoming episodes. -::::callout - ## Processes A process refers to an individual running instance of a software program. @@ -118,11 +123,10 @@ However, it's important to note that threads within a process are limited to a s While they provide an effective means of utilizing multiple CPU cores on a single machine, they cannot extend beyond the boundaries of that computer. ![Threads](fig/multithreading.svg) -:::: ::::callout -### Analogy +## Analogy Let's go back to our painting 4 walls analogy. Our example painters have two arms, and could potentially paint with both arms at the same time. @@ -132,8 +136,6 @@ The painters’ arms represent a _**“thread”**_ of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously. :::: -::::callout - ## Shared vs Distributed Memory Shared memory refers to a memory model where multiple processors can directly access and modify @@ -144,25 +146,23 @@ Shared memory programming models, like OpenMP, simplify parallel programming by Distributed memory, on the other hand, involves memory resources that are physically separated across different computers or nodes in a network. Each processor has its own private memory, and explicit communication is required to exchange data between processors. -Distributed memory programming models, such as MPI, facilitate communication and synchronization in this memory model. +Distributed memory programming models, such as MPI, facilitate communication and synchronisation in this memory model. 
![Shared Memory and Distributed Memory](fig/memory-pattern.png) ## Differences/Advantages/Disadvantages of Shared and Distributed Memory - **Accessibility:** Shared memory allows direct access to the same memory space by all processors, while distributed memory requires explicit communication for data exchange between processors. -- **Memory Scope:** Shared memory provides a global memory space, enabling easy data sharing and synchronization. +- **Memory Scope:** Shared memory provides a global memory space, enabling easy data sharing and synchronisation. In distributed memory, each processor has its own private memory space, requiring explicit communication for data sharing. - **Memory Consistency:** Shared memory ensures immediate visibility of changes made by one processor to all other processors. - Distributed memory requires explicit communication and synchronization to maintain data consistency across processors. + Distributed memory requires explicit communication and synchronisation to maintain data consistency across processors. - **Scalability:** Shared memory systems are typically limited to a single computer or node, whereas distributed memory systems can scale to larger configurations with multiple computers and nodes. - **Programming Complexity:** Shared memory programming models offer simpler constructs and require less explicit communication compared to distributed memory models. Distributed memory programming involves explicit data communication and synchronization, adding complexity to the programming process. - -:::: - + ::::callout -### Analogy +## Analogy Imagine that all workers have to obtain their paint form a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. @@ -173,6 +173,9 @@ In this scenario, each worker can complete their task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, in different houses in the city, and different cities in the country. We need, however, a communication system in place. Suppose that worker A, for some reason, needs a colour that is only available in the dispenser of worker B, they must then synchronise: worker A must request the paint of worker B and worker B must respond by sending the required colour. +:::: + +::::callout ## Key Idea @@ -220,12 +223,10 @@ for(i=0; i Date: Fri, 9 Aug 2024 14:15:05 +0100 Subject: [PATCH 02/34] Add challenge IDs --- high_performance_computing/hpc_openmp/02_intro_openmp.md | 2 +- high_performance_computing/hpc_openmp/03_parallel_api.md | 8 ++++---- .../hpc_openmp/04_synchronisation.md | 6 +++--- .../hpc_openmp/05_hybrid_parallelism.md | 2 +- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/high_performance_computing/hpc_openmp/02_intro_openmp.md b/high_performance_computing/hpc_openmp/02_intro_openmp.md index b1ceb739..ae367ad1 100644 --- a/high_performance_computing/hpc_openmp/02_intro_openmp.md +++ b/high_performance_computing/hpc_openmp/02_intro_openmp.md @@ -13,7 +13,7 @@ learningOutcomes: OpenMP is an industry-standard API specifically designed for parallel programming in shared memory environments. It supports programming in languages such as C, C++, and Fortran. OpenMP is an open source, industry-wide initiative that benefits from collaboration among hardware and software vendors, governed by the OpenMP Architecture Review Board ([OpenMP ARB](https://www.openmp.org/)). 
-::::challenge{title="An OpenMP Timeline"} +::::challenge{id=timeline, title="An OpenMP Timeline"} If you're interested, there's a [timeline of how OpenMP developed](https://www.openmp.org/uncategorized/openmp-timeline/). It provides an overview of OpenMP's evolution until 2014, with significant advancements diff --git a/high_performance_computing/hpc_openmp/03_parallel_api.md b/high_performance_computing/hpc_openmp/03_parallel_api.md index b5af65f5..4d7ac53b 100644 --- a/high_performance_computing/hpc_openmp/03_parallel_api.md +++ b/high_performance_computing/hpc_openmp/03_parallel_api.md @@ -98,7 +98,7 @@ Hello from thread 3 out of 4 Hello from thread 2 out of 4 ``` -::::challenge{title='OpenMP and C Scoping'} +::::challenge{id=scoping, title='OpenMP and C Scoping'} Try printing out `num_threads` at the end of the program, after the `#pragma` code block, and recompile. What happens? Is this what you expect? :::solution @@ -274,7 +274,7 @@ using OpenMP to parallelise an existing loop is often quite straightforward. However, particularly with more complex programs, there are some aspects and potential pitfalls with OpenMP parallelisation we need to be aware of - such as race conditions - which we'll explore in the next episode. -::::challenge{title="Calling Thread Numbering Functions Elsewhere?"} +::::challenge{id=callingelsewhere, title="Calling Thread Numbering Functions Elsewhere?"} Write, compile and run a simple OpenMP program that calls both `omp_get_num_threads()` and `omp_get_thread_num()` outside of a parallel region, and prints the values received. What happens? @@ -345,7 +345,7 @@ important to test different schedulers to see which one works best for your spec ::: -::::challenge{title="Try Out Different Schedulers"} +::::challenge{id=differentschedulers, title="Try Out Different Schedulers"} Try each of the static and dynamic schedulers on the code below, which uses `sleep` to mimic processing iterations that take increasing amounts of time to complete as the loop increases. @@ -395,7 +395,7 @@ threads that complete need to stop and await a new value to process from a next ::: :::: -::::challenge +::::challenge{id=differentchunksizes, title="Different Chunk Sizes"} With a dynamic scheduler, the default chunk size is 1. What happens if specify a chunk size of 2, i.e. `scheduler(dynamic, 2)`? diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index 7263fde0..0c8edf4a 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -64,7 +64,7 @@ condition in OpenMP. Different threads accessing and modifying the same part of inconsistent memory operations and probably an incorrect result. ::: -::::challenge{title="Identifying race conditions"} +::::challenge{id=identifyraceconditions, title="Identifying race conditions"} Take a look at the following code example. What's the output when you compile and run this program? Where do you think the race condition is? @@ -238,7 +238,7 @@ other threads have finished with it. However in reality we shouldn't write a red clause](https://www.intel.com/content/www/us/en/docs/advisor/user-guide/2023-0/openmp-reduction-operations.html) in the `parallel for` directive, e.g. 
`#pragma omp parallel for reduction(+:value)` -::::challenge{title="Reporting progress"} +::::challenge{id=reportingprogress, title="Reporting progress"} Create a program that updates a shared counter to track the progress of a parallel loop. Think about which type of synchronisation region you can use. Can you think of any potential problems with your implementation, what happens @@ -447,7 +447,7 @@ When comparing critical regions and locks, it is often better to use a critical simplicity of using a critical region. ::: -::::challenge{title="Remove the race condition"} +::::challenge{id=removeracecondition, title="Remove the race condition"} In the following program, an array of values is created and then summed together using a parallel for loop. diff --git a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md index 0befc0f8..80aaae67 100644 --- a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md +++ b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md @@ -431,7 +431,7 @@ delicate balance of balancing overheads associated with thread synchronisation i MPI. As mentioned earlier, a hybrid implementation will typically be slower than a "pure" MPI implementation for example. ::: -::::challenge{title="Optimum combination of threads and ranks for approximating Pi"} +::::challenge{id=optimumcombo, title="Optimum combination of threads and ranks for approximating Pi"} Try various combinations of the number of OpenMP threads and number of MPI processes. For this program, what's faster? Only using [MPI](code/examples/05-pi-mpi.c), only using [OpenMP](code/examples/05-pi-omp.c) or a From e2d9e8f9b1984375bed94b2b823a584d854c7268 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 12:45:21 +0000 Subject: [PATCH 03/34] Episode 2 updates: removed redundant 'c++' mention and added pedantic note about omp.h --- .../hpc_openmp/02_intro_openmp.md | 23 +++++++++++++------ 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/high_performance_computing/hpc_openmp/02_intro_openmp.md b/high_performance_computing/hpc_openmp/02_intro_openmp.md index ae367ad1..33cce4b9 100644 --- a/high_performance_computing/hpc_openmp/02_intro_openmp.md +++ b/high_performance_computing/hpc_openmp/02_intro_openmp.md @@ -11,7 +11,8 @@ learningOutcomes: ## What is OpenMP? -OpenMP is an industry-standard API specifically designed for parallel programming in shared memory environments. It supports programming in languages such as C, C++, and Fortran. OpenMP is an open source, industry-wide initiative that benefits from collaboration among hardware and software vendors, governed by the OpenMP Architecture Review Board ([OpenMP ARB](https://www.openmp.org/)). +OpenMP is an industry-standard API specifically designed for parallel programming in shared memory environments. It supports programming in languages such as C, C++, +and Fortran. OpenMP is an open source, industry-wide initiative that benefits from collaboration among hardware and software vendors, governed by the OpenMP Architecture Review Board ([OpenMP ARB](https://www.openmp.org/)). ::::challenge{id=timeline, title="An OpenMP Timeline"} @@ -35,16 +36,17 @@ OpenMP consists of three key components that enable parallel programming using t - **Compiler Directives:** OpenMP makes use of special code markers known as *compiler directives* to indicate to the compiler when and how to parallelise various sections of code. 
These directives are prefixed with `#pragma omp`, and mark sections of code to be executed concurrently by multiple threads. - **Runtime Library Routines:** These are predefined functions provided by the OpenMP runtime library. They allow you to control the behavior of threads, manage synchronisation, and handle parallel execution. For example, we can use the function `omp_get_thread_num()` to obtain the unique identifier of the calling thread. -- **Environment Variables:** These are settings that can be adjusted to influence the behavior of the OpenMP runtime. They provide a way to fine-tune the parallel execution of your program. Setting OpenMP environment variables is typically done similarly to other environment variables for your system. For instance, you can adjust the number of threads to use for a program you are about to execute by specifying the value in the `OMP_NUM_THREADS` environment variable. +- **Environment Variables:** These are settings that can be adjusted to influence the behavior of the OpenMP runtime. They provide a way to fine-tune the parallel execution of your program. Setting OpenMP environment variables is typically done similarly to other environment variables for your system. +For instance, you can adjust the number of threads to use for a program you are about to execute by specifying the value in the `OMP_NUM_THREADS` environment variable. Since parallelisation using OpenMP is accomplished by adding compiler directives to existing code structures, it's relatively easy to get started using it. This also means it's straightforward to use on existing code, so it can prove a good approach to migrating serial code to parallel. -Since OpenMP support is built into existing compilers, it's also a defacto standard for C parallel programming. -However, it's worth noting that other options exist in different languages (e.g. there are c++many options in C++, the [multiprocessing library](https://docs.python.org/3/library/multiprocessing.html) for Python, [Rayon](https://docs.rs/rayon/latest/rayon/) for Rust). +Since OpenMP support is built into existing compilers, it's also a de facto standard for C parallel programming. +However, it's worth noting that other options exist in different languages (e.g. there are many options in C++, the [multiprocessing library](https://docs.python.org/3/library/multiprocessing.html) for Python, [Rayon](https://docs.rs/rayon/latest/rayon/) for Rust). ## Running a Code with OpenMP -Before we get into into specifics of writing code that uses OpenMP, let's first look at how we compile and run an example "Hello World!" OpenMP program that prints this to the console. +Before we get into specifics of writing code that uses OpenMP, let's first look at how we compile and run an example "Hello World!" OpenMP program that prints this to the console. Wherever you may eventually run your OpenMP code - locally, on another machine, or on an HPC infrastructure - it's a good practice to develop OpenMP programs on your local machine first. This has the advantage of allowing you to more easily configure your development environment to suit your needs, particularly for making use of tools like Integrated Development Environments (IDEs), such as Microsoft VSCode. @@ -63,8 +65,15 @@ int main() { } } ~~~ +:::callout{variant="note"} +In this example, `#include ` is not strictly necessary since the code does +not call OpenMP runtime functions. 
However, it is a good practice to include this header to make it clear that +the program uses OpenMP and to prepare for future use of OpenMP library functions. +::: -You'll likely want to compile it using a standard compiler such as `gcc`, although this may depend on your system. To enable the creation of multi-threaded code based on OpenMP directives, pass the `-fopenmp` flag to the compiler. This flag indicates that you're compiling an OpenMP program: +You'll likely want to compile it using a standard compiler such as `gcc`, although this may depend on your +system. To enable the creation of multi-threaded code based on OpenMP directives, pass the `-fopenmp` flag to the compiler. +This flag indicates that you're compiling an OpenMP program: ~~~bash gcc hello_world_omp.c -o hello_world_omp -fopenmp @@ -107,4 +116,4 @@ If you're looking to develop OpenMP programs in VSCode, here are three configura You may need to adapt the `tasks.json` and `launch.json` depending on your platform (in particular, the `program` field in `launch.json` may need to reference a `hello_world_omp.exe` file if running on Windows, and the location of gcc in the `command` field may be different in `tasks.json`). Once you've compiled `hello_world_omp.c` the first time, then, by selecting VSCode's `Run and Debug` tab on the left, the `C++ OpenMP: current file` configuration should appear in the top left which will set `OMP_NUM_THREADS` before running it. -:::: +:::: \ No newline at end of file From fcaff2a6a99daa8a518752bb96af44b7a3f07a0e Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 12:49:34 +0000 Subject: [PATCH 04/34] Episode 3 updates: improved private variables explanation, added nested loops callout with collapse clause, fixed scheduler terminology, and addressed firstprivate caveat. Also fixed typos. --- .../hpc_openmp/03_parallel_api.md | 108 +++++++++++++++--- 1 file changed, 95 insertions(+), 13 deletions(-) diff --git a/high_performance_computing/hpc_openmp/03_parallel_api.md b/high_performance_computing/hpc_openmp/03_parallel_api.md index 4d7ac53b..df3423c0 100644 --- a/high_performance_computing/hpc_openmp/03_parallel_api.md +++ b/high_performance_computing/hpc_openmp/03_parallel_api.md @@ -73,10 +73,13 @@ So in summary, simply by adding this directive we have accomplished a basic form So how do we make use of variables across, and within, our parallel threads? Of particular importance in parallel programs is how memory is managed and how and where variables can be manipulated, and OpenMP has a number of mechanisms to indicate how they should be handled. -Essentially, OpenMP provided two ways to do this for variables: +Essentially, OpenMP provides two ways to do this for variables: -- **Shared**: holds a single instance for all threads to share -- **Private**: creates and hold a separate copy of the variable for each thread +- **Shared**: A single instance of the variable is shared among all threads, meaning every thread can access +and modify the same data. This is useful for shared resources but requires careful management to prevent conflicts or +unintended behavior. +- **Private**: Each thread gets its own isolated copy of the variable, similar to how variables are +private in `if` statements or functions, where each thread’s version is independent and doesn't affect others. For example, what if we wanted to hold the thread ID and the total number of threads within variables in the code block? 
Let's start by amending the parallel code block to the following: @@ -123,17 +126,58 @@ But what about declarations outside of this block? For example: ``` Which may seem on the surface to be correct. -However this illustrates a critical point about why we need to be careful. -Now the variables declarations are outside of the parallel block, -by default, variables are *shared* across threads, which means these variables can be changed at any time by -any thread, which is potentially dangerous. -So here, `thread_id` may hold the value for another thread identifier when it's printed, -since there is an opportunity between it's assignment and it's access within `printf` to be changed in another thread. +However, this illustrates a critical point about why we need to be careful. +Now, since the variable declarations are outside the parallel block, they are, +by default, *shared* across threads. This means any thread can modify these variables at any time, +which is potentially dangerous. So here, `thread_id` may hold the value for another thread identifier when it's printed, +since there is an opportunity between its assignment and its access within `printf` to be changed in another thread. This could be particularly problematic with a much larger data set and complex processing of that data, where it might not be obvious that incorrect behaviour has happened at all, and lead to incorrect results. This is known as a *race condition*, and we'll look into them in more detail in the next episode. +But there’s another common scenario to watch out for. What happens when we want to declare a variable +outside the parallel region, make it private, and retain its initial value inside the block? +Let’s consider the following example: + +```c +int initial_value = 15; + +#pragma omp parallel private(initial_value) +{ + printf("Thread %d sees initial_value = %d\n", omp_get_thread_num(), initial_value); +} +``` +You might expect each thread to start with `initial_value` set to `15`. +However, this is not the case. When a variable is declared as `private`, each thread gets its own copy +of the variable, but those copies are **uninitialised**—they don’t inherit the value from the variable outside +the parallel region. As a result, the output may vary and include seemingly random numbers, depending on the +compiler and runtime. + +To handle this, you can use the `firstprivate` directive. With `firstprivate`, each thread gets its own +private copy of the variable, and those copies are initialised with the value from the variable outside the +parallel region. For example: + +```c +int initial_value = 15; + +#pragma omp parallel firstprivate(initial_value) +{ + printf("Thread %d sees initial_value = %d\n", omp_get_thread_num(), initial_value); +} +``` +Now, the initial value is correctly passed to each thread: + +```text +Thread 0 sees initial_value = 15 +Thread 1 sees initial_value = 15 +Thread 2 sees initial_value = 15 +Thread 3 sees initial_value = 15 + +``` +Each thread begins with initial_value set to `15`. This avoids the unpredictable +behavior of uninitialised variables and ensures that the initial value is preserved for each thread. + ::::callout ## Observing the Race Condition @@ -220,6 +264,44 @@ and how to specify different scheduling behaviours. :::callout +## Nested Loops with `collapse` + +By default, OpenMP parallelises only the outermost loop in a nested structure. 
This works fine for many cases, +but what if the outer loop doesn’t have enough iterations to keep all threads busy, or the inner loop does most +of the work? In these situations, we can use the `collapse` clause to combine the iteration +spaces of multiple loops into a single loop for parallel execution. + +For example, consider a nested loop structure: + +```c +#pragma omp parallel for +for (int i = 0; i < N; i++) { + for (int j = 0; j < M; j++) { + ... + } +} +``` +Without the `collapse` clause, the outer loop is divided into `N` iterations, and the inner loop is executed sequentially +within each thread. If `N` is small or `M` contains the bulk of the work, some threads might finish their work quickly +and sit idle, waiting for others to complete. This imbalance can slow down the overall execution of the program. + +Adding `collapse` changes this: + +```c +#pragma omp parallel for collapse(2) +for (int i = 0; i < N; i++) { + for (int j = 0; j < M; j++) { + ... + } +} +``` +The number `2` in `collapse(2)` specifies how many nested loops to combine. +Here, the two loops `(i and j)` are combined into a single iteration space with `N * M` iterations. +These iterations are then distributed across the threads, ensuring a more balanced workload. +::: + +:::callout + ## A Shortcut for Convenience The `#pragma omp parallel for` is actually equivalent to using two separate directives. @@ -229,7 +311,7 @@ For example: #pragma omp parallel { #pragma omp for - for (int 1 = 1; 1 <=10; i++) + for (int i = 1; i <=10; i++) { ... } @@ -240,7 +322,7 @@ For example: ```c #pragma omp parallel for -for (int 1 = 1; 1 <=10; i++) +for (int i = 1; i <=10; i++) { ... } @@ -304,7 +386,7 @@ wait until the others are done before the program can continue, but it's also an threads/cores idling rather than doing work. Fortunately, we can use other types of "scheduling" to control how work is divided between threads. In simple terms, a -scheduler is an algorithm which decides how to assign chunks of work to the threads. We can controller the scheduler we +scheduler is an algorithm which decides how to assign chunks of work to the threads. We can control the schedule we want to use with the `schedule` directive: ```c @@ -434,4 +516,4 @@ Then rerun. Try it with different chunk sizes too, e.g.: export OMP_SCHEDULE=static,1 ``` -::: +::: \ No newline at end of file From 478fbae9f1a0bb9f6b60cff8a30ba1917b72bdf0 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 12:54:14 +0000 Subject: [PATCH 05/34] Episode 4 updates: added progress counter example with race condition, expanded and clarified scheduler explanations, added compilable atomic example, new section on reduction clauses with examples, and clarified nested region restrictions. Fixed issues with synchronization routines, race condition explanation, and thread-specific matrix updates. Addressed indentation, typos, and other minor improvements. --- .../hpc_openmp/04_synchronisation.md | 321 ++++++++++++------ 1 file changed, 226 insertions(+), 95 deletions(-) diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index 0c8edf4a..c2475aae 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -34,7 +34,7 @@ Synchronisation is also important for data dependency, to make sure that a threa This is particularly important for algorithms which are iterative. 
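As a minimal sketch of such a dependency (this example is not from the lesson code; the array names and sizes are made up for illustration), consider an iterative stencil-style update: every sweep reads values that other threads produced in the previous sweep, so no thread can safely begin a new sweep until the previous one has finished everywhere.

```c
#include <stdio.h>
#include <omp.h>

#define N 8
#define NUM_SWEEPS 4

int main(void) {
    double current[N], next[N];
    for (int i = 0; i < N; ++i) {
        current[i] = (double) i;
    }

    #pragma omp parallel
    {
        for (int sweep = 0; sweep < NUM_SWEEPS; ++sweep) {
            /* Each thread updates part of next[] using neighbouring values from current[] */
            #pragma omp for
            for (int i = 1; i < N - 1; ++i) {
                next[i] = 0.5 * (current[i - 1] + current[i + 1]);
            }
            /* The implicit barrier at the end of the loop above means no thread starts
               copying until every element of next[] has been written */
            #pragma omp for
            for (int i = 1; i < N - 1; ++i) {
                current[i] = next[i];
            }
        }
    }

    printf("current[1] = %f\n", current[1]);

    return 0;
}
```
Here the coordination comes from the implicit barrier at the end of each `#pragma omp for` loop; the rest of this episode looks at the explicit mechanisms OpenMP provides for situations like this.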
The synchronisation mechanisms in OpenMP are incredibly important tools, as they are used to avoid race conditions. Race -conditions are have to be avoided, otherwise they *can* result in data inconsistencies if we have any in our program. A +conditions have to be avoided, otherwise they *can* result in data inconsistencies if we have any in our program. A race condition happens when two, or more, threads access and modify the same piece of data at the same time. To illustrate this, consider the diagram below: @@ -46,10 +46,10 @@ can't actually guarantee what it will be. If both threads access and modify the final value will be 1. That's because both variables read the initial value of 0, increment it by 1, and write to the shared variable. -In this case, it doesn't matter if the variable update does or doesn't happen concurrently. The inconsistency stems from +In this case, it doesn't matter if the variable update does or does not happen concurrently. The inconsistency stems from the value initially read by each thread. If, on the other hand, one thread manages to access and modify the variable -before the other thread can read its value, then we'll get the value we expect (2). For example, if thead 0 increments -the variable before thread 1 reads it, then thread 1 will read a value of 1 and increment that by 1 givusing us the +before the other thread can read its value, then we'll get the value we expect (2). For example, if thread 0 increments +the variable before thread 1 reads it, then thread 1 will read a value of 1 and increment that by 1 giving us the correct value of 2. This illustrates why it's called a race condition, because threads race each other to access and modify variables before another thread can! @@ -91,7 +91,7 @@ int main(void) { :::solution What you will notice is that when you run the program, the final value changes each time. The correct final value is -10,000 but you will often get a value that is lower than this. This is caused by a race condition, as explained in +10,000, but you will often get a value that is lower than this. This is caused by a race condition, as explained in the previous diagram where threads are incrementing the value of `value` before another thread has finished with it. So the race condition is in the parallel loop and happens because of threads reading the value of `value` before it @@ -109,11 +109,11 @@ potentially limit access to tasks or data to certain threads. ### Barriers -Barriers are a the most basic synchronisation mechanism. They are used to create a waiting point in our program. When a +Barriers are the most basic synchronisation mechanism. They are used to create a waiting point in our program. When a thread reaches a barrier, it waits until all other threads have reached the same barrier before continuing. To add a barrier, we use the `#pragma omp barrier` directive. In the example below, we have used a barrier to synchronise threads -such that they don't start the main calculation of the program until a look up table has been initialised (in parallel), -as the calculation depends on this data. +such that they don't start the main calculation of the program until a look-up table has been initialised (in parallel), +as the calculation depends on this data (See the full code [here](code/examples/04-barriers.c)). ```c #pragma omp parallel @@ -125,28 +125,42 @@ as the calculation depends on this data. 
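    /* Without the barrier below, a thread that finishes its part of the table early
       could move on to do_main_calculation() while other threads are still writing
       their entries, and would read an incomplete table */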
#pragma omp barrier /* As all threads depend on the table, we have to wait until all threads are done and have reached the barrier */ - + + /* Each thread then proceeds to its main calculation */ do_main_calculation(thread_id); } ``` - -We can also put a barrier into a parallel for loop. In the next example, a barrier is used to ensure that the -calculation for `new_matrix` is done before it is copied into `old_matrix`. +Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates +are finished before moving to the next step. For example, in the following snippet, a barrier makes sure that updates +to `new_matrix` are completed before it is copied into `old_matrix`: ```c -double old_matrix[NX][NY]; -double new_matrix[NX][NY]; +...... +int main() { + double old_matrix[NX][NY]; + double new_matrix[NX][NY]; -#pragma omp parallel for -for (int i = 0; i < NUM_ITERATIONS; ++i) { - int thread_id = omp_get_thread_num(); - iterate_matrix_solution(old_matrix, new_matrix, thread_id); - - #pragma omp barrier /* You may want to wait until new_matrix has been updated by all threads */ - - copy_matrix(new_matrix, old_matrix); + #pragma omp parallel + { + for (int i = 0; i < NUM_ITERATIONS; ++i) { + int thread_id = omp_get_thread_num(); + + iterate_matrix_solution(old_matrix, new_matrix, thread_id); /* Each thread updates a portion of the matrix */ + + #pragma omp barrier /* You may want to wait until new_matrix has been updated by all threads */ + + copy_matrix(new_matrix, old_matrix); + } + } } + ``` +:::callout{variant='note'} +OpenMP does not allow barriers to be placed directly inside `#pragma omp parallel for` loops due to restrictions +on closely [nested regions](https://www.openmp.org/spec-html/5.2/openmpse101.html#x258-27100017.1). To coordinate threads +effectively in iterative tasks like this, the loop has been rewritten using a `#pragma omp parallel` construct with +explicit loop control. You can find the full code for this example [here](code/examples/04-matrix-update.c). +::: Barriers introduce additional overhead into our parallel algorithms, as some threads will be idle whilst waiting for other threads to catch up. There is no way around this synchronisation overhead, so we need to be careful not to overuse @@ -191,15 +205,15 @@ which are used to prevent multiple threads from executing the same piece of code one of these regions, they queue up and wait their turn to access the data and execute the code within the region. The table below shows the types of synchronisation region in OpenMP. -| Region | Description | Directive | -| - | - | - | +| Region | Description | Directive | +|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------| | **critical** | Only one thread is allowed in the critical region. Threads have to queue up to take their turn. When one thread is finished in the critical region, it proceeds to execute the next chunk of code (not in the critical region) immediately without having to wait for other threads to finish. | `#pragma omp critical` | -| **single** | Single regions are used for code which needs to be executed only by a single thread, such as for I/O operations. 
The first thread to reach the region will execute the code, whilst the other threads will behave as if they've reached a barrier until the executing thread is finished. | `#pragma omp single` | -| **master** | A master region is identical to the single region other than that execution is done by the designated master thread (usually thread 0). | `#pragma omp master` | +| **single** | Single regions are used for code which needs to be executed only by a single thread, such as for I/O operations. The first thread to reach the region will execute the code, whilst the other threads will behave as if they've reached a barrier until the executing thread is finished. | `#pragma omp single` | +| **master** | A master region is identical to the single region other than that execution is done by the designated master thread (usually thread 0). | `#pragma omp master` | -The next example builds on the previous example which included a lookup table. In the the modified code, the lookup +The next example builds on the previous example which included a lookup table. In the modified code, the lookup table is written to disk after it has been initialised. This happens in a single region, as only one thread needs to -write the result to disk. +write the result to disk (See the full code [here](code/examples/04-single-region.c)). ```c #pragma omp parallel @@ -216,11 +230,22 @@ write the result to disk. } ``` -If we wanted to sum up something in parallel (e.g. a reduction operation), we need to use a critical region to prevent a -race condition when a threads is updating the reduction variable. For example, the code used in a previous exercise to -demonstrate a race condition can be fixed as such, +:::callout{variant='note'} +OpenMP has a restriction: you cannot use `#pragma omp single` or `#pragma omp master` directly inside a `#pragma omp parallel for` +loop. If you attempt this, you'll encounter an error because OpenMP does not allow these regions to be **"closely nested"** +within a parallel loop. However, there’s a useful workaround: move the `single` or `master` region into a separate function +and call that function from within the loop. This approach works because OpenMP allows these regions when they are not +explicitly part of the loop structure +::: + +If we wanted to sum up something in parallel (e.g., a reduction operation like summing an array), we would need to use a +critical region to prevent a race condition when threads update the reduction variable-the shared variable that stores +the final result. In the 'Identifying Race Conditions' challenge earlier, we saw that multiple threads +updating the same variable (**value**) at the same time caused inconsistencies—a classic race condition. +This problem can be fixed by using a critical region, which allows threads to update **value** one at a time. For example: ```c + int value = 0; #pragma omp parallel for for (int i = 0; i < NUM_TIMES; ++i) { @@ -231,24 +256,57 @@ for (int i = 0; i < NUM_TIMES; ++i) { } ``` -As we've added the critical region, only one thread can access and increment `value` at one time. This prevents the race -condition from earlier, because multiple threads no longer are able to read (and modify) the same variable before -other threads have finished with it. However in reality we shouldn't write a reduction like this, but would use the -[reduction -clause](https://www.intel.com/content/www/us/en/docs/advisor/user-guide/2023-0/openmp-reduction-operations.html) in the -`parallel for` directive, e.g. 
`#pragma omp parallel for reduction(+:value)` +However, while this approach eliminates the race condition, it introduces synchronisation overhead. +For lightweight operations like summing values, this overhead can outweigh the benefits of parallelisation. -::::challenge{id=reportingprogress, title="Reporting progress"} +:::callout -Create a program that updates a shared counter to track the progress of a parallel loop. Think about which type of -synchronisation region you can use. Can you think of any potential problems with your implementation, what happens -when you use different loop schedulers? You can use the code example below as your starting point. +### Reduction Clauses -NB: To compile this you’ll need to add `-lm` to inform the linker to link to the math C library, e.g. +A more efficient way to handle tasks like summing values is to use OpenMP's `reduction` clause. +Unlike the critical region approach, the `reduction` clause avoids explicit synchronisation by +allowing each thread to work on its own private copy of the variable. Once the loop finishes, +OpenMP combines these private copies into a single result. This not only simplifies the code but also avoids delays +caused by threads waiting to access the shared variable. + +For example, instead of using a critical region to sum values, we can rewrite the code with a `reduction` clause +as shown below: + +```c +#include +#include + +#define NUM_TIMES 10000 + +int main() { + int value = 0; + + #pragma omp parallel for reduction(+:value) + for (int i = 0; i < NUM_TIMES; ++i) { + value += 1; + } + + printf("Final value: %d\n", value); + + return 0; +} -```bash -gcc counter.c -o counter -fopenmp -lm ``` +Here, the `reduction(+:value)` directive does the work for us. During the loop, each thread maintains its +own copy of value, avoiding any chance of a race condition. When the loop ends, OpenMP automatically sums +up the private copies into the shared variable value. This means the output will always be correct—in this case, **10000**. +::: + +::::challenge{id=reportingprogress, title="Reporting progress"} +The code below attempts to track the progress of a parallel loop using a shared counter, `progress`. +However, it has a problem: the final value of progress is often incorrect, and the progress updates might be +inconsistent. +- Can you identify the issue with the current implementation? +- How would you fix it to ensure the progress counter works correctly and updates are synchronised? +- After fixing the issue, experiment with different loop schedulers (`static`, `dynamic`, `guided` and `auto`) +to observe how they affect progress reporting. + - What changes do you notice in the timing and sequence of updates when using these schedulers? + - Which scheduler produces the most predictable progress updates? ```c #include @@ -258,59 +316,120 @@ gcc counter.c -o counter -fopenmp -lm int main(int argc, char **argv) { int array[NUM_ELEMENTS] = {0}; + int progress = 0; #pragma omp parallel for schedule(static) for (int i = 0; i < NUM_ELEMENTS; ++i) { array[i] = log(i) * cos(3.142 * i); + + progress++; } + printf("Final progress: %d (Expected: %d)\n", progress, NUM_ELEMENTS); return 0; } ``` +NB: To compile this you’ll need to add `-lm` to inform the linker to link to the math C library, e.g. + +```bash +gcc counter.c -o counter -fopenmp -lm +``` :::solution +The above program tracks progress using a shared counter (`progress++`) inside the loop, +but it does so without synchronisation, leading to a race condition. 
+Since multiple threads can access and modify progress at the same time, the final value of progress will likely be incorrect. +This happens because the updates to progress are not synchronised across threads. As a result, the final value of +`progress` is often incorrect and varies across runs. You might see output like this: -To implement a progress bar, we have created two new variables: `progress` and `output_frequency`. We use `progress` -to track the number of iterations completed across all threads. To prevent a race condition, we increment progress -in a critical region. In the same critical region, we print the progress report out to screen whenever `progress` is -divisible by `output_frequency`. +```text +Final progress: 9983 (Expected: 10000) +``` + +To fix this issue, we use a critical region to synchronise updates to progress. +We also introduce a second variable, `output_frequency`, to control how often progress updates are reported +(e.g., every 10% of the total iterations). + +The corrected version: ```c #include #include #include -#define NUM_ELEMENTS 1000 +#define NUM_ELEMENTS 10000 int main(int argc, char **argv) { int array[NUM_ELEMENTS] = {0}; - int progress = 0; - int output_frequency = NUM_ELEMENTS / 10; /* output every 10% */ + int output_frequency = NUM_ELEMENTS / 10; /* Output progress every 10% */ #pragma omp parallel for schedule(static) for (int i = 0; i < NUM_ELEMENTS; ++i) { array[i] = log(i) * cos(3.142 * i); - #pragma omp critical - { - progress++; - if (progress % output_frequency == 0) { - int thread_id = omp_get_thread_num(); - printf("Thread %d: overall progress %3.0f%%\n", thread_id, - (double)progress / NUM_ELEMENTS * 100.0); - } + /* Update progress counter (with synchronisation) */ + #pragma omp critical + { + progress++; + if (progress % output_frequency == 0) { + printf("Progress: %d%%\n", (progress * 100) / NUM_ELEMENTS); + } + } } - } - return 0; + printf("Final progress: %d (Expected: %d)\n", progress, NUM_ELEMENTS); + return 0; } ``` +This implementation resolves the race condition by ensuring that only one thread can modify progress at a time. +However, this solution comes at a cost: **synchronisation overhead**. Every iteration requires threads to enter the +critical region, and if the loop body is lightweight (e.g., simple calculations), this overhead may outweigh the +computational benefits of parallelisation. For example, if each iteration takes only a few nanoseconds to compute, +the time spent waiting for access to the critical region might dominate the runtime. + +### Behaviour with different schedulers + +The static scheduler, used in the corrected version, divides iterations evenly among threads. This ensures predictable +and consistent progress updates. For instance, progress increments occur at regular intervals (e.g., 10%, 20%, etc.), +producing output like: + +``` +Progress: 10% +Progress: 20% +Progress: 30% +... +Final progress: 10000 (Expected: 10000) +``` + +When experimenting with other schedulers, such as `dynamic` or `guided`, +the timing and sequence of updates change due to differences in how iterations are assigned to threads. + +With the `dynamic` scheduler, threads are assigned smaller chunks of iterations as they finish their current work. +This can lead to progress updates appearing irregular, as threads complete their chunks at varying speeds based on +workload. For example: + +``` +Progress: 15% +Progress: 30% +Progress: 55% +... 
+Final progress: 10000 (Expected: 10000) +``` +Using the `guided` scheduler results in yet another pattern. Threads start with larger chunks of iterations, +and the chunk size decreases as the loop progresses. This often leads to progress updates being sparse at the start but +becoming more frequent toward the end of the loop: -One problem with this implementation is that tracking progress like this introduces a synchronisation overhead at -the end of each iteration, because of the critical region. In small loops like this, there's usually no reason to -track progress as the synchronisation overheads could be more significant than the time required to calculate each -array element! +``` +Progress: 25% +Progress: 70% +Progress: 100% +Final progress: 10000 (Expected: 10000) +``` +The `auto` scheduler, on the other hand, leaves the decision about iteration assignment to the OpenMP runtime system. +This provides flexibility, as the runtime adapts the scheduling to optimise for the specific platform and workload. +However, because `auto` is implementation-dependent, the timing and predictability of progress updates can vary and +are harder to generalise. ::: :::: @@ -318,8 +437,8 @@ array element! A large amount of the time spent writing a parallel OpenMP application is usually spent preventing race conditions, rather than on the parallelisation itself. Earlier in the episode, we looked at critical regions as a way to synchronise -threads and explored how be used to prevent race conditions in the previous exercise. In the rest of this section, we -will look at the other mechanisms which can prevent race conditions, namely by setting locks or by using atomic +threads and explored how they can be used to prevent race conditions in the previous exercise. In the rest of this section, we +will look at the other mechanisms which can prevent race conditions, such as setting locks or using atomic operations. ### Locks @@ -328,14 +447,14 @@ Critical regions provide a convenient and straightforward way to synchronise thr race conditions. But in some cases, critical regions may not be flexible or granular enough and lead to an excessive amount of serialisation. If this is the case, we can use *locks* instead to achieve the same effect as a critical region. Locks are a mechanism in OpenMP which, just like a critical regions, create regions in our code which only one -thread can be in at one time. The main advantage of locks, over a critical region, is that we can be far more flexible -with locks to protect different sized or fragmented regions of code, giving us more granular control over thread -synchronisation. Locks are also far more flexible when it comes to making our code more modular, as it is possible to +thread can be in at one time. The main advantage of locks, compared to critical regions, is that they provide more +granular control over thread synchronisation by protecting different-sized or fragmented regions of code, therefore +allowing greater flexibility. Locks are also far more flexible when it comes to making our code more modular, as it is possible to nest locks, or for accessing and modifying global variables. In comparison to critical regions, however, locks are more complicated and difficult to use. Instead of using a single `#pragma`, we have to initialise and free resources used for the locks, as well as set and unset where locks are in -effect. If we make a mistake and forget to unset a lock, then we lose all of the parallelism and could potentially +effect. 
If we make a mistake and forget to unset a lock, then we lose all the parallelism and could potentially create a deadlock! To create a lock and delete a lock, we use `omp_init_lock()` and `omp_destroy_lock()` respectively. @@ -389,36 +508,47 @@ forget to or unset a lock in the wrong place. ### Atomic operations -Another mechanism are atomic operations. In computing, an atomic operation is an operation which is performed without -interrupted, meaning that one initiated, they are guaranteed to execute without interference from other operations. In -OpenMP, this means atomic operations are operations which are done without interference from other threads. If we make -modifying some value in an array atomic, then it's guaranteed, by the compiler, that no other thread can read or modify -that array until the atomic operation is finished. You can think of it as a thread having, temporary, exclusive access -to something in our program. Sort of like a "one at a time" rule for accessing and modifying parts of the program. +Another mechanism is atomic operations. In computing, an atomic operation is an operation which is performed without +interruption, meaning that once initiated, it is guaranteed to execute without interference from other operations. +In OpenMP, this refers to operations that execute without interference from other threads. If we make an operation modifying a value in +an array atomic, the compiler guarantees that no other thread can read or modify that array until the atomic operation is +finished. You can think of it as a thread having temporary exclusive access to something in our program, similar to a +'one at a time' rule for accessing and modifying parts of the program. To do an atomic operation, we use the `omp atomic` pragma before the operation we want to make atomic. ```c -int shared_variable = 0; -int shared_array[4] = {0, 0, 0, 0}; +#include +#include -/* Put the pragma before the shared variable */ -#pragma omp parallel -{ - #pragma omp atomic - shared_variable += 1; -} +int main() { + + int shared_variable = 0; + int shared_array[4] = {0, 0, 0, 0}; + + /* Put the pragma before the shared variable */ + #pragma omp parallel + { + #pragma omp atomic + shared_variable += 1; + printf("Shared variable updated: %d\n", shared_variable); + } + + /* Can also use in a parallel for */ + #pragma omp parallel for + for (int i = 0; i < 4; ++i) { + #pragma omp atomic + shared_array[i] += 1; + printf("Shared array element %d updated: %d\n", i, shared_array[i]); + + } -/* Can also use in a parallel for */ -#pragma omp parallel for -for (int i = 0; i < 4; ++i) { - #pragma omp atomic - shared_array[i] += 1; } + ``` Atomic operations are for single line operations or piece of code. As in the example above, we can do an atomic -operation when we are updating variable but we can also do other things such as atomic assignment. Atomic operations are +operation when we are updating variable, but we can also do other things such as atomic assignment. Atomic operations are often less expensive than critical regions or locks, so they should be preferred when they can be used. However, it's still important to not be over-zealous with using atomic operations as they can still introduce synchronisation overheads which can damage the parallel performance. @@ -439,7 +569,8 @@ Critical regions and locks are more appropriate when: Atomic operations are good when: -- The operation which needs synchronisation is simple, such as needing to protect a single variable update in the parallel algorithm. 
+- The operation which needs synchronisation is simple, such as needing to protect a single variable update in the parallel +algorithm. - There is low contention for shared data. - When you need to be as performant as possible, as atomic operations generally have the lowest performance cost. @@ -483,7 +614,7 @@ int main(int argc, char **argv) { When we run the program multiple times, we expect the output `sum` to have the value of `0.000000`. However, due to an existing race condition, the program can sometimes produce wrong output in different runs, as shown below: -```text +``` 1. Sum: 1.000000 2. Sum: -1.000000 3. Sum: 2.000000 @@ -491,7 +622,7 @@ existing race condition, the program can sometimes produce wrong output in diffe 5. Sum: 2.000000 ``` -Find and fix the race condition in the program. Try using both an atomic operation and by using locks. +Find and fix the race condition in the program using both an atomic operation and locks. :::solution @@ -545,4 +676,4 @@ for (int i = 0; i < ARRAY_SIZE; ++i) { ``` ::: -:::: +:::: \ No newline at end of file From 8edc325c339356bffaaf71f095bc14b5ee91e20a Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 12:59:16 +0000 Subject: [PATCH 06/34] Episode 4 code updates: corrected indentation in race condition lock solution --- .../code/solutions/04-race-condition-lock.c | 38 ++++++++++--------- 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/high_performance_computing/hpc_openmp/code/solutions/04-race-condition-lock.c b/high_performance_computing/hpc_openmp/code/solutions/04-race-condition-lock.c index 3c43cbea..8b79e244 100644 --- a/high_performance_computing/hpc_openmp/code/solutions/04-race-condition-lock.c +++ b/high_performance_computing/hpc_openmp/code/solutions/04-race-condition-lock.c @@ -5,29 +5,31 @@ #define ARRAY_SIZE 524288 int main(int argc, char **argv) { - float sum = 0; - float array[ARRAY_SIZE]; + float sum = 0; + float array[ARRAY_SIZE]; - omp_set_num_threads(4); + omp_set_num_threads(4); -#pragma omp parallel for schedule(static) - for (int i = 0; i < ARRAY_SIZE; ++i) { - array[i] = cos(M_PI * i); - } + #pragma omp parallel for schedule(static) + for (int i = 0; i < ARRAY_SIZE; ++i) { + array[i] = cos(M_PI * i); + } - omp_lock_t lock; - omp_init_lock(&lock); + omp_lock_t lock; + omp_init_lock(&lock); -#pragma omp parallel for schedule(static) - for (int i = 0; i < ARRAY_SIZE; i++) { - omp_set_lock(&lock); - sum += array[i]; - omp_unset_lock(&lock); - } - printf("Sum: %f\n", sum); + #pragma omp parallel for schedule(static) + for (int i = 0; i < ARRAY_SIZE; i++) { + omp_set_lock(&lock); + sum += array[i]; + omp_unset_lock(&lock); + } - omp_destroy_lock(&lock); - return 0; + printf("Sum: %f\n", sum); + + omp_destroy_lock(&lock); + + return 0; } \ No newline at end of file From a5638f95b74a116f18477f2a5ee53c8d4af3b75f Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Mon, 2 Dec 2024 13:02:30 +0000 Subject: [PATCH 07/34] Episode 4 code updates: added single region example full code --- .../code/examples/04-single-region.c | 49 +++++++++++++++++++ 1 file changed, 49 insertions(+) create mode 100644 high_performance_computing/hpc_openmp/code/examples/04-single-region.c diff --git a/high_performance_computing/hpc_openmp/code/examples/04-single-region.c b/high_performance_computing/hpc_openmp/code/examples/04-single-region.c new file mode 100644 index 00000000..726a8dc2 --- /dev/null +++ b/high_performance_computing/hpc_openmp/code/examples/04-single-region.c @@ -0,0 +1,49 @@ +#include +#include + +#define TABLE_SIZE 8 + + +void initialise_lookup_table(int thread_id, double lookup_table[TABLE_SIZE]) { + int num_threads = omp_get_num_threads(); + for (int i = thread_id; i < TABLE_SIZE; i += num_threads) { + lookup_table[i] = thread_id * 2; + printf("Thread %d initializing lookup_table[%d] = %f\n", thread_id, i, lookup_table[i]); + } +} + + +void write_table_to_disk(double lookup_table[TABLE_SIZE]) { + printf("Writing lookup table to disk:\n"); + for (int i = 0; i < TABLE_SIZE; ++i) { + printf("lookup_table[%d] = %f\n", i, lookup_table[i]); + } +} + + +void do_main_calculation(int thread_id) { + printf("Thread %d performing its main calculation.\n", thread_id); +} + +int main() { + double lookup_table[TABLE_SIZE] = {0}; + + #pragma omp parallel + { + int thread_id = omp_get_thread_num(); + + + initialise_lookup_table(thread_id, lookup_table); + + #pragma omp barrier + + #pragma omp single + { + write_table_to_disk(lookup_table); + } + + do_main_calculation(thread_id); + } + + return 0; +} \ No newline at end of file From 21f135dfdcf14229328878fdeed99c33a87e41b5 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 13:04:32 +0000 Subject: [PATCH 08/34] Episode 4 code updates: added matrix update example full code --- .../code/examples/04-matrix-update.c | 41 +++++++++++++++++++ 1 file changed, 41 insertions(+) create mode 100644 high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c diff --git a/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c new file mode 100644 index 00000000..f43bf1c5 --- /dev/null +++ b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c @@ -0,0 +1,41 @@ +#include +#include + +#define NX 4 +#define NY 4 +#define NUM_ITERATIONS 10 + +void iterate_matrix_solution(double old_matrix[NX][NY], double new_matrix[NX][NY], int thread_id) { + for (int j = 0; j < NY; ++j) { + new_matrix[thread_id][j] = old_matrix[thread_id][j] + 1; + } +} + +void copy_matrix(double src[NX][NY], double dest[NX][NY]) { + for (int i = 0; i < NX; ++i) { + for (int j = 0; j < NY; ++j) { + dest[i][j] = src[i][j]; + } + } +} + +int main() { + double old_matrix[NX][NY] = {{0}}; + double new_matrix[NX][NY] = {{0}}; + + #pragma omp parallel + { + for (int i = 0; i < NUM_ITERATIONS; ++i) { + int thread_id = omp_get_thread_num(); + + iterate_matrix_solution(old_matrix, new_matrix, thread_id); /* Each thread updates a portion of the matrix */ + + #pragma omp barrier /* Wait until all threads complete updates to new_matrix */ + + copy_matrix(new_matrix, old_matrix); + } + } + + printf("Matrix update complete.\n"); + return 0; +} \ No newline at end of file From f4bfb83df4f6adc2a13adcc0c02d98da38a61239 Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Mon, 2 Dec 2024 13:06:06 +0000 Subject: [PATCH 09/34] Episode 4 code updates: added barrier example code --- .../hpc_openmp/code/examples/04-barriers.c | 42 +++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 high_performance_computing/hpc_openmp/code/examples/04-barriers.c diff --git a/high_performance_computing/hpc_openmp/code/examples/04-barriers.c b/high_performance_computing/hpc_openmp/code/examples/04-barriers.c new file mode 100644 index 00000000..02693d78 --- /dev/null +++ b/high_performance_computing/hpc_openmp/code/examples/04-barriers.c @@ -0,0 +1,42 @@ +#include +#include + +#define TABLE_SIZE 8 + +/* Function to initialise the lookup table */ +void initialise_lookup_table(int thread_id, double lookup_table[TABLE_SIZE]) { + int num_threads = omp_get_num_threads(); + for (int i = thread_id; i < TABLE_SIZE; i += num_threads) { + lookup_table[i] = thread_id * 2; /* Each thread initialises its own portion */ + printf("Thread %d initializing lookup_table[%d] = %f\n", thread_id, i, lookup_table[i]); + } +} + +/* Function to perform the main calculation */ +void do_main_calculation(int thread_id, double lookup_table[TABLE_SIZE]) { + int num_threads = omp_get_num_threads(); + for (int i = thread_id; i < TABLE_SIZE; i += num_threads) { + printf("Thread %d processing lookup_table[%d] = %f\n", thread_id, i, lookup_table[i]); + } +} + +int main() { + double lookup_table[TABLE_SIZE] = {0}; /* Initialise the lookup table to zeros */ + + #pragma omp parallel + { + int thread_id = omp_get_thread_num(); + + /* The initialisation of the lookup table is done in parallel */ + initialise_lookup_table(thread_id, lookup_table); + + #pragma omp barrier /* As all threads depend on the table, we have to wait until all threads + are done and have reached the barrier */ + + + /* Each thread then proceeds to its main calculation */ + do_main_calculation(thread_id, lookup_table); + } + + return 0; +} \ No newline at end of file From 6bb3338ac2d32017c6dd8aeae5bd1b08862c809a Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 13:08:46 +0000 Subject: [PATCH 10/34] Episode 5 updates: Fixed some typos --- .../hpc_openmp/05_hybrid_parallelism.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md index 80aaae67..ab1c3312 100644 --- a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md +++ b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md @@ -58,7 +58,7 @@ the data that threads in another MPI process have access to due to each MPI proc still possible to communicate thread-to-thread, but we have to be very careful and explicitly set up communication between specific threads using the parent MPI processes. -As an example of how resources could be split using an MPI+OpenMP approach, consider a HPC cluster with some number of +As an example of how resources could be split using an MPI+OpenMP approach, consider an HPC cluster with some number of compute nodes with each having 64 CPU cores. One approach would be to spawn one MPI process per rank which spawns 64 OpenMP threads, or 2 MPI processes which both spawn 32 OpenMP threads, and so on and so forth. 
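To see how a given rank/thread split actually comes out at run time, a minimal hybrid sketch such as the one below can help (this file is not part of the lesson's examples and the name `check-split.c` is just for illustration); every MPI rank simply reports the size of the OpenMP thread team it will spawn.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    /* Each MPI process spawns its own, independent team of OpenMP threads */
    #pragma omp parallel
    {
        #pragma omp single
        printf("Rank %d of %d will use %d OpenMP threads\n",
               rank, num_ranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```
Compiled with something like `mpicc -fopenmp check-split.c -o check-split`, running `OMP_NUM_THREADS=32 mpirun -n 2 ./check-split` corresponds to two ranks each spawning 32 threads, while `OMP_NUM_THREADS=64 mpirun -n 1 ./check-split` gives the single-rank equivalent.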
@@ -66,7 +66,7 @@ OpenMP threads, or 2 MPI processes which both spawn 32 OpenMP threads, and so on #### Improved memory efficiency -Since MPI processes each have their own private memory space, there is almost aways some data replication. This could be +Since MPI processes each have their own private memory space, there is almost always some data replication. This could be on small pieces of data, such as some physical constants each MPI rank needs, or it could be large pieces of data such a grid of data or a large dataset. When there is large data being replicated in each rank, the memory requirements of an MPI program can rapidly increase making it unfeasible to run on some systems. In an OpenMP application, we don't have to @@ -82,7 +82,7 @@ can more easily control the work balance, in comparison to a pure MPI implementa schedulers to address imbalance on a node. There is typically also a reduction in communication overheads, as there is no communication required between threads (although this overhead may be replaced by thread synchronisation overheads) which can improve the performance of algorithms which previously required communication such as those which require -exchanging data between overlapping sub-domains (halo exchange). +exchanging data between overlapping subdomains (halo exchange). ### Disadvantages @@ -90,7 +90,7 @@ exchanging data between overlapping sub-domains (halo exchange). Writing *correct* and efficient parallel code in pure MPI and pure OpenMP is hard enough, so combining both of them is, naturally, even more difficult to write and maintain. Most of the difficulty comes from having to combine both -parallelism models in an easy to read and maintainable fashion, as the interplay between the two parallelism models adds +parallelism models in an easy-to-read and maintainable fashion, as the interplay between the two parallelism models adds complexity to the code we write. We also have to ensure we do not introduce any race conditions, making sure to synchronise threads and ranks correctly and at the correct parts of the program. Finally, because we are using two parallelism models, MPI+OpenMP code bases are larger than a pure MPI or OpenMP version, making the overall @@ -114,13 +114,13 @@ Most of this can, however, be mitigated with good documentation and a robust bui So, when should we use a hybrid scheme? A hybrid scheme is particularly beneficial in scenarios where you need to leverage the strength of both the shared and distributed-memory parallelism paradigms. MPI is used to exploit lots of -resources across nodes on a HPC cluster, whilst OpenMP is used to efficiently (and somewhat easily) parallelise the work +resources across nodes on an HPC cluster, whilst OpenMP is used to efficiently (and somewhat easily) parallelise the work each MPI task is required to do. The most common reason for using a hybrid scheme is for large-scale simulations, where the workload doesn't fit or work efficiently in a pure MPI or OpenMP implementation. This could be because of memory constraints due to data replication, or due to poor/complex workload balance which are difficult to handle in MPI, or because of inefficient data access -patterns from how ranks are coordinated. Of course, your mileage may vary and it is not always appropriate to use a +patterns from how ranks are coordinated. Of course, your mileage may vary, and it is not always appropriate to use a hybrid scheme. 
It could be better to think about other ways or optimisations to decrease overheads and memory requirements, or to take a different approach to improve the work balance. @@ -134,12 +134,12 @@ parallelised. Specifically, we will write a program to solve the integral to com $$ \int_{0}^{1} \frac{4}{1 + x^{2}} ~ \mathrm{d}x = 4 \tan^{-1}(x) = \pi $$ There are a plethora of methods available to numerically evaluate this integral. To keep the problem simple, we will -re-cast the integral into a easier-to-code summation. How we got here isn't that important for our purposes, but what we +re-cast the integral into an easier-to-code summation. How we got here isn't that important for our purposes, but what we will be implementing in code is the following summation, $$ \pi = \lim_{n \to \infty} \sum_{i = 0}^{n} \frac{1}{n} ~ \frac{4}{1 + x_{i}^{2}} $$ -where $x_{i}$ is the the midpoint of the $i$-th rectangle. To get an accurate approximation of $\pi$, we'll need to +where $x_{i}$ is the midpoint of the $i$-th rectangle. To get an accurate approximation of $\pi$, we'll need to split the domain into a large number of smaller rectangles. ### A simple parallel implementation using OpenMP @@ -196,7 +196,7 @@ Calculated pi 3.141593 error 0.000000 Total time = 34.826832 seconds ``` -You should see that we've compute an accurate approximation of $\pi$, but it also took a very long time at 35 seconds! +You should see that we've computed an accurate approximation of $\pi$, but it also took a very long time at 35 seconds! To speed this up, let's first parallelise this using OpenMP. All we need to do, for this simple application, is to use a `parallel for` to split the loop between OpenMP threads as shown below. @@ -231,9 +231,9 @@ Total time = 5.166490 seconds ### A hybrid implementation using MPI and OpenMP Now that we have a working parallel implementation using OpenMP, we can now expand our code to a hybrid parallel code by -implementing MPI. In this example, we can porting an OpenMP code to a hybrid MPI+OpenMP application but we could have +implementing MPI. In this example, we wil be porting an OpenMP code to a hybrid MPI+OpenMP application, but we could have also done this the other way around by porting an MPI code into a hybrid application. Neither *"evolution"* is more -common or better than the other, the route each code takes toward becoming hybrid is different. +common nor better than the other, the route each code takes toward becoming hybrid is different. So, how do we split work using a hybrid approach? For an embarrassingly parallel problem, such as the one we're working on, we can split the problem size into smaller chunks across MPI ranks and use OpenMP to parallelise the work. For example, consider @@ -378,8 +378,8 @@ Total time = 5.818889 seconds Ouch, this took longer to run than the pure OpenMP implementation (although only marginally longer in this example!). You may have noticed that we have 8 MPI ranks, each of which are spawning 8 of their own OpenMP threads. This is an important thing to realise. When you specify the number of threads for OpenMP to use, this is the number of threads -*each* MPI process will spawn. So why did it take longer? With each of the 8 MPI ranks spawning 8 threads, 64 threads -threads were in flight. More threads means more overheads and if, for instance, we have 8 CPU Cores, then contention +*each* MPI process will spawn. So why did it take longer? With each of the 8 MPI ranks spawning 8 threads, 64 threads +were in flight. 
More threads means more overheads and if, for instance, we have 8 CPU Cores, then contention arises as each thread competes for access to a CPU core. Let's improve this situation by using a combination of rank and threads so that $N_{\mathrm{ranks}} N_{\mathrm{threads}} @@ -448,4 +448,4 @@ was, rather naturally, when either $N_{\mathrm{ranks}} = 1$, $N_{\mathrm{threads $N_{\mathrm{threads}} = 1$ with the former being slightly faster. Otherwise, we found the best balance was $N_{\mathrm{ranks}} = 2$, $N_{\mathrm{threads}} = 3$. ::: -:::: +:::: \ No newline at end of file From bbd94bcf2cd21024e9177fdec5b760606742d293 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 13:47:41 +0000 Subject: [PATCH 11/34] Episode 3 updates: Fixed some typos and revised the text discussing deadlocks --- .../hpc_mpi/03_communicating_data.md | 52 +++++++++++-------- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/high_performance_computing/hpc_mpi/03_communicating_data.md b/high_performance_computing/hpc_mpi/03_communicating_data.md index 3cf757ad..fd768d90 100644 --- a/high_performance_computing/hpc_mpi/03_communicating_data.md +++ b/high_performance_computing/hpc_mpi/03_communicating_data.md @@ -27,7 +27,7 @@ access data from one rank to another, we use the MPI API to transfer data in a " which contains the data we want to send, and is usually expressed as a collection of data elements of a particular data type. -Sending and receiving data can happen happen in two patterns. We either want to send data from one specific rank to +Sending and receiving data can happen in two patterns. We either want to send data from one specific rank to another, known as point-to-point communication, or to/from multiple ranks all at once to a single or multiples targets, known as collective communication. Whatever we do, we always have to *explicitly* "send" something and to *explicitly* "receive" something. Data communication can't happen by itself. A rank can't just get data from one rank, and ranks @@ -60,8 +60,8 @@ When we send a message, MPI needs to know the size of the data being transferred data being sent, as you may expect, but is instead the number of elements of a specific data type being sent. When we send a message, we have to tell MPI how many elements of "something" we are sending and what type of data it is. If we don't do this correctly, we'll either end up telling MPI to send only *some* of the data or try to send more data than -we want! For example, if we were sending an array and we specify too few elements, then only a subset of the array will -be sent or received. But if we specify too many elements, than we are likely to end up with either a segmentation fault +we want! For example, if we were sending an array, and we specify too few elements, then only a subset of the array will +be sent or received. But if we specify too many elements, then we are likely to end up with either a segmentation fault or undefined behaviour! And the same can happen if we don't specify the correct data type. There are two types of data type in MPI: "basic" data types and derived data types. 
The basic data types are in essence
@@ -70,7 +70,7 @@ primitive C types in its API, using instead a set of constants which internally
types are in the table below:

| MPI basic data type | C equivalent |
-| ---------------------- | ---------------------- |
+|------------------------|------------------------|
| MPI_SHORT | short int |
| MPI_INT | int |
| MPI_LONG | long int |
@@ -91,8 +91,8 @@ Remember, these constants aren't the same as the primitive types in C, so we can
MPI_INT my_int = 1;
```

-is not valid code because, under the hood, these constants are actually special data structures used by MPI. Therefore
-we can only them as arguments in MPI functions.
+is not valid code because, under the hood, these constants are actually special data structures used by MPI.
+Therefore, we can only use them as arguments in MPI functions.

::::callout

@@ -113,7 +113,9 @@ INT_TYPE my_int = 1;
```
::::

-Derived data types, on the other hand, are very similar to C structures which we define by using the basic MPI data types. They're often useful to group together similar data in communications, or when you need to send a structure from one rank to another. This is covered in more detail in the optional [Advanced Communication Techniques](../hpc_mpi/11_advanced_communication.md) episode.
+Derived data types, on the other hand, are very similar to C structures which we define by using the basic MPI data types.
+They are often useful for grouping similar data in communications or when sending a structure from one rank to another.
+This is covered in more detail in the optional [Advanced Communication Techniques](11_advanced_communication.md) episode.

:::::challenge{id=what-type, title="What Type Should You Use?"}
For the following pieces of data, what MPI data types should you use?
@@ -124,7 +126,7 @@ For the following pieces of data, what MPI data types should you use?

::::solution

-The fact that `a[]` is an array does not matter, because all of the elemnts in `a[]` will be the same data type. In MPI, as we'll see in the next episode, we can either send a single value or multiple values (in an array).
+The fact that `a[]` is an array does not matter, because all the elements in `a[]` will be the same data type. In MPI, as we'll see in the next episode, we can either send a single value or multiple values (in an array).

1. `MPI_INT`
2. `MPI_DOUBLE` - `MPI_FLOAT` would not be correct as `float`'s contain 32 bits of data whereas `double`s are 64 bit.
@@ -150,11 +152,12 @@ In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ra

## Communication modes

-When sending data between ranks, MPI will use one of four communication modes: synchronous, buffered, ready or standard. When a communication function is called, it takes control of program execution until the send-buffer is safe to be re-used again. What this means is that it's safe to re-use the memory/variable you passed without affecting the data that is still being sent. If MPI didn't have this concept of safety, then you could quite easily overwrite or destroy any data before it is transferred fully! This would lead to some very strange behaviour which would be hard to debug. The difference between the communication mode is when the buffer becomes safe to re-use. MPI won't guess at which mode *should* be used. That is up to the programmer. Therefore each mode has an associated communication function:
+When sending data between ranks, MPI will use one of four communication modes: synchronous, buffered, ready or standard. 
When a communication function is called, it takes control of program execution until the send-buffer is safe to be re-used again. What this means is that it's safe to re-use the memory/variable you passed without affecting the data that is still being sent. If MPI didn't have this concept of safety, then you could quite easily overwrite or destroy any data before it is transferred fully! This would lead to some very strange behaviour which would be hard to debug. The difference between the communication mode is when the buffer becomes safe to re-use. MPI won't guess at which mode *should* be used. +That is up to the programmer. Therefore, each mode has an associated communication function: | Mode | Blocking function | -| ----------- | ----------------- | +|-------------|-------------------| | Synchronous | `MPI_SSend()` | | Buffered | `MPI_Bsend()` | | Ready | `MPI_Rsend()` | @@ -163,7 +166,7 @@ When sending data between ranks, MPI will use one of four communication modes: s In contrast to the four modes for sending data, receiving data only has one mode and therefore only a single function. | Mode | MPI Function | -| ------- | ------------ | +|---------|--------------| | Receive | `MPI_Recv()` | ### Synchronous sends @@ -203,7 +206,8 @@ then you may have to repeat yourself to make sure your transmit the information ### Standard sends -The standard send mode is the most commonly used type of send, as it provides a balance between ease of use and performance. Under the hood, the standard send is either a buffered or a synchronous send, depending on the availability of system resources (e.g. the size of the internal buffer) and which mode MPI has determined to be the most efficient. +The standard send mode is the most commonly used type of send, as it provides a balance between ease of use and performance. +Under the hood, the standard send is either a buffered or a synchronous send, depending on the availability of system resources (e.g. the size of the internal buffer) and which mode MPI has determined to be the most efficient. ::::callout @@ -217,9 +221,9 @@ the standard send, `MPI_Send()`, and optimise later. If the standard send doesn' ## Communication mode summary: -|Mode | Description | Analogy | MPI Function | -| --- | ----------- | ------- | ------------ | -| Synchronous | Returns control to the program when the message has been sent and received successfully. | Making a phone call | `MPI_Ssend()`| +| Mode | Description | Analogy | MPI Function | +|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------| +| Synchronous | Returns control to the program when the message has been sent and received successfully. | Making a phone call | `MPI_Ssend()` | | Buffered | Returns control immediately after copying the message to a buffer, regardless of whether the receive has happened or not. | Sending a letter or e-mail | `MPI_Bsend()` | | Ready | Returns control immediately, assuming the matching receive has already been posted. Can lead to errors if the receive is not ready. | Talking to someone you think/hope is listening | `MPI_Rsend()` | | Standard | Returns control when it's safe to reuse the send buffer. May or may not wait for the matching receive (synchronous mode), depending on MPI implementation and message size. 
| Phone call or letter | `MPI_Send()` | @@ -228,7 +232,7 @@ the standard send, `MPI_Send()`, and optimise later. If the standard send doesn' ### Blocking vs. non-blocking communication In addition to the communication modes, communication is done in two ways: either by blocking execution -until the communication is complete (like how a synchronous send blocks until an receive acknowledgment is sent back), +until the communication is complete (like how a synchronous send blocks until a **receive** acknowledgment is sent back), or by returning immediately before any part of the communication has finished, with non-blocking communication. Just like with the different communication modes, MPI doesn't decide if it should use blocking or non-blocking communication calls. That is, again, up to the programmer to decide. As we'll see in later episodes, there are different functions @@ -249,11 +253,12 @@ The buffered communication mode is a type of asynchronous communication, because `MPI_Ibsend()` (more on this later). Even though the data transfer happens in the background, allocating and copying data to the send buffer happens in the foreground, blocking execution of our program. On the other hand, `MPI_Ibsend()` is "fully" asynchronous because even allocating and copying data to the send buffer happens in the background. ::: -One downside to blocking communication is that if rank B is never listening for messages, rank A will become *deadlocked*. A deadlock happens when your program hangs indefinitely because the send (or receive) is unable to -complete. Deadlocks occur for a countless number of reasons. For example, we may forget to write the corresponding -receive function when sending data. Or a function may return earlier due to an error which isn't handled properly, or a -while condition may never be met creating an infinite loop. Ranks can also can silently, making communication with them -impossible, but this doesn't stop any attempts to send data to crashed rank. +One downside to blocking communication is that if rank B is never listening for messages, rank A will become *deadlocked*. A deadlock +happens when your program hangs indefinitely because the **send** (or **receive**) operation is unable to +complete. Deadlocks can happen for countless number of reasons. For example, we might forget to write the corresponding +**receive** function when sending data. Or a function may return earlier due to an error which isn't handled properly, or a +**while** condition may never be met creating an infinite loop. Ranks can also silently fail, making communication with them +impossible, but this does not stop any attempts to send data to crashed rank. ::::callout @@ -264,7 +269,8 @@ A common piece of advice in C is that when allocating memory using `malloc()`, a You can apply the same mantra to communication in MPI. When you send data, always write the code to receive the data as you may forget to later and accidentally cause a deadlock. :::: -Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. If the workload is well balanced, each rank will finish at roughly the same time and be ready to transfer data at the same time. 
But, as shown in the diagram below, if the workload is unbalanced, some ranks will finish their calculations earlier and begin to send their data to the other ranks before they are ready to receive data. This means some ranks will be sitting around doing nothing whilst they wait for the other ranks to become ready to receive data, wasting computation time. +Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. +If the workload is well-balanced, each rank will finish at roughly the same time and be ready to transfer data at the same time. But, as shown in the diagram below, if the workload is unbalanced, some ranks will finish their calculations earlier and begin to send their data to the other ranks before they are ready to receive data. This means some ranks will be sitting around doing nothing whilst they wait for the other ranks to become ready to receive data, wasting computation time. ![Blocking communication](fig/blocking-wait.png) @@ -299,4 +305,4 @@ Until the other person responds, we are stuck waiting for the response. Sending e-mails or letters in the post is a form of non-blocking communication we're all familiar with. When we send an e-mail, or a letter, we don't wait around to hear back for a response. We instead go back to our lives and start doing tasks instead. We can periodically check our e-mail for the response, and either keep doing other tasks or continue our previous task once we've received a response back from our e-mail. :::: -::::: +::::: \ No newline at end of file From e83a850d8de58913dc2f869f99683322adc7af8b Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 13:53:13 +0000 Subject: [PATCH 12/34] Episode 4 updates: Fixed code examples, clarified rank independence, addressed buffer initialisation issues, and improved explanations for MPI concepts. --- .../04_point_to_point_communication.md | 109 ++++++++++++------ 1 file changed, 72 insertions(+), 37 deletions(-) diff --git a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md index 5e96b262..4e111696 100644 --- a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md +++ b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md @@ -14,7 +14,8 @@ learningOutcomes: --- In the previous episode we introduced the various types of communication in MPI. -In this section we will use the MPI library functions `MPI_Send()` and `MPI_Recv()`, which employ point-to-point communication, to send data from one rank to another. +In this section we will use the MPI library functions `MPI_Send()` and `MPI_Recv()`, which employ point-to-point communication, +to send data from one rank to another. ![Sending data from one rank to another using MPI_SSend and MPI_Recv()](fig/send-recv.png) @@ -43,14 +44,14 @@ int MPI_Send( MPI_Comm communicator ) ``` -| | | -| ------- | -------- | -| `*data`: | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | -| `count`: | Number of elements to send | -| `datatype`: | The type of the element data being sent, e.g. MPI_INTEGER, MPI_CHAR, MPI_FLOAT, MPI_DOUBLE, ... 
| -| `destination`: | The rank number of the rank the data will be sent to | -| `tag`: | An message tag (integer), which is used to differentiate types of messages. We can specify `0` if we don't need different types of messages | -| `communicator`: | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | +| | | +|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------| +| `*data`: | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | +| `count`: | Number of elements to send | +| `datatype`: | The type of the element data being sent, e.g. MPI_INTEGER, MPI_CHAR, MPI_FLOAT, MPI_DOUBLE, ... | +| `destination`: | The rank number of the rank the data will be sent to | +| `tag`: | An message tag (integer), which is used to differentiate types of messages. We can specify `0` if we don't need different types of messages | +| `communicator`: | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | For example, if we wanted to send a message that contains `"Hello, world!\n"` from rank 0 to rank 1, we could state @@ -61,7 +62,9 @@ char *message = "Hello, world!\n"; MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); ``` -So we are sending 14 elements of `MPI_CHAR()` one time, and specified `0` for our message tag since we don't anticipate having to send more than one type of message. This call is synchronous, and will block until the corresponding `MPI_Recv()` operation receives and acknowledges receipt of the message. +So we are sending 14 elements of `MPI_CHAR()` one time, and specified `0` for our message tag since we don't anticipate +having to send more than one type of message. This call is synchronous, and will block until the corresponding `MPI_Recv()` +operation receives and acknowledges receipt of the message. 
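
To make the pairing of `count` and `datatype` more concrete, here is one extra minimal sketch; the array name, its length and the destination rank are our own illustrative choices rather than part of the original example. Sending ten doubles works exactly like sending the fourteen characters above:

```c
double samples[10] = {0.0}; /* count is the number of elements (10), not the number of bytes */
MPI_Send(samples, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
```

The matching `MPI_Recv()` on rank 1 would then use the same count and `MPI_DOUBLE` as its data type.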
::::callout

@@ -96,27 +99,29 @@ int MPI_Recv(
)
```

-| | |
-| --- | ---- |
-| `*data`: | Pointer to where the received data should be written |
-| `count`: | Maximum number of elements to receive |
-| `datatype`: | The type of the data being received |
-| `source`: | The number of the rank sending the data |
-| `tag`: | A message tag (integer), which must either match the tag in the sent message, or if `MPI_ANY_TAG` is specified, a message with any tag will be accepted |
-| `communicator`: | The communicator (we have used `MPI_COMM_WORLD` in earlier examples) |
-| `status`: | A pointer for writing the exit status of the MPI command, indicating whether the operation succeeded or failed |
+| | |
+|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
+| `*data`: | Pointer to where the received data should be written |
+| `count`: | Maximum number of elements to receive |
+| `datatype`: | The type of the data being received |
+| `source`: | The number of the rank sending the data |
+| `tag`: | A message tag, which must either match the tag in the sent message, or if `MPI_ANY_TAG` is specified, a message with any tag will be accepted |
+| `communicator`: | The communicator (we have used `MPI_COMM_WORLD` in earlier examples) |
+| `*status`: | A pointer for writing the exit status of the MPI command, indicating whether the operation succeeded or failed |

Continuing our example, to receive our message we could write:

```c
-char message[15];
+char message[15] = {0}; /* Initialise the buffer to zeros */
MPI_Status status;

MPI_Recv(message, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
+message[14] = '\0';
```
-
-Here, we create our buffer to receive the data, as well as a variable to hold the exit status of the receive operation.
-We then call `MPI_Recv()`, specifying our returned data buffer, the number of elements we will receive (14) which will be of type `MPI_CHAR` and sent by rank 0, with a message tag of 0.
-As with `MPI_Send()`, this call will block - in this case until the message is received and acknowledgement is sent to rank 0, at which case both ranks will proceed.
+Here, we create a buffer `message` to store the received data and initialise it to zeros (`{0}`) to prevent
+any garbage content. We then call `MPI_Recv()` to receive the message, specifying the source rank (`0`), the message tag (`0`),
+and the communicator (`MPI_COMM_WORLD`). The status object is passed to capture details about the received message, such as
+the actual source rank or tag, though it is not used in this example. To ensure safe string handling,
+we explicitly null-terminate the received message by setting `message[14] = '\0'`.

Let's put this together with what we've learned so far. 
Here's an example program that uses `MPI_Send()` and `MPI_Recv()` to send the string `"Hello World!"` from rank 0 to rank 1: @@ -143,14 +148,15 @@ int main(int argc, char** argv) { MPI_Comm_rank(MPI_COMM_WORLD,&rank); if( rank == 0 ){ - char *message = "Hello, world!\n"; + constant char *message = "Hello, world!\n"; MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); } if( rank == 1 ){ - char message[14]; + char message[15] = {0}; MPI_Status status; MPI_Recv(message, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status); + message[14] = `\0`; printf("%s",message); } @@ -158,6 +164,11 @@ int main(int argc, char** argv) { return MPI_Finalize(); } ``` +:::callout{variant='warning'} +When using `MPI_Recv()` to receive string data, ensure that your buffer is large enough to hold the message and +includes space for a null terminator. Explicitly initialising the buffer and adding the null terminator avoids +undefined behavior or garbage output. +::: ::::callout @@ -246,15 +257,18 @@ int main(int argc, char** argv) { if( my_pair < n_ranks ){ if( rank%2 == 0 ){ - char *message = "Hello, world!\n"; + const char *message = "Hello, world!\n"; + printf("Rank %d sending to Rank %d\n", rank, my_pair); + fflush(stdout); MPI_Send(message, 14, MPI_CHAR, my_pair, 0, MPI_COMM_WORLD); } if( rank%2 == 1 ){ - char message[14]; - MPI_Status status; + char message[15] = {0}; + MPI_Status status; MPI_Recv(message, 14, MPI_CHAR, my_pair, 0, MPI_COMM_WORLD, &status); - printf("%s",message); + printf("Rank %d received from Rank %d: %s\n", rank, my_pair, message); + fflush(stdout); } } @@ -277,7 +291,7 @@ Have rank 0 print each message. int main(int argc, char **argv) { int rank; - int message[30]; + char message[30]; // First call MPI_Init MPI_Init(&argc, &argv); @@ -293,6 +307,10 @@ int main(int argc, char **argv) { return MPI_Finalize(); } ``` +Note: In MPI programs, every rank runs the same code. To make ranks behave differently, you must +explicitly program that behavior based on their rank ID. For example: +- Use conditionals like `if (rank == 0)` to define specific actions for rank 0. +- All other ranks can perform different actions in an `else` block. ::::solution @@ -314,6 +332,8 @@ int main(int argc, char **argv) { char message[30]; sprintf(message, "Hello World, I'm rank %d\n", rank); + printf("Rank %d is sending a message to Rank 0.\n", rank); + fflush(stdout); MPI_Send(message, 30, MPI_CHAR, 0, 0, MPI_COMM_WORLD); } else { @@ -322,9 +342,8 @@ int main(int argc, char **argv) { for( int sender = 1; sender < n_ranks; sender++ ) { char message[30]; MPI_Status status; - MPI_Recv(message, 30, MPI_CHAR, sender, 0, MPI_COMM_WORLD, &status); - printf("%s",message); + printf("Rank 0 received a message from Rank %d: %s", sender, message); } } @@ -332,9 +351,17 @@ int main(int argc, char **argv) { return MPI_Finalize(); } ``` +Here rank 0 calls `MPI_Recv` to gather messages, while other ranks call `MPI_Send` to send their messages. +Without this differentiation, ranks will attempt the same actions, potentially causing errors or deadlocks. :::: ::::: +:::callout{variant='note'} +If you don't require the additional information provided by `MPI_Status`, such as source or tag, +you can use `MPI_STATUS_IGNORE` in `MPI_Recv` calls. This simplifies your code by removing the need to declare and manage an +`MPI_Status` object. This is particularly useful in straightforward message-passing scenarios. +::: + :::::challenge{id=blocking, title="Blocking"} Try the code below and see what happens. 
How would you change the code to fix the problem? @@ -441,19 +468,27 @@ int main(int argc, char** argv) { { // Receive the ball MPI_Recv(&ball, 1, MPI_INT, neighbour, 0, MPI_COMM_WORLD, &status); - + + // Increment the counter and send the ball back counter += 1; MPI_Send(&ball, 1, MPI_INT, neighbour, 0, MPI_COMM_WORLD); + + // Log progress every 100,000 iterations + if (counter % 100000 == 0) { + printf("Rank %d exchanged the ball %d times\n", rank, counter); + fflush(stdout); + } + // Check if the rank is bored bored = counter >= max_count; } - printf("rank %d is bored and giving up \n", rank); - + printf("Rank %d is bored and giving up after %d exchanges\n", rank, counter); + fflush(stdout); // Call finalise at the end return MPI_Finalize(); } ``` :::: -::::: +::::: \ No newline at end of file From 22960ce99e1ff3225100a45b5537da2898b8509e Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 13:59:02 +0000 Subject: [PATCH 13/34] Episode 5 updates: Corrected typos, clarified table terminology, addressed rank-specific behaviours, added link for derived types episode, and explained matching in collective comms --- .../hpc_mpi/05_collective_communication.md | 151 ++++++++++-------- 1 file changed, 88 insertions(+), 63 deletions(-) diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index 77b6cb3c..b70bce81 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -13,7 +13,11 @@ learningOutcomes: - Learn how to use collective communication functions. --- -The previous episode showed how to send data from one rank to another using point-to-point communication. If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple ranks, we have to manually loop over each rank to communicate the data. This type of communication, where multiple ranks talk to one another known as called collective communication. In the code example below, point-to-point communication is used to calculate the sum of the rank numbers - feel free to try it out! +The previous episode showed how to send data from one rank to another using point-to-point communication. +If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple +ranks, we have to manually loop over each rank to communicate the data. This type of communication, where multiple ranks talk to +one another known as called collective communication. In the code example below, point-to-point communication is used to +calculate the sum of the rank numbers - feel free to try it out! ```c #include @@ -57,14 +61,21 @@ int main(int argc, char **argv) { } ``` -For it's use case, the code above works perfectly fine. However, it isn't very efficient when you need to communicate large amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). It's also a lot of code to do not much, which makes it easy to introduce mistakes in our code. A common mistake in this example would be to start the loop over ranks from 0, which would cause a deadlock! It's actually quite a common mistake for new MPI users to write something like the above. +For its use case, the code above works perfectly fine. 
However, it isn't very efficient when you need to communicate large +amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). It's also a lot of code +to do not much, which makes it easy to introduce mistakes in our code. A common mistake in this example would be to start the +loop over ranks from 0, which would cause a deadlock! It's actually quite a common mistake for new MPI users to write something +like the above. -We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has access to collective communication functions to abstract all of this code for us. The above code can be replaced by a single collective communication function. Collection operations are also implemented far more efficiently in the MPI library than we could ever write using point-to-point communications. +We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has +access to collective communication functions to abstract all of this code for us. The above code can be replaced by a single +collective communication function. Collection operations are also implemented far more efficiently in the MPI library than we +could ever write using point-to-point communications. There are several collective operations that are implemented in the MPI standard. The most commonly-used are: | Type | Description | -| --------------- | -------------------------------------------------------------------- | +|-----------------|----------------------------------------------------------------------| | Synchronisation | Wait until all processes have reached the same point in the program. | | One-To-All | One rank sends the same message to all other ranks. | | All-to-One | All ranks send data to a single rank. | @@ -81,13 +92,15 @@ int MPI_Barrier( MPI_Comm comm ); ``` -| | | -|------ | ------- | +| | | +|-------|--------------------------------------| | comm: | The communicator to add a barrier to | -When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. +When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. +We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. -In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. However, one usecase is when multiple ranks need to write *sequentially* to the same file. The code example below shows how you may handle this by using a barrier. 
+In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, +where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. However, one use case is when multiple ranks need to write *sequentially* to the same file. The code example below shows how you may handle this by using a barrier. ```c for (int i = 0; i < num_ranks; ++i) { @@ -102,7 +115,9 @@ for (int i = 0; i < num_ranks; ++i) { ### Broadcast -We'll often find that we need to data from one rank to all the other ranks. One approach, which is not very efficient, is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. A far more efficient approach is to use the collective function `MPI_Bcast()` to *broadcast* the data from a root rank to every other rank. +We'll often find that we need to data from one rank to all the other ranks. One approach, which is not very efficient, +is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. A far more efficient approach is to use the +collective function `MPI_Bcast()` to *broadcast* the data from a root rank to every other rank. The `MPI_Bcast()` function has the following arguments, ```c @@ -114,23 +129,29 @@ int MPI_Bcast( MPI_Comm comm ); ``` -| | | -| ---- | ---- | -| `*data`: | The data to be sent to all ranks | -| `count`: | The number of elements of data | -| `datatype`: | The datatype of the data | -| `root`: | The rank which data will be sent from | -| `comm:` | The communicator containing the ranks to broadcast to | +| | | +|-------------|-------------------------------------------------------| +| `*data`: | The data to be sent to all ranks | +| `count`: | The number of elements of data | +| `datatype`: | The datatype of the data | +| `root`: | The rank which data will be sent from | +| `comm`: | The communicator containing the ranks to broadcast to | `MPI_Bcast()` is similar to the `MPI_Send()` function. -The main functional difference is that `MPI_Bcast()` sends the data to all ranks (other than itself, where the data already is) instead of a single rank, as shown in the diagram below. +The main functional difference is that `MPI_Bcast()` sends the data to all ranks (other than itself, where the data already is) +instead of a single rank, as shown in the diagram below. ![Each rank sending a piece of data to root rank](fig/broadcast.png) +Unlike `MPI_Send()` and `MPI_Recv()`, collective operations like `MPI_Bcast()` do not require explicitly matching sends and +receives in the user code. The internal implementation of collective functions ensures that all ranks correctly send and +receive data as needed, abstracting this complexity from the programmer. This makes collective operations easier to use and less error-prone compared to point-to-point communication. + + There are lots of use cases for broadcasting data. One common case is when data is sent back to a "root" rank to process, which then broadcasts the results back out to all the other ranks. -Another example, shown in the code exerpt below, is to read data in on the root rank and to broadcast it out. -This is useful pattern on some systems where there are not enough resources (filesystem bandwidth, limited concurrent I/O operations) for every ranks to read the file at once. 
+Another example, shown in the code excerpt below, is to read data in on the root rank and to broadcast it out. +This is useful pattern on some systems where there are not enough resources (filesystem bandwidth, limited concurrent I/O operations) for every rank to read the file at once. ```c int data_from_file[NUM_POINTS] @@ -184,7 +205,7 @@ int main(int argc, char **argv) { One way to parallelise processing amount of data is to have ranks process a subset of the data. One method for distributing the data to each rank is to have a root rank which prepares the data, and then send the data to every rank. The communication could be done *manually* using point-to-point communication, but it's easier, and faster, to use a single collective communication. -We can use `MPI_Scatter()` to split the data into *equal* sized chunks and communicate a diferent chunk to each rank, as shown in the diagram below. +We can use `MPI_Scatter()` to split the data into *equal* sized chunks and communicate a different chunk to each rank, as shown in the diagram below. ![Each rank sending a piece of data to root rank](fig/scatter.png) @@ -203,20 +224,23 @@ int MPI_Scatter( ); ``` -| | | -| --- | --- | -| `*sendbuff`: | The data to be scattered across ranks (only important for the root rank) | -| `sendcount`: | The number of elements of data to send to each root rank (only important for the root rank) | -| `sendtype`: | The data type of the data being sent (only important for the root rank) | -| `*recvbuffer`: | A buffer to receive data into, including the root rank | -| `recvcount`: | The number of elements of data to receive. Usually the same as `sendcount` | -| `recvtype`: | The data type of the data being received. Usually the same as `sendtype` | -| `root`: | The rank data is being scattered from | -| `comm`: | The communicator | +| | | +|----------------|----------------------------------------------------------------------------------------| +| `*sendbuff`: | The data to be scattered across ranks (only important for the root rank) | +| `sendcount`: | The number of elements of data to send to each rank (only important for the root rank) | +| `sendtype`: | The data type of the data being sent (only important for the root rank) | +| `*recvbuffer`: | A buffer to receive data into, including the root rank | +| `recvcount`: | The number of elements of data to receive. Usually the same as `sendcount` | +| `recvtype`: | The data type of the data being received. Usually the same as `sendtype` | +| `root`: | The rank data is being scattered from | +| `comm`: | The communicator | The data to be *scattered* is split into even chunks of size `sendcount`. If `sendcount` is 2 and `sendtype` is `MPI_INT`, then each rank will receive two integers. -The values for `recvcount` and `recvtype` are the same as `sendcount` and `sendtype`. +The values for `recvcount` and `recvtype` are the same as `sendcount` and `sendtype`. However, there are cases where `sendcount` +and `recvcount` might differ, such as when using derived types, which change how data is packed and unpacked during communication. +For more on derived types and their impact on collective operations, see the +[Derived Data Types](07-derived-data-types.md) episode. If the total amount of data is not evenly divisible by the number of processes, `MPI_Scatter()` will not work. In this case, we need to use [`MPI_Scatterv()`](https://www.open-mpi.org/doc/v4.0/man3/MPI_Scatterv.3.php) instead to specify the amount of data each rank will receive. 
The code example below shows `MPI_Scatter()` being used to send data which has been initialised only on the root rank. @@ -258,18 +282,18 @@ int MPI_Gather( MPI_Comm comm ); ``` -| | | -| --- | --- | -| `*sendbuff`: | The data to send to the root rank | -| `sendcount`: | The number of elements of data to send | -| `sendtype`: | The data type of the data being sent | -| `recvbuff`: | The buffer to put gathered data into (only important for the root rank) | -| `recvcount`: | The number of elements being received, usually the same as `sendcount` | -| `recvtype`: | The data type of the data being received, usually the same as `sendtype` | -| `root`: | The root rank, where data will be gathered to | -| `comm`: | The communicator | - -The receive buffer needs to be large enough to hold data data from all of the ranks. For example, if there are 4 ranks sending 10 integers, then `recvbuffer` needs to be able to store *at least* 40 integers. +| | | +|--------------|--------------------------------------------------------------------------| +| `*sendbuff`: | The data to send to the root rank | +| `sendcount`: | The number of elements of data to send | +| `sendtype`: | The data type of the data being sent | +| `*recvbuff`: | The buffer to put gathered data into (only important for the root rank) | +| `recvcount`: | The number of elements being received, usually the same as `sendcount` | +| `recvtype`: | The data type of the data being received, usually the same as `sendtype` | +| `root`: | The root rank, where data will be gathered to | +| `comm`: | The communicator | + +The receive buffer needs to be large enough to hold data from all of the ranks. For example, if there are 4 ranks sending 10 integers, then `recvbuffer` needs to be able to store *at least* 40 integers. We can think of `MPI_Gather()` as being the inverse of `MPI_Scatter()`. This is shown in the diagram below, where data from each rank on the left is sent to the root rank (rank 0) on the right. @@ -340,7 +364,7 @@ int main(int argc, char **argv) { ### Reduce -A reduction operation is one which takes a values across the ranks, and combines them into a single value. +A reduction operation is one which takes values across the ranks, and combines them into a single value. Reductions are probably the most common collective operation you will use. The example at the beginning of this episode was a reduction operation, summing up a bunch of numbers, implemented using point-to-point communication. 
Reduction operations can be done using the collection function `MPI_Reduce()`, which has the following arguments: @@ -357,20 +381,20 @@ int MPI_Reduce( ); ``` -| | | -| --- | --- | -| `*sendbuff`: | The data to be reduced by the root rank | -| `*recvbuffer`: | A buffer to contain the reduction output | -| `count`: | The number of elements of data to be reduced | -| `datatype`: | The data type of the data | -| `op`: | The reduction operation to perform | -| `root`: | The root rank, which will perform the reduction | -| `comm`: | The communicator | +| | | +|----------------|-------------------------------------------------| +| `*sendbuff`: | The data to be reduced by the root rank | +| `*recvbuffer`: | A buffer to contain the reduction output | +| `count`: | The number of elements of data to be reduced | +| `datatype`: | The data type of the data | +| `op`: | The reduction operation to perform | +| `root`: | The root rank, which will perform the reduction | +| `comm`: | The communicator | The `op` argument controls which reduction operation is carried out, from the following possible operations: | Operation | Description | -| ------------ | -------------------------------------------------------------------------------- | +|--------------|----------------------------------------------------------------------------------| | `MPI_SUM` | Calculate the sum of numbers sent by each rank. | | `MPI_MAX` | Return the maximum value of numbers sent by each rank. | | `MPI_MIN` | Return the minimum of numbers sent by each rank. | @@ -408,18 +432,19 @@ int MPI_Allreduce( MPI_Comm comm  ); ``` -| | | -| --- | --- | -| `*sendbuff`: | The data to be reduced, on all ranks | +| | | +|----------------|--------------------------------------------------| +| `*sendbuff`: | The data to be reduced, on all ranks | | `*recvbuffer`: | A buffer which will contain the reduction output | -| `count`: | The number of elements of data to be reduced | -| `datatype`: | The data type of the data | -| `op`: | The reduction operation to use | -| `comm`: | The communicator | +| `count`: | The number of elements of data to be reduced | +| `datatype`: | The data type of the data | +| `op`: | The reduction operation to use | +| `comm`: | The communicator | ![Each rank sending a piece of data to root rank](fig/allreduce.png) -`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. This means we can remove the `MPI_Bcast()` in the previous code example and remove almost all of the code in the reduction example using point-to-point communication at the beginning of the episode. This is shown in the following code example: +`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. +This means we can remove the `MPI_Bcast()` in the previous code example and remove almost all of the code in the reduction example using point-to-point communication at the beginning of the episode. This is shown in the following code example: ```c int sum; @@ -557,4 +582,4 @@ double find_maximum(double *vector, int N) { The collective functions introduced in this episode do not represent an exhaustive list of *all* collective operations in MPI. There are a number which are not covered, as their usage is not as common. You can usually find a list of the collective functions available for the implementation of MPI you choose to use, e.g. 
[Microsoft MPI documentation](https://learn.microsoft.com/en-us/message-passing-interface/mpi-collective-functions). -:::: +:::: \ No newline at end of file From e8128e03750da669146bac9522829383882151c0 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 2 Dec 2024 14:21:12 +0000 Subject: [PATCH 14/34] Fixed typos for Episode 2-11 --- .../hpc_mpi/02_mpi_api.md | 16 ++-- .../hpc_mpi/06_non_blocking_communication.md | 86 ++++++++++--------- .../hpc_mpi/07-derived-data-types.md | 17 ++-- .../hpc_mpi/08_porting_serial_to_mpi.md | 10 ++- .../hpc_mpi/09_optimising_mpi.md | 18 ++-- .../hpc_mpi/10_communication_patterns.md | 39 +++++---- .../hpc_mpi/11_advanced_communication.md | 79 ++++++++--------- 7 files changed, 139 insertions(+), 126 deletions(-) diff --git a/high_performance_computing/hpc_mpi/02_mpi_api.md b/high_performance_computing/hpc_mpi/02_mpi_api.md index 561dde80..21c8c888 100644 --- a/high_performance_computing/hpc_mpi/02_mpi_api.md +++ b/high_performance_computing/hpc_mpi/02_mpi_api.md @@ -29,7 +29,7 @@ Since its inception, MPI has undergone several revisions, each introducing new f - **MPI-1 (1994):** The initial release of the MPI standard provided a common set of functions, datatypes, and communication semantics. It formed the foundation for parallel programming using MPI. -- **MPI-2 (1997):** This version expanded upon MPI-1 by introducing additional features such as dynamic process management, one-sided communication, paralell I/O, C++ and Fortran 90 bindings. +- **MPI-2 (1997):** This version expanded upon MPI-1 by introducing additional features such as dynamic process management, one-sided communication, parallel I/O, C++ and Fortran 90 bindings. MPI-2 improved the flexibility and capabilities of MPI programs. - **MPI-3 (2012):** MPI-3 brought significant enhancements to the MPI standard, including support for non-blocking collectives, improved multithreading, and performance optimisations. It also addressed limitations from previous versions and introduced fully compliant Fortran 2008 bindings. @@ -98,7 +98,7 @@ If you are not sure which implementation/version of MPI you should use on a part ## MPI Elsewhere -This episode assumes you will be using a HPC cluster, but you can also install OpenMPI on a desktop or laptop: +This episode assumes you will be using an HPC cluster, but you can also install OpenMPI on a desktop or laptop: - **Linux:** Most distributions have OpenMPI available in their package manager: @@ -150,7 +150,7 @@ int main (int argc, char *argv[]) { Although the code is not an MPI program, we can use the command `mpicc` to compile it. The `mpicc` command is essentially a wrapper around the underlying C compiler, such as **gcc**, providing additional functionality for compiling MPI programs. It simplifies the compilation process by incorporating MPI-specific configurations and automatically linking the necessary MPI libraries and header files. -Therefore the below command generates an executable file named **hello_world** . +Therefore, the below command generates an executable file named **hello_world** . ```bash mpicc -o hello_world hello_world.c @@ -208,11 +208,12 @@ As we've just learned, running a program with `mpiexec` or `mpirun` results in t mpirun -n 4 ./hello_world ``` -However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. 
For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. This usually also requires knowing the total number of tasks running at the same time. +However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. +For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. This usually also requires knowing the total number of tasks running at the same time. - The program needs to call the `MPI_Init()` function. - `MPI_Init()` sets up the environment for MPI, and assigns a number (called the *rank*) to each process. -- At the end, each process should also cleanup by calling `MPI_Finalize()`. +- At the end, each process should also clean up by calling `MPI_Finalize()`. ```c int MPI_Init(&argc, &argv); @@ -381,5 +382,6 @@ For this we would need ways for ranks to communicate - the primary benefit of MP ## What About Python? -In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), the initialisation and finalisation of MPI are handled by the library, and the user can perform MPI calls after ``from mpi4py import MPI``. -:::: +In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), +the initialisation and finalisation of MPI are handled by the library, and the user can perform MPI calls after ``from mpi4py import MPI``. +:::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md index 58bf2d90..a2a7bce9 100644 --- a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md +++ b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md @@ -27,7 +27,7 @@ Non-blocking communication is communication which happens in the background. So Just as with buffered, synchronous, ready and standard sends, MPI has to be programmed to use either blocking or non-blocking communication. For almost every blocking function, there is a non-blocking equivalent. They have the same name as their blocking counterpart, but prefixed with "I". The "I" stands for "immediate", indicating that the function returns immediately and does not block the program. The table below shows some examples of blocking functions and their non-blocking counterparts. | Blocking | Non-blocking | -| --------------- | ---------------- | +|-----------------|------------------| | `MPI_Bsend()` | `MPI_Ibsend()` | | `MPI_Barrier()` | `MPI_Ibarrier()` | | `MPI_Reduce()` | `MPI_Ireduce()` | @@ -36,7 +36,10 @@ But, this isn't the complete picture. As we'll see later, we need to do some add non-blocking communications. :::: -By effectively utilizing non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, to ensure ranks are synchronised. 
If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. Therefore unless our code is structured to take advantage of being able to overlap communication with computation, non-blocking communication adds complexity to our code for no gain. +By effectively utilising non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. +Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, +to ensure ranks are synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication, and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear-cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. +Therefore, unless our code is structured to take advantage of being able to overlap communication with computation, non-blocking communication adds complexity to our code for no gain. ![Non-blocking communication with data dependency](fig/non-blocking-wait-data.png) @@ -56,7 +59,7 @@ On the other hand, some disadvantages are: - It is more difficult to use non-blocking communication. Not only does it result in more, and more complex, lines of code, we also have to worry about rank synchronisation and data dependency. -- Whilst typically using non-blocking communication, where appropriate, improves performance, it's not always clear cut or predictable if non-blocking will result in sufficient performance gains to justify the increased complexity. +- Whilst typically using non-blocking communication, where appropriate, improves performance, it's not always clear-cut or predictable if non-blocking will result in sufficient performance gains to justify the increased complexity. ::: :::: @@ -77,7 +80,7 @@ int MPI_Isend( ); ``` -| | | +| | | |-------------|-----------------------------------------------------| | `*buf`: | The data to be sent | | `count`: | The number of elements of data | @@ -88,7 +91,7 @@ int MPI_Isend( | `*request`: | The request handle, used to track the communication | The arguments are identical to `MPI_Send()`, other than the addition of the `*request` argument. 
-This argument is known as an *handle* (because it "handles" a communication request) which is used to track the progress of a (non-blocking) communication. +This argument is known as a *handle* (because it "handles" a communication request) which is used to track the progress of a (non-blocking) communication. When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise the program and make sure `*buf` is ready to be re-used. This is incredibly important to do. @@ -98,8 +101,9 @@ Suppose we are sending an array of integers, MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); some_ints[1] = 5; /* !!! don't do this !!! */ ``` -Modifying `some_ints` before the send has completed is undefined behaviour, and can result in breaking results! For -example, if `MPI_Isend()` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to the send buffer will means the wrong data is sent. Every non-blocking communication has to have a corresponding `MPI_Wait()`, to wait and synchronise the program to ensure that the data being sent is ready to be modified again. `MPI_Wait()` is a blocking function which will only return when a communication has finished. +Modifying `some_ints` before the **send** has completed is undefined behaviour, and can result in breaking results! For +example, if `MPI_Isend()` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to +the send buffer means the wrong data is sent. Every non-blocking communication has to have a corresponding `MPI_Wait()`, to wait and synchronise the program to ensure that the data being sent is ready to be modified again. `MPI_Wait()` is a blocking function which will only return when a communication has finished. ```c int MPI_Wait( @@ -107,10 +111,10 @@ int MPI_Wait( MPI_Status *status ); ``` -| | | -|-------------|----------------------| +| | | +|-------------|------------------------------------------| | `*request`: | The request handle for the communication | -| `*status`: | The status handle for the communication | +| `*status`: | The status handle for the communication | Once we have used `MPI_Wait()` and the communication has finished, we can safely modify `some_ints` again. To receive the data send using a non-blocking send, we can use either the blocking `MPI_Recv()` or it's non-blocking variant. @@ -127,15 +131,15 @@ int MPI_Irecv( ); ``` -| | | -|-------------|----------------------| -| `*buf`: | A buffer to receive data into | -| `count`: | The number of elements of data to receive | -| `datatype`: | The data type of the data | -| `source`: | The rank to receive data from | -| `tag`: | The communication tag | -| `comm`: | The communicator | -| `*request`: | The request handle for the receive | +| | | +|-------------|-------------------------------------------| +| `*buf`: | A buffer to receive data into | +| `count`: | The number of elements of data to receive | +| `datatype`: | The data type of the data | +| `source`: | The rank to receive data from | +| `tag`: | The communication tag | +| `comm`: | The communicator | +| `*request`: | The request handle for the receive | :::::challenge{id=true-or-false, title="True or False?"} @@ -167,7 +171,7 @@ if (my_rank == 0) { } ``` -The code above is functionally identical to blocking communication, because of `MPI_Wait()` is blocking. The program will not continue until `MPI_Wait()` returns. 
Since there is no additional work between the communication call and blocking wait, this is a poor example of how non-blocking communication should be used. It doesn't take advantage of the asynchronous nature of non-blocking communication at all. To really make use of non-blocking communication, we need to interleave computation (or any busy work we need to do) with communication, such as as in the next example. +The code above is functionally identical to blocking communication, because of `MPI_Wait()` is blocking. The program will not continue until `MPI_Wait()` returns. Since there is no additional work between the communication call and blocking wait, this is a poor example of how non-blocking communication should be used. It doesn't take advantage of the asynchronous nature of non-blocking communication at all. To really make use of non-blocking communication, we need to interleave computation (or any busy work we need to do) with communication, such as in the next example. ```c MPI_Status status; @@ -232,8 +236,8 @@ MPI_Waitall(2, requests, statuses); // Wait for both requests in one function ``` This version of the code will not deadlock, because the non-blocking functions return immediately. So even though rank -0 and 1 one both send, meaning there is no corresponding receive, the immediate return from send means the -receive function is still called. Thus a deadlock cannot happen. +0 and 1 one both send, meaning there is no corresponding **receive**, the immediate return from send means the +receive function is still called. Thus, a deadlock cannot happen. However, it is still possible to create a deadlock using `MPI_Wait()`. If `MPI_Wait()` is waiting to for `MPI_Irecv()` to get some data, but there is no matching send operation (so no data has been sent), then `MPI_Wait()` can never return resulting in a deadlock. In the example code below, rank 0 becomes deadlocked. @@ -265,11 +269,11 @@ int MPI_Test( MPI_Status *status, ); ``` -| | | -|-------------|------------------------------------------| -| `*request`: | The request handle for the communication | -| `*flag`: | A flag to indicate if the communication has completed | -| `*status`: | The status handle for the communication | +| | | +|-------------|-------------------------------------------------------| +| `*request`: | The request handle for the communication | +| `*flag`: | A flag to indicate if the communication has completed | +| `*status`: | The status handle for the communication | `*request` and `*status` are the same you'd use for `MPI_Wait()`. `*flag` is the variable which is modified to indicate if the communication has finished or not. Since it's an integer, if the communication hasn't finished then `flag == 0`. @@ -334,7 +338,7 @@ if (elapsed_time >= COMM_TIMEOUT) { Something like this would effectively remove deadlocks in our program, and allows us to take appropriate actions to recover the program back into a predictable state. In reality, however, it would be hard to find a useful and appropriate use case for this in scientific computing. -In any case, though, it demonstrate the power and flexibility offered by non-blocking communication. +In any case, though, it demonstrates the power and flexibility offered by non-blocking communication. 
:::: :::::challenge{id=try-it-yourself, title="Try it yourself"} @@ -342,7 +346,7 @@ In the MPI program below, a chain of ranks has been set up so one rank will rece ![A chain of ranks](fig/rank_chain.png) -For following skeleton below, use non-blocking communication to send `send_message` to the right right and receive a message from the left rank. Create two programs, one using `MPI_Wait()` and the other using `MPI_Test()`. +For following skeleton below, use non-blocking communication to send `send_message` to the right and receive a message from the left rank. Create two programs, one using `MPI_Wait()` and the other using `MPI_Test()`. ```c #include @@ -463,16 +467,16 @@ int MPI_Ireduce( ); ``` -| | | -|-------------|------------------------------------------| -| `*sendbuf`: | The data to be reduced by the root rank | -| `*recvbuf`: | A buffer to contain the reduction output | -| `count`: | The number of elements of data to be reduced | -| `datatype`: | The data type of the data | -| `op`: | The reduction operation to perform | -| `root`: | The root rank, which will perform the reduction | -| `comm`: | The communicator | -| `*request`: | The request handle for the communicator | +| | | +|-------------|-------------------------------------------------| +| `*sendbuf`: | The data to be reduced by the root rank | +| `*recvbuf`: | A buffer to contain the reduction output | +| `count`: | The number of elements of data to be reduced | +| `datatype`: | The data type of the data | +| `op`: | The reduction operation to perform | +| `root`: | The root rank, which will perform the reduction | +| `comm`: | The communicator | +| `*request`: | The request handle for the communicator | As with `MPI_Send()` vs. `MPI_Isend()` the only change in using the non-blocking variant of `MPI_Reduce()` is the addition of the `*request` argument, which returns a request handle. This is the request handle we'll use with either `MPI_Wait()` or `MPI_Test()` to ensure that the communication has finished, and been successful. @@ -503,7 +507,7 @@ When a rank reaches a non-blocking barrier, `MPI_Ibarrier()` will return immedia Non-blocking barriers can be used to help hide/reduce synchronisation overhead. We may want to add a synchronisation point in our program so the ranks start some work all at the same time. With a blocking barrier, the ranks have to wait for every rank to reach the barrier, and can't do anything other than wait. -But with a non-blocking barrier, we can overlap the barrier operation with other, ndependent, work whilst ranks wait for the other ranks to catch up. +But with a non-blocking barrier, we can overlap the barrier operation with other, independent, work whilst ranks wait for the other ranks to catch up. :::: ::::: @@ -592,4 +596,4 @@ int main(int argc, char **argv) ``` :::: -::::: +::::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/07-derived-data-types.md b/high_performance_computing/hpc_mpi/07-derived-data-types.md index 9ff111aa..b5e907b4 100644 --- a/high_performance_computing/hpc_mpi/07-derived-data-types.md +++ b/high_performance_computing/hpc_mpi/07-derived-data-types.md @@ -21,7 +21,8 @@ To help with this, MPI provides an interface to create new types known as derive ## Size limitations for messages -All throughout MPI, the argument which says how many elements of data are being communicated is an integer: int count. 
In most 64-bit Linux systems, int's are usually 32-bit and so the biggest number you can pass to count is 2^31 - 1 = 2,147,483,647, which is about 2 billion. Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In older MPI versions, there are two workarounds to this limitation. The first is to communicate large arrays in smaller, more manageable chunks. The other is to use derived types, to re-shape the data. +All throughout MPI, the argument which says how many elements of data are being communicated is an integer: int count. +In most 64-bit Linux systems, ints are usually 32-bit and so the biggest number you can pass to count is 2^31 - 1 = 2,147,483,647, which is about 2 billion. Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In older MPI versions, there are two workarounds to this limitation. The first is to communicate large arrays in smaller, more manageable chunks. The other is to use derived types, to re-shape the data. :::: Almost all scientific and computing problems nowadays require us to think in more than one dimension. Using @@ -130,13 +131,13 @@ int MPI_Type_vector( MPI_Datatype *newtype ); ``` -| | | -| --- | --- | -| `count`: | The number of "blocks" which make up the vector | -| `blocklength`: | The number of contiguous elements in a block | -| `stride`: | The number of elements between the start of each block | -| `oldtype`: | The data type of the elements of the vector, e.g. MPI_INT, MPI_FLOAT | -| `newtype`: | The newly created data type to represent the vector | +| | | +|----------------|----------------------------------------------------------------------| +| `count`: | The number of "blocks" which make up the vector | +| `blocklength`: | The number of contiguous elements in a block | +| `stride`: | The number of elements between the start of each block | +| `oldtype`: | The data type of the elements of the vector, e.g. MPI_INT, MPI_FLOAT | +| `newtype`: | The newly created data type to represent the vector | To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix with a row in between (rows 2 and 4), diff --git a/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md index 42f924aa..2dc1075a 100644 --- a/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md +++ b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md @@ -221,7 +221,7 @@ If the serial parts are a significant part of the algorithm, it may not be possi Examine the code and try to identify any serial regions that can't (or shouldn't) be parallelised. ::::solution -There aren't any large or time consuming serial regions, which is good from a parallelism perspective. +There aren't any large or time-consuming serial regions, which is good from a parallelism perspective. However, there are a couple of small regions that are not amenable to running in parallel: - Setting the `10.0` initial temperature condition at the stick 'starting' boundary. 
We only need to set this once at the beginning of the stick, and not at the boundary of every section of the stick @@ -289,7 +289,8 @@ Then at the very end of `main()` let's complete our use of MPI: ### `main()`: Initialising the Simulation and Printing the Result -Since we're not initialising for the entire stick (`GRIDSIZE`) but just the section apportioned to our rank (`rank_gridsize`), we need to amend the loop that initialises `u` and `rho` accordingly, to: +Since we're not initialising the entire stick (`GRIDSIZE`), but only the section apportioned to our rank (`rank_gridsize`), +we need to adjust the loop that initialises `u` and `rho` accordingly. The revised loop as follows: ```c // Initialise the u and rho field to 0 @@ -375,7 +376,8 @@ Insert the following into the `poisson_step()` function, putting the declaration MPI_Allreduce(&unorm, &global_unorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); ``` -So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: +So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value +global_unorm`. We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: ```c return global_unorm; @@ -511,4 +513,4 @@ At initialisation, instead of setting it to zero we could do: ``` :::: -::::: +::::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/09_optimising_mpi.md b/high_performance_computing/hpc_mpi/09_optimising_mpi.md index 4802af2e..3fbaec79 100644 --- a/high_performance_computing/hpc_mpi/09_optimising_mpi.md +++ b/high_performance_computing/hpc_mpi/09_optimising_mpi.md @@ -28,7 +28,8 @@ Therefore, it's really helpful to understand how well our code *scales* in perfo ## Prerequisite: [Intro to High Performance Computing](../hpc_intro/01_hpc_intro) -Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools that are only available on a HPC cluster. +Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools that are only available on +an HPC cluster. :::: ## Characterising the Scalability of Code @@ -48,8 +49,9 @@ Ideally, we would like software to have a linear speedup that is equal to the nu (speedup = N), as that would mean that every processor would be contributing 100% of its computational power. Unfortunately, this is a very challenging goal for real applications to attain, since there is always an overhead to making parallel use of greater resources. -In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation routines, I/O operations and inter-communication) which cannot be parallelised. -This limits how much a program can be speeded up, +In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation routines, I/O operations and +inter-communication) which cannot be parallelised. +This limits how much a program can be sped up, as the program will always take at least the length of the serial portion. 
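In practice, characterising speedup just means timing the same problem at increasing numbers of ranks. The sketch below shows one minimal way such timings might be collected (the hypothetical `compute()` function simply stands in for the real workload): `MPI_Wtime()` provides a portable wall-clock timer, and reducing the elapsed times with `MPI_MAX` reports the slowest rank, which is what determines the overall runtime.

```c
#include <stdio.h>
#include <mpi.h>

/* Placeholder standing in for the region of code being timed */
double compute(void)
{
    double sum = 0.0;
    for (int i = 0; i < 100000000; ++i) {
        sum += (double) i * 1.0e-9;
    }
    return sum;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    double result = compute();
    double elapsed = MPI_Wtime() - start;

    /* The slowest rank determines the overall runtime, so report the maximum */
    double slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Result %f computed in %f seconds\n", result, slowest);
    }

    MPI_Finalize();
    return 0;
}
```

Taking the maximum, rather than the time on rank 0 alone, avoids under-reporting the runtime when the work is not evenly balanced across ranks.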
### Amdahl's Law and Strong Scaling @@ -74,7 +76,7 @@ Amdahl’s law states that, for a fixed problem, the upper limit of speedup is d ## Amdahl's Law in Practice Consider a program that takes 20 hours to run using one core. -If a particular part of the rogram, which takes one hour to execute, cannot be parallelised (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelised (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelised execution of this program, the minimum execution time cannot be less than that critical one hour. +If a particular part of the program, which takes one hour to execute, cannot be parallelised (s = 1/20 = 0.05), and if the code that takes up the remaining 19 hours of execution time can be parallelised (p = 1 − s = 0.95), then regardless of how many processors are devoted to a parallelised execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (when N = ∞, speedup = 1/s = 20). :::: @@ -130,7 +132,7 @@ The figure below shows an example of the scaling with `GRIDSIZE=512` and `GRIDSI ![Figure showing the result described above for `GRIDSIZE=512` and `GRIDSIZE=2048`](fig/poisson_scaling_plot.png) In the example, which runs on a machine with two 20-core Intel Xeon Scalable CPUs, using 32 ranks actually takes more time. -The 32 ranks don't fit on one CPU and communicating between the the two CPUs takes more time, even though they are in the same machine. +The 32 ranks don't fit on one CPU and communicating between the two CPUs takes more time, even though they are in the same machine. The communication could be made more efficient. We could use non-blocking communication and do some of the computation while communication is happening. @@ -241,7 +243,7 @@ For more information on ARM Forge see the [product website](https://www.arm.com/ ## Software Availability -The ARM Forge suite of tools are licensed, and so may or may not be available on your HPC cluster (and certainly won't be on your laptop or desktop unless you buy a license and build them yourself!). +The ARM Forge suite of tools is licensed, and so may or may not be available on your HPC cluster (and certainly won't be on your laptop or desktop unless you buy a license and build them yourself!). If you don't have access to the ARM Forge tools, your local HPC cluster should have an alternative installed with similar functionality. :::: @@ -282,7 +284,7 @@ terminal and one `.html` file which can be opened in a browser cat poisson_mpi_4p_1n_2024-01-30_15-38.txt ``` -```text +``` Command: mpirun -n 4 poisson_mpi Resources: 1 node (28 physical, 56 logical cores per node) Memory: 503 GiB per node @@ -341,4 +343,4 @@ A general workflow for optimising a code, whether parallel or serial, is as foll Then we can repeat this process as needed. But note that eventually this process will yield diminishing returns, and over-optimisation is a trap to avoid - hence the importance of continuing to measure efficiency as you progress. 
-:::: +:::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/10_communication_patterns.md b/high_performance_computing/hpc_mpi/10_communication_patterns.md index d35d4bcf..f1e2be2e 100644 --- a/high_performance_computing/hpc_mpi/10_communication_patterns.md +++ b/high_performance_computing/hpc_mpi/10_communication_patterns.md @@ -39,7 +39,7 @@ Since scatter and gather communications are collective, the communication time r The amount of messages that needs to be sent increases logarithmically with the number of ranks. The most efficient implementation of scatter and gather communication are to use the collective functions (`MPI_Scatter()` and `MPI_Gather()`) in the MPI library. -One method for parallelising matrix multiplication is with a scatter and gather ommunication. +One method for parallelising matrix multiplication is with a scatter and gather communication. To multiply two matrices, we follow the following equation, $$ \left[ \begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22}\end{array} \right] \cdot \left[ \begin{array}{cc}B_{11} & B_{12} \\ B_{21} & B_{22}\end{array} \right] = \left[ \begin{array}{cc}A_{11} \cdot B_{11} + A_{12} \cdot B_{21} & A_{11} \cdot B_{12} + A_{12} \cdot B_{22} \\ A_{21} \cdot B_{11} + A_{22} \cdot B_{21} & A_{21} \cdot B_{12} + A_{22} \cdot B_{22}\end{array} \right]$$ @@ -86,10 +86,11 @@ It is again, like with scattering and gathering, best to use the reduction funct Given the fact that reductions fit in almost any algorithm or data access pattern, there are countless examples to show a reduction communication pattern. In the next code example, a Monte Carlo algorithm is implemented which estimates the value of $\pi$. -To do that, a billion random points are generated and checked whether they fall within or outside of a circle. -The ratio of points in and outside of the circle is propotional to the value of $\pi$. +To do this, a billion random points are generated and checked to see if they fall inside or outside a circle. +The ratio of points inside the circle to the total number of points is proportional to the value of $\pi$. -Since each point generated and its position within the circle is completely independent to the other points, the communication pattern is simple (this is also an example of an embarrassingly parallel problem) as we only need one reduction. +Since each point generated and its position within the circle is completely independent to the other points, +the communication pattern is simple (this is also an example of an embarrassingly parallel problem) as we only need one reduction. To parallelise the problem, each rank generates a sub-set of the total number of points and a reduction is done at the end, to calculate the total number of points within the circle from the entire sample. ```c @@ -138,7 +139,7 @@ Here is a (non-exhaustive) list of examples where reduction operations are usefu 1. Finding the maximum/minimum or average temperature in a simulation grid: by conducting a reduction across all the grid cells (which may have, for example, been scattered across ranks), you can easily find the maximum/minimum temperature in the grid, or sum up the temperatures to calculate the average temperature. 2. Combining results into a global state: in some simulations, such as climate modelling, the simulation is split into discrete time steps. - At the end of each time step, a reduction can be used to update the global state or combine together pieces of data (similar to a gather operation). 
+ At the end of each time step, a reduction can be used to update the global state or combine pieces of data (similar to a gather operation). 3. Large statistical models: in a large statistical model, the large amounts of data can be processed by splitting it across ranks and calculating statistical values for the sub-set of data. The final values are then calculated by using a reduction operation and re-normalizing the values appropriately. 4. Numerical integration: each rank will compute the area under the curve for its portion of the curve. @@ -297,20 +298,20 @@ int MPI_Sendrecv( ); ``` -| | | -| --- | --- | -| `*sendbuf`: | The data to be sent to `dest` | +| | | +|--------------|-----------------------------------------------------| +| `*sendbuf`: | The data to be sent to `dest` | | `sendcount`: | The number of elements of data to be sent to `dest` | -| `sendtype`: | The data type of the data to be sent to `dest` | -| `dest`: | The rank where data is being sent to | -| `sendtag`: | The communication tag for the send | -| `*recvbuf`: | A buffer for data being received | -| `recvcount`: | The number of elements of data to receive | -| `recvtype`: | The data type of the data being received | -| `source`: | The rank where data is coming from | -| `recvtag`: | The communication tag for the receive | -| `comm`: | The communicator | -| `*status`: | The status handle for the receive | +| `sendtype`: | The data type of the data to be sent to `dest` | +| `dest`: | The rank where data is being sent to | +| `sendtag`: | The communication tag for the send | +| `*recvbuf`: | A buffer for data being received | +| `recvcount`: | The number of elements of data to receive | +| `recvtype`: | The data type of the data being received | +| `source`: | The rank where data is coming from | +| `recvtag`: | The communication tag for the receive | +| `comm`: | The communicator | +| `*status`: | The status handle for the receive | :::: ```c @@ -361,4 +362,4 @@ To communicate the halos, we need to: To re-build the sub-domains into one domain, we can do the reverse of the hidden code exert of the function `scatter_sub_arrays_to_other ranks`. Instead of the root rank sending data, it instead receives data from the other ranks using the same `sub_array_t` derived type. 
:::: -::::: +::::: \ No newline at end of file diff --git a/high_performance_computing/hpc_mpi/11_advanced_communication.md b/high_performance_computing/hpc_mpi/11_advanced_communication.md index fb5533b4..5f4f8190 100644 --- a/high_performance_computing/hpc_mpi/11_advanced_communication.md +++ b/high_performance_computing/hpc_mpi/11_advanced_communication.md @@ -37,13 +37,13 @@ int MPI_Type_create_struct( ); ``` -| | | -| --- | --- | -| `count`: | The number of fields in the struct | -| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send()` | -| `*array_of_displacements`: | The relative positions of each field in bytes | -| `*array_of_types`: | The MPI type of each field | -| `*newtype`: | The newly created data type for the struct | +| | | +|----------------------------|----------------------------------------------------------------------------------| +| `count`: | The number of fields in the struct | +| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send()` | +| `*array_of_displacements`: | The relative positions of each field in bytes | +| `*array_of_types`: | The MPI type of each field | +| `*newtype`: | The newly created data type for the struct | The main difference between vector and struct derived types is that the arguments for structs expect arrays, since structs are made up of multiple variables. Most of these arguments are straightforward, given what we've just seen for defining vectors. But `array_of_displacements` is new and unique. @@ -52,7 +52,8 @@ When a struct is created, it occupies a single contiguous block of memory. But t ![Memory layout for a struct](fig/struct_memory_layout.png) -Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about is not contiguous any more. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practise, it serves a similar purpose of the stride in vectors. +Although the memory used for padding and the struct's data exists in a contiguous block, +the actual data we care about is no longer contiguous. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practise, it serves a similar purpose of the stride in vectors. To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. To do this, we can use the function `MPI_Get_address()`, @@ -63,10 +64,10 @@ int MPI_Get_address{ }; ``` -| | | -| --- | --- | -| `*location`: | A pointer to the variable we want the address of | -| `*address`: | The address of the variable, as an MPI_Aint (address integer) | +| | | +|--------------|---------------------------------------------------------------| +| `*location`: | A pointer to the variable we want the address of | +| `*address`: | The address of the variable, as an MPI_Aint (address integer) | In the following example, we use `MPI_Type_create_struct()` and `MPI_Get_address()` to create a derived type for a struct with two members, @@ -259,7 +260,7 @@ array using `MPI_Pack()`. ![Layout of packed memory](fig/packed_buffer_layout.png) -The coloured boxes in both memory representations (memory and pakced) are the same chunks of data. 
The green boxes +The coloured boxes in both memory representations (memory and packed) are the same chunks of data. The green boxes containing only a single number are used to document the number of elements in the block of elements they are adjacent to, in the contiguous buffer. This is optional to do, but is generally good practise to include to create a self-documenting message. From the diagram we can see that we have "packed" non-contiguous blocks of memory into a @@ -282,15 +283,15 @@ int MPI_Pack( ); ``` -| | | -| --- | --- | -| `*inbuf`: | The data to pack into the buffer | -| `incount`: | The number of elements to pack | -| `datatype`: | The data type of the data to pack | -| `*outbuf`: | The out buffer of contiguous data | -| `outsize`: | The size of the out buffer, in bytes | +| | | +|--------------|--------------------------------------------------------------------------------------------| +| `*inbuf`: | The data to pack into the buffer | +| `incount`: | The number of elements to pack | +| `datatype`: | The data type of the data to pack | +| `*outbuf`: | The out buffer of contiguous data | +| `outsize`: | The size of the out buffer, in bytes | | `*position`: | A counter for how far into the contiguous buffer to write (records the position, in bytes) | -| `comm`: | The communicator | +| `comm`: | The communicator | In the above, `inbuf` is the data we want to pack into a contiguous buffer and `incount` and `datatype` define the number of elements in and the datatype of `inbuf`. The parameter `outbuf` is the contiguous buffer the data is packed @@ -302,7 +303,7 @@ in number of elements. Given that `MPI_Pack()` is all about manually arranging d allocation of memory for `outbuf`. But how do we allocate memory for it, and how much should we allocate? Allocation is done by using `malloc()`. Since `MPI_Pack()` works with `outbuf` in terms of bytes, the convention is to declare `outbuf` as a `char *`. The amount of memory to allocate is simply the amount of space, in bytes, required to store all -of the data we want to pack into it. Just like how we would normally use `malloc()` to create an array. If we had +the data we want to pack into it. Just like how we would normally use `malloc()` to create an array. If we had an integer array and a floating point array which we wanted to pack into the buffer, then the size required is easy to calculate, @@ -327,12 +328,12 @@ int MPI_Pack_size( int *size ); ``` -| | | -| --- | --- | -| `incount`: | The number of data elements | -| `datatype`: | The data type of the data | -| `comm`: | The communicator | -| `*size`: | The calculated upper size limit for the buffer, in bytes | +| | | +|-------------|----------------------------------------------------------| +| `incount`: | The number of data elements | +| `datatype`: | The data type of the data | +| `comm`: | The communicator | +| `*size`: | The calculated upper size limit for the buffer, in bytes | `MPI_Pack_size()` is a helper function to calculate the *upper bound* of memory required. 
It is, in general, preferable to calculate the buffer size using this function, as it takes into account any implementation specific MPI detail and @@ -361,18 +362,18 @@ int MPI_Unpack( ); ``` -| | | -| --- | --- | -| `*inbuf`: | The contiguous buffer to unpack | -| `insize`: | The total size of the buffer, in bytes | -| `*position`: | The position, in bytes, from where to start unpacking from | -| `*outbuf`: | An array, or variable, to unpack data into -- this is the output | -| `outcount`: | The number of elements of data to unpack | -| `datatype`: | The data type of elements to unpack | -| `comm`: | The communicator | +| | | +|--------------|------------------------------------------------------------------| +| `*inbuf`: | The contiguous buffer to unpack | +| `insize`: | The total size of the buffer, in bytes | +| `*position`: | The position, in bytes, from where to start unpacking from | +| `*outbuf`: | An array, or variable, to unpack data into -- this is the output | +| `outcount`: | The number of elements of data to unpack | +| `datatype`: | The data type of elements to unpack | +| `comm`: | The communicator | The arguments for this function are essentially the reverse of `MPI_Pack()`. Instead of being the buffer to pack into, -`inbuf` is now the packed buffer and `position` is the position, in bytes, in the buffer where to unpacking from. +`inbuf` is now the packed buffer, and `position` specified the byte position in the buffer to start unpacking from. `outbuf` is then the variable we want to unpack into, and `outcount` is the number of elements of `datatype` to unpack. In the example below, `MPI_Pack()`, `MPI_Pack_size()` and `MPI_Unpack()` are used to communicate a (non-contiguous) @@ -444,7 +445,7 @@ It works just as well to communicate the buffer using non-blocking methods, as i In some cases, the receiving rank may not know the size of the buffer used in `MPI_Pack()`. This could happen if a message is sent and received in different functions, if some ranks have different branches through the program or if communication happens in a dynamic or non-sequential way. -In these situations, we can use `MPI_Probe()` and `MPI_Get_count` to find the a message being sent and to get the number of elements in the message. +In these situations, we can use `MPI_Probe()` and `MPI_Get_count` to find the message being sent and to get the number of elements in the message. ```c // First probe for a message, to get the status of it From fa013573bb77477ed13062924c773c7b569a60b0 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 9 Dec 2024 07:03:22 +0000 Subject: [PATCH 15/34] Updated barriers code snippet and revised Reporting Progress challenge --- .../hpc_openmp/04_synchronisation.md | 57 ++++++++++--------- 1 file changed, 31 insertions(+), 26 deletions(-) diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index c2475aae..6e8565c3 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -131,35 +131,38 @@ as the calculation depends on this data (See the full code [here](code/examples/ } ``` Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates -are finished before moving to the next step. 
For example, in the following snippet, a barrier makes sure that updates -to `new_matrix` are completed before it is copied into `old_matrix`: +are completed before moving to the next step. For example, in the following snippet, a barrier makes sure that updates +to `next_matrix` are finished by all threads before its values are copied back into +`current_matrix`. ```c ...... int main() { - double old_matrix[NX][NY]; - double new_matrix[NX][NY]; + double current_matrix[NX][NY]; + double next_matrix[NX][NY]; + + /* Initialise the matrix */ + initialise_matrix(current_matrix, NX); - #pragma omp parallel - { - for (int i = 0; i < NUM_ITERATIONS; ++i) { + for (int iter = 0; iter < NUM_ITERATIONS; ++iter) { + #pragma omp parallel + { int thread_id = omp_get_thread_num(); - iterate_matrix_solution(old_matrix, new_matrix, thread_id); /* Each thread updates a portion of the matrix */ + iterate_matrix_solution(current_matrix, next_matrix, thread_id, NX); /* Each thread updates its assigned row */ - #pragma omp barrier /* You may want to wait until new_matrix has been updated by all threads */ - - copy_matrix(new_matrix, old_matrix); + #pragma omp barrier /* Wait for every thread to finish updating next_matrix */ } + + copy_matrix(next_matrix, current_matrix, NX); /* Copy next_matrix to current_matrix */ } } - ``` :::callout{variant='note'} OpenMP does not allow barriers to be placed directly inside `#pragma omp parallel for` loops due to restrictions on closely [nested regions](https://www.openmp.org/spec-html/5.2/openmpse101.html#x258-27100017.1). To coordinate threads -effectively in iterative tasks like this, the loop has been rewritten using a `#pragma omp parallel` construct with -explicit loop control. You can find the full code for this example [here](code/examples/04-matrix-update.c). +effectively in iterative tasks like this, we use a `#pragma omp parallel` construct, which gives explicit control over +the loop and allows proper barrier placement. You can find the full code for this example [here](code/examples/04-matrix-update.c). ::: Barriers introduce additional overhead into our parallel algorithms, as some threads will be idle whilst waiting for @@ -301,9 +304,9 @@ up the private copies into the shared variable value. This means the output will The code below attempts to track the progress of a parallel loop using a shared counter, `progress`. However, it has a problem: the final value of progress is often incorrect, and the progress updates might be inconsistent. -- Can you identify the issue with the current implementation? -- How would you fix it to ensure the progress counter works correctly and updates are synchronised? -- After fixing the issue, experiment with different loop schedulers (`static`, `dynamic`, `guided` and `auto`) +1. Can you identify the issue with the current implementation? +2. How would you fix it to ensure the progress counter works correctly and updates are synchronised? +3. After fixing the issue, experiment with different loop schedulers (`static`, `dynamic`, `guided` and `auto`) to observe how they affect progress reporting. - What changes do you notice in the timing and sequence of updates when using these schedulers? - Which scheduler produces the most predictable progress updates? @@ -311,6 +314,7 @@ to observe how they affect progress reporting. 
```c #include #include +#include #define NUM_ELEMENTS 10000 @@ -321,6 +325,10 @@ int main(int argc, char **argv) { #pragma omp parallel for schedule(static) for (int i = 0; i < NUM_ELEMENTS; ++i) { array[i] = log(i) * cos(3.142 * i); + + if (i % (NUM_ELEMENTS / 10) == 0) { + printf("Progress: %d%%\n", (i * 100) / NUM_ELEMENTS); + } progress++; } @@ -336,17 +344,16 @@ gcc counter.c -o counter -fopenmp -lm ``` :::solution -The above program tracks progress using a shared counter (`progress++`) inside the loop, -but it does so without synchronisation, leading to a race condition. -Since multiple threads can access and modify progress at the same time, the final value of progress will likely be incorrect. +1. The above program tracks progress using a shared counter (`progress++`) inside the loop, +but it does so without synchronisation, leading to a race condition. Since multiple threads can access and modify progress at the same time, the final value of progress will likely be incorrect. This happens because the updates to progress are not synchronised across threads. As a result, the final value of -`progress` is often incorrect and varies across runs. You might see output like this: +`progress` is often incorrect and varies across runs. You might see output like this: ```text Final progress: 9983 (Expected: 10000) ``` -To fix this issue, we use a critical region to synchronise updates to progress. +2. To fix this issue, we use a critical region to synchronise updates to progress. We also introduce a second variable, `output_frequency`, to control how often progress updates are reported (e.g., every 10% of the total iterations). @@ -388,9 +395,7 @@ critical region, and if the loop body is lightweight (e.g., simple calculations) computational benefits of parallelisation. For example, if each iteration takes only a few nanoseconds to compute, the time spent waiting for access to the critical region might dominate the runtime. -### Behaviour with different schedulers - -The static scheduler, used in the corrected version, divides iterations evenly among threads. This ensures predictable +3. The static scheduler, used in the corrected version, divides iterations evenly among threads. This ensures predictable and consistent progress updates. For instance, progress increments occur at regular intervals (e.g., 10%, 20%, etc.), producing output like: @@ -676,4 +681,4 @@ for (int i = 0; i < ARRAY_SIZE; ++i) { ``` ::: -:::: \ No newline at end of file +:::: From 3da2bf6d511b99e81f1dfd1372bd8553ad1651d3 Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Mon, 9 Dec 2024 07:04:14 +0000 Subject: [PATCH 16/34] Updated barriers example --- .../hpc_openmp/code/examples/04-barriers.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/high_performance_computing/hpc_openmp/code/examples/04-barriers.c b/high_performance_computing/hpc_openmp/code/examples/04-barriers.c index 02693d78..8a85fc2c 100644 --- a/high_performance_computing/hpc_openmp/code/examples/04-barriers.c +++ b/high_performance_computing/hpc_openmp/code/examples/04-barriers.c @@ -16,6 +16,7 @@ void initialise_lookup_table(int thread_id, double lookup_table[TABLE_SIZE]) { void do_main_calculation(int thread_id, double lookup_table[TABLE_SIZE]) { int num_threads = omp_get_num_threads(); for (int i = thread_id; i < TABLE_SIZE; i += num_threads) { + lookup_table[i] = lookup_table[i] * 5.0; printf("Thread %d processing lookup_table[%d] = %f\n", thread_id, i, lookup_table[i]); } } @@ -39,4 +40,4 @@ int main() { } return 0; -} \ No newline at end of file +} From 23c2ca0f438919ed9bde198d3728111133e7aaf0 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 9 Dec 2024 07:06:19 +0000 Subject: [PATCH 17/34] Refactored code to use dynamic matrices --- .../code/examples/04-matrix-update.c | 72 +++++++++++++++---- 1 file changed, 57 insertions(+), 15 deletions(-) diff --git a/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c index f43bf1c5..76d81165 100644 --- a/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c +++ b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c @@ -1,17 +1,30 @@ #include #include -#define NX 4 -#define NY 4 -#define NUM_ITERATIONS 10 +#define NY 4 // Number of columns +#define NUM_ITERATIONS 10 // Number of iterations -void iterate_matrix_solution(double old_matrix[NX][NY], double new_matrix[NX][NY], int thread_id) { +void initialise_matrix(double matrix[][NY], int NX) { + for (int i = 0; i < NX; ++i) { + for (int j = 0; j < NY; ++j) { + matrix[i][j] = i + j * 0.5; + } + } +} + +void iterate_matrix_solution(double current_matrix[][NY], double next_matrix[][NY], int thread_id, int NX) { for (int j = 0; j < NY; ++j) { - new_matrix[thread_id][j] = old_matrix[thread_id][j] + 1; + if (thread_id > 0) { + /* Each row depends on the current and previous rows */ + next_matrix[thread_id][j] = current_matrix[thread_id][j] + current_matrix[thread_id - 1][j]; + } else { + /* First row is independent */ + next_matrix[thread_id][j] = current_matrix[thread_id][j] + 1; + } } } -void copy_matrix(double src[NX][NY], double dest[NX][NY]) { +void copy_matrix(double src[][NY], double dest[][NY], int NX) { for (int i = 0; i < NX; ++i) { for (int j = 0; j < NY; ++j) { dest[i][j] = src[i][j]; @@ -19,23 +32,52 @@ void copy_matrix(double src[NX][NY], double dest[NX][NY]) { } } +void print_matrix(const char *label, double matrix[][NY], int NX) { + printf("%s:\n", label); + for (int i = 0; i < NX; ++i) { + for (int j = 0; j < NY; ++j) { + printf("%.2f ", matrix[i][j]); + } + printf("\n"); + } + printf("\n"); +} + int main() { - double old_matrix[NX][NY] = {{0}}; - double new_matrix[NX][NY] = {{0}}; + int NX; + /* Dynamically determine NX based on the number of threads */ #pragma omp parallel { - for (int i = 0; i < NUM_ITERATIONS; ++i) { - int thread_id = omp_get_thread_num(); + NX = omp_get_num_threads(); + } - iterate_matrix_solution(old_matrix, new_matrix, thread_id); /* Each thread updates a portion of the matrix */ + double 
current_matrix[NX][NY]; + double next_matrix[NX][NY]; - #pragma omp barrier /* Wait until all threads complete updates to new_matrix */ + /* Initialise the matrix */ + initialise_matrix(current_matrix, NX); - copy_matrix(new_matrix, old_matrix); + /* Print the initial matrix */ + print_matrix("Initial Matrix", current_matrix, NX); + + + for (int iter = 0; iter < NUM_ITERATIONS; ++iter) { + #pragma omp parallel + { + int thread_id = omp_get_thread_num(); + + /* Update the next matrix based on the current matrix */ + iterate_matrix_solution(current_matrix, next_matrix, thread_id, NX); + + #pragma omp barrier /* Synchronise all threads before copying */ } + + /* Copy the next matrix back into the current matrix */ + copy_matrix(next_matrix, current_matrix, NX); } - printf("Matrix update complete.\n"); + print_matrix("Final Matrix", current_matrix, NX); + return 0; -} \ No newline at end of file +} From e26954ea25fe0a671a9574d28168ef19e23d2f8c Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 9 Dec 2024 10:58:44 +0000 Subject: [PATCH 18/34] Updated matrix-update example code --- .../code/examples/04-matrix-update.c | 56 ++++++------------- 1 file changed, 18 insertions(+), 38 deletions(-) diff --git a/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c index 76d81165..824f373e 100644 --- a/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c +++ b/high_performance_computing/hpc_openmp/code/examples/04-matrix-update.c @@ -1,18 +1,17 @@ #include #include -#define NY 4 // Number of columns -#define NUM_ITERATIONS 10 // Number of iterations +#define NY 4 +#define NUM_ITERATIONS 10 -void initialise_matrix(double matrix[][NY], int NX) { - for (int i = 0; i < NX; ++i) { +void initialise_matrix(double matrix[][NY], int nx) { + for (int i = 0; i < nx; ++i) { for (int j = 0; j < NY; ++j) { matrix[i][j] = i + j * 0.5; } } } - -void iterate_matrix_solution(double current_matrix[][NY], double next_matrix[][NY], int thread_id, int NX) { +void iterate_matrix_solution(double current_matrix[][NY], double next_matrix[][NY], int thread_id, int nx) { for (int j = 0; j < NY; ++j) { if (thread_id > 0) { /* Each row depends on the current and previous rows */ @@ -23,18 +22,16 @@ void iterate_matrix_solution(double current_matrix[][NY], double next_matrix[][N } } } - -void copy_matrix(double src[][NY], double dest[][NY], int NX) { - for (int i = 0; i < NX; ++i) { +void copy_matrix(double src[][NY], double dest[][NY], int nx) { + for (int i = 0; i < nx; ++i) { for (int j = 0; j < NY; ++j) { dest[i][j] = src[i][j]; } } } - -void print_matrix(const char *label, double matrix[][NY], int NX) { +void print_matrix(const char *label, double matrix[][NY], int nx) { printf("%s:\n", label); - for (int i = 0; i < NX; ++i) { + for (int i = 0; i < nx; ++i) { for (int j = 0; j < NY; ++j) { printf("%.2f ", matrix[i][j]); } @@ -42,42 +39,25 @@ void print_matrix(const char *label, double matrix[][NY], int NX) { } printf("\n"); } - int main() { - int NX; - - /* Dynamically determine NX based on the number of threads */ - #pragma omp parallel - { - NX = omp_get_num_threads(); - } - - double current_matrix[NX][NY]; - double next_matrix[NX][NY]; - - /* Initialise the matrix */ - initialise_matrix(current_matrix, NX); - - /* Print the initial matrix */ - print_matrix("Initial Matrix", current_matrix, NX); + int nx = omp_get_max_threads(); + double current_matrix[nx][NY]; + double next_matrix[nx][NY]; + 
initialise_matrix(current_matrix, nx); + print_matrix("Initial Matrix", current_matrix, nx); for (int iter = 0; iter < NUM_ITERATIONS; ++iter) { #pragma omp parallel { int thread_id = omp_get_thread_num(); - /* Update the next matrix based on the current matrix */ - iterate_matrix_solution(current_matrix, next_matrix, thread_id, NX); + iterate_matrix_solution(current_matrix, next_matrix, thread_id, nx); #pragma omp barrier /* Synchronise all threads before copying */ } - - /* Copy the next matrix back into the current matrix */ - copy_matrix(next_matrix, current_matrix, NX); + copy_matrix(next_matrix, current_matrix, nx); } - - print_matrix("Final Matrix", current_matrix, NX); - + print_matrix("Final Matrix", current_matrix, nx); return 0; -} +} \ No newline at end of file From 22267cd436ee228a55c44ead6d4514a41ea95ddf Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 9 Dec 2024 13:04:52 +0000 Subject: [PATCH 19/34] Refactored barrier example to use omp_get_max_threads() and updated explanation --- .../hpc_openmp/04_synchronisation.md | 39 ++++++++++++------- 1 file changed, 26 insertions(+), 13 deletions(-) diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index 6e8565c3..0c89b08b 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -131,38 +131,51 @@ as the calculation depends on this data (See the full code [here](code/examples/ } ``` Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates -are completed before moving to the next step. For example, in the following snippet, a barrier makes sure that updates -to `next_matrix` are finished by all threads before its values are copied back into -`current_matrix`. +are completed before moving to the next step. For example, in the following snippet, each thread updates its assigned +row of a matrix using data from its current row and the row above (except for the first row, which has no dependency; +see the full code [here](code/examples/04-matrix-update.c)). A barrier ensures that all threads finish updating +their rows in `next_matrix` before the values are copied back into `current_matrix`. Without this barrier, threads might +read outdated or partially updated data, causing inconsistencies. + +Here, the number of rows (`nx`) is dynamically determined at runtime using `omp_get_max_threads()`. This function provides +the maximum number of threads OpenMP can use in a parallel region, based on the system's resources and runtime configuration. +Using this value, we define the number of rows in the matrix, with each row corresponding to a potential thread. This setup +ensures that both the `current_matrix` and `next_matrix provide` rows for the maximum number of threads allocated during +parallel execution. ```c ...... 
int main() { - double current_matrix[NX][NY]; - double next_matrix[NX][NY]; - - /* Initialise the matrix */ - initialise_matrix(current_matrix, NX); + int nx = omp_get_max_threads(); + double current_matrix[nx][NY]; + double next_matrix[nx][NY]; + + /* Initialise the current matrix */ + initialise_matrix(current_matrix, nx); for (int iter = 0; iter < NUM_ITERATIONS; ++iter) { #pragma omp parallel { int thread_id = omp_get_thread_num(); - iterate_matrix_solution(current_matrix, next_matrix, thread_id, NX); /* Each thread updates its assigned row */ + /* Update next_matrix based on current_matrix */ + iterate_matrix_solution(current_matrix, next_matrix, thread_id, nx); - #pragma omp barrier /* Wait for every thread to finish updating next_matrix */ + #pragma omp barrier /* Synchronise all threads before copying */ } - - copy_matrix(next_matrix, current_matrix, NX); /* Copy next_matrix to current_matrix */ + /* Copy the next_matrix into current_matrix for the next iteration */ + copy_matrix(next_matrix, current_matrix, nx); } + + /* Print the final matrix */ + print_matrix("Final Matrix", current_matrix, nx); } ``` :::callout{variant='note'} OpenMP does not allow barriers to be placed directly inside `#pragma omp parallel for` loops due to restrictions on closely [nested regions](https://www.openmp.org/spec-html/5.2/openmpse101.html#x258-27100017.1). To coordinate threads effectively in iterative tasks like this, we use a `#pragma omp parallel` construct, which gives explicit control over -the loop and allows proper barrier placement. You can find the full code for this example [here](code/examples/04-matrix-update.c). +the loop and allows proper barrier placement. ::: Barriers introduce additional overhead into our parallel algorithms, as some threads will be idle whilst waiting for From 4e577d6f43f9c2bece48214714fe35b81b0ff160 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 10 Dec 2024 00:00:39 +0000 Subject: [PATCH 20/34] Fix markdown linting issues --- .../hpc_openmp/02_intro_openmp.md | 40 +++-- .../hpc_openmp/03_parallel_api.md | 76 ++++---- .../hpc_openmp/04_synchronisation.md | 162 +++++++++--------- .../hpc_openmp/05_hybrid_parallelism.md | 14 +- 4 files changed, 149 insertions(+), 143 deletions(-) diff --git a/high_performance_computing/hpc_openmp/02_intro_openmp.md b/high_performance_computing/hpc_openmp/02_intro_openmp.md index 33cce4b9..5c7e7e13 100644 --- a/high_performance_computing/hpc_openmp/02_intro_openmp.md +++ b/high_performance_computing/hpc_openmp/02_intro_openmp.md @@ -11,15 +11,16 @@ learningOutcomes: ## What is OpenMP? -OpenMP is an industry-standard API specifically designed for parallel programming in shared memory environments. It supports programming in languages such as C, C++, +OpenMP is an industry-standard API specifically designed for parallel programming in shared memory environments. It supports programming in languages such as C, C++, and Fortran. OpenMP is an open source, industry-wide initiative that benefits from collaboration among hardware and software vendors, governed by the OpenMP Architecture Review Board ([OpenMP ARB](https://www.openmp.org/)). -::::challenge{id=timeline, title="An OpenMP Timeline"} +::::challenge{id=timeline title="An OpenMP Timeline"} If you're interested, there's a [timeline of how OpenMP developed](https://www.openmp.org/uncategorized/openmp-timeline/). It provides an overview of OpenMP's evolution until 2014, with significant advancements occurring thereafter. 
Notably, OpenMP 5.0 marked a significant step in 2018, followed by the latest iteration, OpenMP 5.2, which was released in November 2021. + :::: ## How does it work? @@ -36,7 +37,8 @@ OpenMP consists of three key components that enable parallel programming using t - **Compiler Directives:** OpenMP makes use of special code markers known as *compiler directives* to indicate to the compiler when and how to parallelise various sections of code. These directives are prefixed with `#pragma omp`, and mark sections of code to be executed concurrently by multiple threads. - **Runtime Library Routines:** These are predefined functions provided by the OpenMP runtime library. They allow you to control the behavior of threads, manage synchronisation, and handle parallel execution. For example, we can use the function `omp_get_thread_num()` to obtain the unique identifier of the calling thread. -- **Environment Variables:** These are settings that can be adjusted to influence the behavior of the OpenMP runtime. They provide a way to fine-tune the parallel execution of your program. Setting OpenMP environment variables is typically done similarly to other environment variables for your system. +- **Environment Variables:** These are settings that can be adjusted to influence the behavior of the OpenMP runtime. They provide a way to fine-tune the parallel execution of your program. Setting OpenMP environment variables is typically done similarly to other environment variables for your system. + For instance, you can adjust the number of threads to use for a program you are about to execute by specifying the value in the `OMP_NUM_THREADS` environment variable. Since parallelisation using OpenMP is accomplished by adding compiler directives to existing code structures, it's relatively easy to get started using it. @@ -54,7 +56,7 @@ In order to make use of OpenMP itself, it's usually a case of ensuring you have Save the following code in `hello_world_omp.c`: -~~~c +```c #include #include @@ -64,44 +66,43 @@ int main() { printf("Hello World!\n"); } } -~~~ +``` + :::callout{variant="note"} -In this example, `#include ` is not strictly necessary since the code does -not call OpenMP runtime functions. However, it is a good practice to include this header to make it clear that -the program uses OpenMP and to prepare for future use of OpenMP library functions. +In this example, `#include ` is not strictly necessary since the code does not call OpenMP runtime functions. However, it is a good practice to include this header to make it clear that the program uses OpenMP and to prepare for future use of OpenMP library functions. ::: -You'll likely want to compile it using a standard compiler such as `gcc`, although this may depend on your -system. To enable the creation of multi-threaded code based on OpenMP directives, pass the `-fopenmp` flag to the compiler. +You'll likely want to compile it using a standard compiler such as `gcc`, although this may depend on your system. +To enable the creation of multi-threaded code based on OpenMP directives, pass the `-fopenmp` flag to the compiler. This flag indicates that you're compiling an OpenMP program: -~~~bash +```bash gcc hello_world_omp.c -o hello_world_omp -fopenmp -~~~ +``` Before we run the code we also need to indicate how many threads we wish the program to use. One way to do this is to specify this using the `OMP_NUM_THREADS` environment variable, e.g. 
-~~~bash +```bash export OMP_NUM_THREADS=4 -~~~ +``` Now you can run it just like any other program using the following command: -~~~bash +```bash ./hello_world_omp -~~~ +``` When you execute the OpenMP program, it will display 'Hello World!' multiple times according to the value we entered in `OMP_NUM_THREADS`, with each thread in the parallel region executing the `printf` statement concurrently: -~~~text +``` text Hello World! Hello World! Hello World! Hello World! -~~~ +``` ::::callout @@ -116,4 +117,5 @@ If you're looking to develop OpenMP programs in VSCode, here are three configura You may need to adapt the `tasks.json` and `launch.json` depending on your platform (in particular, the `program` field in `launch.json` may need to reference a `hello_world_omp.exe` file if running on Windows, and the location of gcc in the `command` field may be different in `tasks.json`). Once you've compiled `hello_world_omp.c` the first time, then, by selecting VSCode's `Run and Debug` tab on the left, the `C++ OpenMP: current file` configuration should appear in the top left which will set `OMP_NUM_THREADS` before running it. -:::: \ No newline at end of file + +:::: diff --git a/high_performance_computing/hpc_openmp/03_parallel_api.md b/high_performance_computing/hpc_openmp/03_parallel_api.md index df3423c0..28f1a724 100644 --- a/high_performance_computing/hpc_openmp/03_parallel_api.md +++ b/high_performance_computing/hpc_openmp/03_parallel_api.md @@ -75,11 +75,9 @@ Of particular importance in parallel programs is how memory is managed and how a and OpenMP has a number of mechanisms to indicate how they should be handled. Essentially, OpenMP provides two ways to do this for variables: -- **Shared**: A single instance of the variable is shared among all threads, meaning every thread can access -and modify the same data. This is useful for shared resources but requires careful management to prevent conflicts or -unintended behavior. -- **Private**: Each thread gets its own isolated copy of the variable, similar to how variables are -private in `if` statements or functions, where each thread’s version is independent and doesn't affect others. +- **Shared**: A single instance of the variable is shared among all threads, meaning every thread can access +and modify the same data. This is useful for shared resources but requires careful management to prevent conflicts or unintended behavior. +- **Private**: Each thread gets its own isolated copy of the variable, similar to how variables are private in `if` statements or functions, where each thread’s version is independent and doesn't affect others. For example, what if we wanted to hold the thread ID and the total number of threads within variables in the code block? Let's start by amending the parallel code block to the following: @@ -127,8 +125,7 @@ But what about declarations outside of this block? For example: Which may seem on the surface to be correct. However, this illustrates a critical point about why we need to be careful. -Now, since the variable declarations are outside the parallel block, they are, -by default, *shared* across threads. This means any thread can modify these variables at any time, +Now, since the variable declarations are outside the parallel block, they are, by default, *shared* across threads. This means any thread can modify these variables at any time, which is potentially dangerous. 
So here, `thread_id` may hold the value for another thread identifier when it's printed, since there is an opportunity between its assignment and its access within `printf` to be changed in another thread. This could be particularly problematic with a much larger data set and complex processing of that data, @@ -136,9 +133,7 @@ where it might not be obvious that incorrect behaviour has happened at all, and lead to incorrect results. This is known as a *race condition*, and we'll look into them in more detail in the next episode. -But there’s another common scenario to watch out for. What happens when we want to declare a variable -outside the parallel region, make it private, and retain its initial value inside the block? -Let’s consider the following example: +But there’s another common scenario to watch out for. What happens when we want to declare a variable outside the parallel region, make it private, and retain its initial value inside the block? Let’s consider the following example: ```c int initial_value = 15; @@ -148,15 +143,14 @@ int initial_value = 15; printf("Thread %d sees initial_value = %d\n", omp_get_thread_num(), initial_value); } ``` -You might expect each thread to start with `initial_value` set to `15`. -However, this is not the case. When a variable is declared as `private`, each thread gets its own copy -of the variable, but those copies are **uninitialised**—they don’t inherit the value from the variable outside -the parallel region. As a result, the output may vary and include seemingly random numbers, depending on the -compiler and runtime. -To handle this, you can use the `firstprivate` directive. With `firstprivate`, each thread gets its own -private copy of the variable, and those copies are initialised with the value from the variable outside the -parallel region. For example: +You might expect each thread to start with `initial_value` set to `15`. +However, this is not the case. When a variable is declared as `private`, each thread gets its own copy +of the variable, but those copies are **uninitialised**—they don’t inherit the value from the variable outside +the parallel region. As a result, the output may vary and include seemingly random numbers, depending on the compiler and runtime. + +To handle this, you can use the `firstprivate` directive. With `firstprivate`, each thread gets its own private copy of the variable, +and those copies are initialised with the value from the variable outside the parallel region. For example: ```c int initial_value = 15; @@ -166,17 +160,20 @@ int initial_value = 15; printf("Thread %d sees initial_value = %d\n", omp_get_thread_num(), initial_value); } ``` + Now, the initial value is correctly passed to each thread: ```text + Thread 0 sees initial_value = 15 Thread 1 sees initial_value = 15 Thread 2 sees initial_value = 15 Thread 3 sees initial_value = 15 ``` -Each thread begins with initial_value set to `15`. This avoids the unpredictable -behavior of uninitialised variables and ensures that the initial value is preserved for each thread. + +Each thread begins with initial_value set to `15`. This avoids the unpredictable behavior of uninitialised +variables and ensures that the initial value is preserved for each thread. ::::callout @@ -266,9 +263,8 @@ and how to specify different scheduling behaviours. ## Nested Loops with `collapse` -By default, OpenMP parallelises only the outermost loop in a nested structure. 
This works fine for many cases, -but what if the outer loop doesn’t have enough iterations to keep all threads busy, or the inner loop does most -of the work? In these situations, we can use the `collapse` clause to combine the iteration +By default, OpenMP parallelises only the outermost loop in a nested structure. This works fine for many cases, +but what if the outer loop doesn’t have enough iterations to keep all threads busy, or the inner loop does most of the work? In these situations, we can use the `collapse` clause to combine the iteration spaces of multiple loops into a single loop for parallel execution. For example, consider a nested loop structure: @@ -281,8 +277,9 @@ for (int i = 0; i < N; i++) { } } ``` -Without the `collapse` clause, the outer loop is divided into `N` iterations, and the inner loop is executed sequentially -within each thread. If `N` is small or `M` contains the bulk of the work, some threads might finish their work quickly + +Without the `collapse` clause, the outer loop is divided into `N` iterations, and the inner loop is executed sequentially +within each thread. If `N` is small or `M` contains the bulk of the work, some threads might finish their work quickly and sit idle, waiting for others to complete. This imbalance can slow down the overall execution of the program. Adding `collapse` changes this: @@ -295,8 +292,9 @@ for (int i = 0; i < N; i++) { } } ``` -The number `2` in `collapse(2)` specifies how many nested loops to combine. -Here, the two loops `(i and j)` are combined into a single iteration space with `N * M` iterations. + +The number `2` in `collapse(2)` specifies how many nested loops to combine. +Here, the two loops `(i and j)` are combined into a single iteration space with `N * M` iterations. These iterations are then distributed across the threads, ensuring a more balanced workload. ::: @@ -410,22 +408,22 @@ for (int i = 0; i < NUM_ITERATIONS; ++i) { ## How the `auto` Scheduler Works -The `auto` scheduler lets the compiler or runtime system automatically decide the best way to distribute work among threads. This is really convenient because -you don’t have to manually pick a scheduling method—the system handles it for you. It’s especially handy if your workload distribution is uncertain or changes a -lot. But keep in mind that how well `auto` works can depend a lot on the compiler. Not all compilers optimize equally well, and there might be a bit of overhead -as the runtime figures out the best scheduling method, which could affect performance in highly optimized code. - +The `auto` scheduler lets the compiler or runtime system automatically decide the best way to distribute work among threads. +This is really convenient because you don’t have to manually pick a scheduling method—the system handles it for you. +It’s especially handy if your workload distribution is uncertain or changes a lot. But keep in mind that how +well `auto` works can depend a lot on the compiler. Not all compilers optimize equally well, and there might be a bit of overhead +as the runtime figures out the best scheduling method, which could affect performance in highly optimized code. + The [OpenMP documentation](https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf) states that with `schedule(auto)`, the scheduling decision is left to the compiler or runtime system. So, how does the compiler make this decision? When using GCC, which is common in many environments including HPC, the `auto` scheduler often maps to `static` scheduling. 
This means it splits the work into equal chunks ahead of time for simplicity and performance. `static` scheduling is straightforward and has low overhead, which often leads to efficient execution for many applications. -However, specialised HPC compilers, like those from Intel or IBM, might handle `auto` differently. These advanced compilers can dynamically adjust the scheduling +However, specialised HPC compilers, like those from Intel or IBM, might handle `auto` differently. These advanced compilers can dynamically adjust the scheduling method during runtime, considering things like workload variability and specific hardware characteristics to optimize performance. -So, when should you use `auto`? It’s great during development for quick performance testing without having to manually adjust scheduling methods. It’s also -useful in environments where the workload changes a lot, letting the runtime adapt the scheduling as needed. While `auto` can make your code simpler, it’s -important to test different schedulers to see which one works best for your specific application. - +So, when should you use `auto`? It’s great during development for quick performance testing without having to manually adjust scheduling methods. It’s also +useful in environments where the workload changes a lot, letting the runtime adapt the scheduling as needed. While `auto` can make your code simpler, it’s +important to test different schedulers to see which one works best for your specific application. -::: +::: ::::challenge{id=differentschedulers, title="Try Out Different Schedulers"} @@ -516,4 +514,4 @@ Then rerun. Try it with different chunk sizes too, e.g.: export OMP_SCHEDULE=static,1 ``` -::: \ No newline at end of file +::: diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index 0c89b08b..21d3aa41 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -130,18 +130,17 @@ as the calculation depends on this data (See the full code [here](code/examples/ do_main_calculation(thread_id); } ``` -Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates -are completed before moving to the next step. For example, in the following snippet, each thread updates its assigned -row of a matrix using data from its current row and the row above (except for the first row, which has no dependency; -see the full code [here](code/examples/04-matrix-update.c)). A barrier ensures that all threads finish updating -their rows in `next_matrix` before the values are copied back into `current_matrix`. Without this barrier, threads might -read outdated or partially updated data, causing inconsistencies. - -Here, the number of rows (`nx`) is dynamically determined at runtime using `omp_get_max_threads()`. This function provides -the maximum number of threads OpenMP can use in a parallel region, based on the system's resources and runtime configuration. -Using this value, we define the number of rows in the matrix, with each row corresponding to a potential thread. This setup -ensures that both the `current_matrix` and `next_matrix provide` rows for the maximum number of threads allocated during -parallel execution. + +Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates +are completed before moving to the next step. 
For example, in the following snippet, each thread updates its assigned +row of a matrix using data from its current row and the row above (except for the first row, which has no dependency; +see the full code [here](code/examples/04-matrix-update.c)). A barrier ensures that all threads finish updating their rows in `next_matrix` before the values are +copied back into `current_matrix`. Without this barrier, threads might read outdated or partially updated data, causing inconsistencies. + +Here, the number of rows (`nx`) is dynamically determined at runtime using `omp_get_max_threads()`. This function provides +the maximum number of threads OpenMP can use in a parallel region, based on the system's resources and runtime configuration. +Using this value, we define the number of rows in the matrix, with each row corresponding to a potential thread. This setup +ensures that both the `current_matrix` and `next_matrix provide` rows for the maximum number of threads allocated during parallel execution. ```c ...... @@ -171,11 +170,12 @@ int main() { print_matrix("Final Matrix", current_matrix, nx); } ``` + :::callout{variant='note'} -OpenMP does not allow barriers to be placed directly inside `#pragma omp parallel for` loops due to restrictions -on closely [nested regions](https://www.openmp.org/spec-html/5.2/openmpse101.html#x258-27100017.1). To coordinate threads -effectively in iterative tasks like this, we use a `#pragma omp parallel` construct, which gives explicit control over -the loop and allows proper barrier placement. + +OpenMP does not allow barriers to be placed directly inside `#pragma omp parallel for` loops due to restrictions on +closely [nested regions](https://www.openmp.org/spec-html/5.2/openmpse101.html#x258-27100017.1). To coordinate threads effectively in iterative tasks like this, we use a `#pragma omp parallel` construct, +which gives explicit control over the loop and allows proper barrier placement. ::: Barriers introduce additional overhead into our parallel algorithms, as some threads will be idle whilst waiting for @@ -247,17 +247,17 @@ write the result to disk (See the full code [here](code/examples/04-single-regio ``` :::callout{variant='note'} -OpenMP has a restriction: you cannot use `#pragma omp single` or `#pragma omp master` directly inside a `#pragma omp parallel for` -loop. If you attempt this, you'll encounter an error because OpenMP does not allow these regions to be **"closely nested"** -within a parallel loop. However, there’s a useful workaround: move the `single` or `master` region into a separate function -and call that function from within the loop. This approach works because OpenMP allows these regions when they are not -explicitly part of the loop structure + +OpenMP has a restriction: you cannot use `#pragma omp single` or `#pragma omp master` directly inside a `#pragma omp parallel for` loop. +If you attempt this, you'll encounter an error because OpenMP does not allow these regions to be **"closely nested"** within a parallel loop. +However, there’s a useful workaround: move the `single` or `master` region into a separate function and call that function from within the +loop. 
This approach works because OpenMP allows these regions when they are not explicitly part of the loop structure + ::: -If we wanted to sum up something in parallel (e.g., a reduction operation like summing an array), we would need to use a -critical region to prevent a race condition when threads update the reduction variable-the shared variable that stores -the final result. In the 'Identifying Race Conditions' challenge earlier, we saw that multiple threads -updating the same variable (**value**) at the same time caused inconsistencies—a classic race condition. +If we wanted to sum up something in parallel (e.g., a reduction operation like summing an array), we would need to use a critical region to +prevent a race condition when threads update the reduction variable-the shared variable that stores the final result. In the 'Identifying Race Conditions' challenge +earlier, we saw that multiple threads updating the same variable (**value**) at the same time caused inconsistencies—a classic race condition. This problem can be fixed by using a critical region, which allows threads to update **value** one at a time. For example: ```c @@ -272,20 +272,20 @@ for (int i = 0; i < NUM_TIMES; ++i) { } ``` -However, while this approach eliminates the race condition, it introduces synchronisation overhead. -For lightweight operations like summing values, this overhead can outweigh the benefits of parallelisation. +However, while this approach eliminates the race condition, it introduces synchronisation overhead. For lightweight operations like +summing values, this overhead can outweigh the benefits of parallelisation. :::callout ### Reduction Clauses -A more efficient way to handle tasks like summing values is to use OpenMP's `reduction` clause. -Unlike the critical region approach, the `reduction` clause avoids explicit synchronisation by -allowing each thread to work on its own private copy of the variable. Once the loop finishes, -OpenMP combines these private copies into a single result. This not only simplifies the code but also avoids delays +A more efficient way to handle tasks like summing values is to use OpenMP's `reduction` clause. +Unlike the critical region approach, the `reduction` clause avoids explicit synchronisation by +allowing each thread to work on its own private copy of the variable. Once the loop finishes, +OpenMP combines these private copies into a single result. This not only simplifies the code but also avoids delays caused by threads waiting to access the shared variable. -For example, instead of using a critical region to sum values, we can rewrite the code with a `reduction` clause +For example, instead of using a critical region to sum values, we can rewrite the code with a `reduction` clause as shown below: ```c @@ -306,20 +306,24 @@ int main() { return 0; } - ``` -Here, the `reduction(+:value)` directive does the work for us. During the loop, each thread maintains its -own copy of value, avoiding any chance of a race condition. When the loop ends, OpenMP automatically sums + +Here, the `reduction(+:value)` directive does the work for us. During the loop, each thread maintains its +own copy of value, avoiding any chance of a race condition. When the loop ends, OpenMP automatically sums up the private copies into the shared variable value. This means the output will always be correct—in this case, **10000**. 
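+
+Addition is not the only supported operator: OpenMP also provides reduction operators such as `*`, `min` and `max`. As a small sketch of the same idea (array contents assumed), a `max` reduction finds the largest element without any explicit synchronisation:
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    int data[8] = {3, 41, 6, 12, 9, 27, 18, 5};
+    int largest = data[0];
+
+    /* each thread keeps a private running maximum, combined when the loop finishes */
+    #pragma omp parallel for reduction(max:largest)
+    for (int i = 0; i < 8; ++i) {
+        if (data[i] > largest) {
+            largest = data[i];
+        }
+    }
+
+    printf("Largest value: %d\n", largest);
+
+    return 0;
+}
+```
+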
::: -::::challenge{id=reportingprogress, title="Reporting progress"} -The code below attempts to track the progress of a parallel loop using a shared counter, `progress`. -However, it has a problem: the final value of progress is often incorrect, and the progress updates might be +::::challenge{id=reportingprogress title="Reporting progress"} + +The code below attempts to track the progress of a parallel loop using a shared counter, `progress`. +However, it has a problem: the final value of progress is often incorrect, and the progress updates might be inconsistent. + 1. Can you identify the issue with the current implementation? + 2. How would you fix it to ensure the progress counter works correctly and updates are synchronised? -3. After fixing the issue, experiment with different loop schedulers (`static`, `dynamic`, `guided` and `auto`) + +3. After fixing the issue, experiment with different loop schedulers (`static`, `dynamic`, `guided` and `auto`) to observe how they affect progress reporting. - What changes do you notice in the timing and sequence of updates when using these schedulers? - Which scheduler produces the most predictable progress updates? @@ -350,6 +354,7 @@ int main(int argc, char **argv) { return 0; } ``` + NB: To compile this you’ll need to add `-lm` to inform the linker to link to the math C library, e.g. ```bash @@ -357,17 +362,18 @@ gcc counter.c -o counter -fopenmp -lm ``` :::solution -1. The above program tracks progress using a shared counter (`progress++`) inside the loop, -but it does so without synchronisation, leading to a race condition. Since multiple threads can access and modify progress at the same time, the final value of progress will likely be incorrect. -This happens because the updates to progress are not synchronised across threads. As a result, the final value of -`progress` is often incorrect and varies across runs. You might see output like this: + +**1.** The above program tracks progress using a shared counter (`progress++`) inside the loop, +but it does so without synchronisation, leading to a race condition. Since multiple threads can access and modify progress at the same time, the final value of progress will likely be incorrect. +This happens because the updates to progress are not synchronised across threads. As a result, the final value of +`progress` is often incorrect and varies across runs. You might see output like this: ```text Final progress: 9983 (Expected: 10000) ``` -2. To fix this issue, we use a critical region to synchronise updates to progress. -We also introduce a second variable, `output_frequency`, to control how often progress updates are reported +**2.** To fix this issue, we use a critical region to synchronise updates to progress. +We also introduce a second variable, `output_frequency`, to control how often progress updates are reported (e.g., every 10% of the total iterations). The corrected version: @@ -402,17 +408,18 @@ int main(int argc, char **argv) { return 0; } ``` -This implementation resolves the race condition by ensuring that only one thread can modify progress at a time. -However, this solution comes at a cost: **synchronisation overhead**. Every iteration requires threads to enter the -critical region, and if the loop body is lightweight (e.g., simple calculations), this overhead may outweigh the -computational benefits of parallelisation. 
For example, if each iteration takes only a few nanoseconds to compute, + +This implementation resolves the race condition by ensuring that only one thread can modify progress at a time. +However, this solution comes at a cost: **synchronisation overhead**. Every iteration requires threads to enter the +critical region, and if the loop body is lightweight (e.g., simple calculations), this overhead may outweigh the +computational benefits of parallelisation. For example, if each iteration takes only a few nanoseconds to compute, the time spent waiting for access to the critical region might dominate the runtime. -3. The static scheduler, used in the corrected version, divides iterations evenly among threads. This ensures predictable -and consistent progress updates. For instance, progress increments occur at regular intervals (e.g., 10%, 20%, etc.), +**3.** The static scheduler, used in the corrected version, divides iterations evenly among threads. This ensures predictable +and consistent progress updates. For instance, progress increments occur at regular intervals (e.g., 10%, 20%, etc.), producing output like: -``` +```text Progress: 10% Progress: 20% Progress: 30% @@ -420,33 +427,35 @@ Progress: 30% Final progress: 10000 (Expected: 10000) ``` -When experimenting with other schedulers, such as `dynamic` or `guided`, +When experimenting with other schedulers, such as `dynamic` or `guided`, the timing and sequence of updates change due to differences in how iterations are assigned to threads. -With the `dynamic` scheduler, threads are assigned smaller chunks of iterations as they finish their current work. -This can lead to progress updates appearing irregular, as threads complete their chunks at varying speeds based on +With the `dynamic` scheduler, threads are assigned smaller chunks of iterations as they finish their current work. +This can lead to progress updates appearing irregular, as threads complete their chunks at varying speeds based on workload. For example: -``` +```text Progress: 15% Progress: 30% Progress: 55% ... Final progress: 10000 (Expected: 10000) ``` -Using the `guided` scheduler results in yet another pattern. Threads start with larger chunks of iterations, -and the chunk size decreases as the loop progresses. This often leads to progress updates being sparse at the start but + +Using the `guided` scheduler results in yet another pattern. Threads start with larger chunks of iterations, +and the chunk size decreases as the loop progresses. This often leads to progress updates being sparse at the start but becoming more frequent toward the end of the loop: -``` +```text Progress: 25% Progress: 70% Progress: 100% Final progress: 10000 (Expected: 10000) ``` -The `auto` scheduler, on the other hand, leaves the decision about iteration assignment to the OpenMP runtime system. -This provides flexibility, as the runtime adapts the scheduling to optimise for the specific platform and workload. -However, because `auto` is implementation-dependent, the timing and predictability of progress updates can vary and + +The `auto` scheduler, on the other hand, leaves the decision about iteration assignment to the OpenMP runtime system. +This provides flexibility, as the runtime adapts the scheduling to optimise for the specific platform and workload. +However, because `auto` is implementation-dependent, the timing and predictability of progress updates can vary and are harder to generalise. ::: :::: @@ -466,7 +475,7 @@ race conditions. 
But in some cases, critical regions may not be flexible or gran amount of serialisation. If this is the case, we can use *locks* instead to achieve the same effect as a critical region. Locks are a mechanism in OpenMP which, just like a critical regions, create regions in our code which only one thread can be in at one time. The main advantage of locks, compared to critical regions, is that they provide more -granular control over thread synchronisation by protecting different-sized or fragmented regions of code, therefore +granular control over thread synchronisation by protecting different-sized or fragmented regions of code, therefore allowing greater flexibility. Locks are also far more flexible when it comes to making our code more modular, as it is possible to nest locks, or for accessing and modifying global variables. @@ -527,10 +536,10 @@ forget to or unset a lock in the wrong place. ### Atomic operations Another mechanism is atomic operations. In computing, an atomic operation is an operation which is performed without -interruption, meaning that once initiated, it is guaranteed to execute without interference from other operations. -In OpenMP, this refers to operations that execute without interference from other threads. If we make an operation modifying a value in -an array atomic, the compiler guarantees that no other thread can read or modify that array until the atomic operation is -finished. You can think of it as a thread having temporary exclusive access to something in our program, similar to a +interruption, meaning that once initiated, it is guaranteed to execute without interference from other operations. +In OpenMP, this refers to operations that execute without interference from other threads. If we make an operation modifying a value in +an array atomic, the compiler guarantees that no other thread can read or modify that array until the atomic operation is +finished. You can think of it as a thread having temporary exclusive access to something in our program, similar to a 'one at a time' rule for accessing and modifying parts of the program. To do an atomic operation, we use the `omp atomic` pragma before the operation we want to make atomic. @@ -547,22 +556,19 @@ int main() { /* Put the pragma before the shared variable */ #pragma omp parallel { - #pragma omp atomic - shared_variable += 1; - printf("Shared variable updated: %d\n", shared_variable); + #pragma omp atomic + shared_variable += 1; + printf("Shared variable updated: %d\n", shared_variable); } /* Can also use in a parallel for */ #pragma omp parallel for for (int i = 0; i < 4; ++i) { - #pragma omp atomic - shared_array[i] += 1; - printf("Shared array element %d updated: %d\n", i, shared_array[i]); - + #pragma omp atomic + shared_array[i] += 1; + printf("Shared array element %d updated: %d\n", i, shared_array[i]); } - } - ``` Atomic operations are for single line operations or piece of code. As in the example above, we can do an atomic @@ -587,7 +593,7 @@ Critical regions and locks are more appropriate when: Atomic operations are good when: -- The operation which needs synchronisation is simple, such as needing to protect a single variable update in the parallel +- The operation which needs synchronisation is simple, such as needing to protect a single variable update in the parallel algorithm. - There is low contention for shared data. - When you need to be as performant as possible, as atomic operations generally have the lowest performance cost. 
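+
+To make that trade-off concrete, here is a minimal sketch (variable names assumed) that uses an atomic operation for a simple counter update and a lock for a two-step update that must not be interleaved:
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    int simple_counter = 0;
+    int running_total = 0;
+    int samples_taken = 0;
+    omp_lock_t total_lock;
+
+    omp_init_lock(&total_lock);
+
+    #pragma omp parallel for
+    for (int i = 0; i < 1000; ++i) {
+        /* a single update: an atomic operation is cheap and sufficient */
+        #pragma omp atomic
+        simple_counter++;
+
+        /* two related updates that must stay consistent: protect both with a lock */
+        omp_set_lock(&total_lock);
+        running_total += i;
+        samples_taken++;
+        omp_unset_lock(&total_lock);
+    }
+
+    omp_destroy_lock(&total_lock);
+
+    printf("counter = %d, total = %d over %d samples\n",
+           simple_counter, running_total, samples_taken);
+
+    return 0;
+}
+```
+
+A critical region would protect the two-step update equally well here; the lock version is shown because the same lock can also guard that data from several different places in a larger program.
+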
@@ -632,7 +638,7 @@ int main(int argc, char **argv) { When we run the program multiple times, we expect the output `sum` to have the value of `0.000000`. However, due to an existing race condition, the program can sometimes produce wrong output in different runs, as shown below: -``` +```text 1. Sum: 1.000000 2. Sum: -1.000000 3. Sum: 2.000000 diff --git a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md index ab1c3312..7d4953c5 100644 --- a/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md +++ b/high_performance_computing/hpc_openmp/05_hybrid_parallelism.md @@ -235,11 +235,11 @@ implementing MPI. In this example, we wil be porting an OpenMP code to a hybrid also done this the other way around by porting an MPI code into a hybrid application. Neither *"evolution"* is more common nor better than the other, the route each code takes toward becoming hybrid is different. -So, how do we split work using a hybrid approach? For an embarrassingly parallel problem, such as the one we're working on, -we can split the problem size into smaller chunks across MPI ranks and use OpenMP to parallelise the work. For example, consider -a problem where we have to do a calculation for 1,000,000 input parameters. If we have four MPI ranks each of which will spawn 10 threads, -we could split the work evenly between MPI ranks so each rank will deal with 250,000 input parameters. We will then use OpenMP -threads to do the calculations in parallel. If we use a sequential scheduler, then each thread will do 25,000 calculations. Or we +So, how do we split work using a hybrid approach? For an embarrassingly parallel problem, such as the one we're working on, +we can split the problem size into smaller chunks across MPI ranks and use OpenMP to parallelise the work. For example, consider +a problem where we have to do a calculation for 1,000,000 input parameters. If we have four MPI ranks each of which will spawn 10 threads, +we could split the work evenly between MPI ranks so each rank will deal with 250,000 input parameters. We will then use OpenMP +threads to do the calculations in parallel. If we use a sequential scheduler, then each thread will do 25,000 calculations. Or we could use OpenMP's dynamic scheduler to automatically balance the workload. We have implemented this situation in the code example below. ```c @@ -378,7 +378,7 @@ Total time = 5.818889 seconds Ouch, this took longer to run than the pure OpenMP implementation (although only marginally longer in this example!). You may have noticed that we have 8 MPI ranks, each of which are spawning 8 of their own OpenMP threads. This is an important thing to realise. When you specify the number of threads for OpenMP to use, this is the number of threads -*each* MPI process will spawn. So why did it take longer? With each of the 8 MPI ranks spawning 8 threads, 64 threads +*each* MPI process will spawn. So why did it take longer? With each of the 8 MPI ranks spawning 8 threads, 64 threads were in flight. More threads means more overheads and if, for instance, we have 8 CPU Cores, then contention arises as each thread competes for access to a CPU core. @@ -448,4 +448,4 @@ was, rather naturally, when either $N_{\mathrm{ranks}} = 1$, $N_{\mathrm{threads $N_{\mathrm{threads}} = 1$ with the former being slightly faster. Otherwise, we found the best balance was $N_{\mathrm{ranks}} = 2$, $N_{\mathrm{threads}} = 3$. 
::: -:::: \ No newline at end of file +:::: From cafae7d0a254df5550e935349c4e039364c8bbda Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 10 Dec 2024 08:48:56 +0000 Subject: [PATCH 21/34] Fix markdown linting issues for MPI --- .../hpc_mpi/02_mpi_api.md | 17 +-- .../hpc_mpi/03_communicating_data.md | 40 +++---- .../04_point_to_point_communication.md | 41 ++++--- .../hpc_mpi/05_collective_communication.md | 64 ++++++----- .../hpc_mpi/06_non_blocking_communication.md | 32 +++--- .../hpc_mpi/07-derived-data-types.md | 103 +++++++++--------- .../hpc_mpi/08_porting_serial_to_mpi.md | 7 +- .../hpc_mpi/09_optimising_mpi.md | 24 ++-- .../hpc_mpi/10_communication_patterns.md | 9 +- .../hpc_mpi/11_advanced_communication.md | 21 +++- 10 files changed, 196 insertions(+), 162 deletions(-) diff --git a/high_performance_computing/hpc_mpi/02_mpi_api.md b/high_performance_computing/hpc_mpi/02_mpi_api.md index 21c8c888..9140d3ef 100644 --- a/high_performance_computing/hpc_mpi/02_mpi_api.md +++ b/high_performance_computing/hpc_mpi/02_mpi_api.md @@ -40,9 +40,9 @@ Since its inception, MPI has undergone several revisions, each introducing new f These revisions, along with subsequent updates and errata, have refined the MPI standard, making it more robust, versatile, and efficient. :::: -Today, various MPI implementations are available, each tailored to specific hardware architectures and systems. Popular implementations like [MPICH](https://www.mpich.org/), -[Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.0tevpk), -[IBM Spectrum MPI](https://www.ibm.com/products/spectrum-mpi), [MVAPICH](https://mvapich.cse.ohio-state.edu/) and +Today, various MPI implementations are available, each tailored to specific hardware architectures and systems. Popular implementations like [MPICH](https://www.mpich.org/), +[Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.0tevpk), +[IBM Spectrum MPI](https://www.ibm.com/products/spectrum-mpi), [MVAPICH](https://mvapich.cse.ohio-state.edu/) and [Open MPI](https://www.open-mpi.org/) offer optimised performance, portability, and flexibility. For instance, MPICH is known for its efficient scalability on HPC systems, while Open MPI prioritises extensive portability and adaptability, providing robust support for multiple operating systems, programming languages, and hardware platforms. @@ -127,10 +127,10 @@ HPC clusters don't usually have GUI-based IDEs installed on them. We can write code locally, and copy it across using `scp` or `rsync`, but most IDEs have the ability to open folders on a remote machine, or to automatically synchronise a local folder with a remote one. For **VSCode**, the [Remote-SSH](https://code.visualstudio.com/docs/remote/ssh) extension provides most of the functionality of a regular VSCode window, but on a remote machine. -Some older Linux systems don't support it - in that case, try the +Some older Linux systems don't support it - in that case, try the [SSH FS](https://marketplace.visualstudio.com/items?itemName=Kelvin.vscode-sshfs) extension instead. -Other IDEs like **CLion** also support +Other IDEs like **CLion** also support [a variety of remote development methods](https://www.jetbrains.com/help/clion/remote-development.html). 
:::: @@ -208,7 +208,7 @@ As we've just learned, running a program with `mpiexec` or `mpirun` results in t mpirun -n 4 ./hello_world ``` -However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. +However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. This usually also requires knowing the total number of tasks running at the same time. - The program needs to call the `MPI_Init()` function. @@ -235,6 +235,7 @@ int main(void) { return MPI_Finalize(); } ``` + :::: After MPI is initialised, you can find out the total number of ranks and the specific rank of the copy: @@ -382,6 +383,6 @@ For this we would need ways for ranks to communicate - the primary benefit of MP ## What About Python? -In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), +In [MPI for Python (mpi4py)](https://mpi4py.readthedocs.io/en/stable/), the initialisation and finalisation of MPI are handled by the library, and the user can perform MPI calls after ``from mpi4py import MPI``. -:::: \ No newline at end of file +:::: diff --git a/high_performance_computing/hpc_mpi/03_communicating_data.md b/high_performance_computing/hpc_mpi/03_communicating_data.md index fd768d90..bfe7e258 100644 --- a/high_performance_computing/hpc_mpi/03_communicating_data.md +++ b/high_performance_computing/hpc_mpi/03_communicating_data.md @@ -48,7 +48,7 @@ finished, similar to read receipts in e-mails and instant messages. Consider a simulation where each rank calculates the physical properties for a subset of cells on a very large grid of points. One step of the calculation needs to know the average temperature across the entire grid of points. How would you approach calculating the average temperature? ::::solution -There are multiple ways to approach this situation, but the most efficient approach would be to use collective operations to send the average temperature to a main rank which performs the final calculation. You can, of course, also use a point-to-point pattern, but it would be less efficient, especially with a large number of ranks. +There are multiple ways to approach this situation, but the most efficient approach would be to use collective operations to send the average temperature to a main rank which performs the final calculation. You can, of course, also use a point-to-point pattern, but it would be less efficient, especially with a large number of ranks. If the simulation wasn't done in parallel, or was instead using shared-memory parallelism, such as OpenMP, we wouldn't need to do any communication to get the data required to calculate the average. :::: @@ -85,13 +85,13 @@ types are in the table below: | MPI_LONG_DOUBLE | long double | | MPI_BYTE | char | -Remember, these constants aren't the same as the primitive types in C, so we can't use them to create variables, e.g., +Remember, these constants aren't the same as the primitive types in C, so we can't use them to create variables, e.g., ```c MPI_INT my_int = 1; ``` -is not valid code because, under the hood, these constants are actually special data structures used by MPI. +is not valid code because, under the hood, these constants are actually special data structures used by MPI. Therefore, we can only them as arguments in MPI functions. 
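+
+As a short sketch of the correct pattern (assuming MPI has been initialised and the program is running with at least two ranks), the variable keeps its ordinary C type and `MPI_INT` only appears as an argument to the communication call:
+
+```c
+#include <mpi.h>
+
+int main(int argc, char **argv) {
+    int my_rank;
+    int my_int = 1;  /* the variable itself uses the ordinary C type */
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
+
+    /* MPI_INT only ever appears as the datatype argument of an MPI call
+       (point-to-point sends and receives are covered in the next episode) */
+    if (my_rank == 0) {
+        MPI_Send(&my_int, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
+    } else if (my_rank == 1) {
+        MPI_Recv(&my_int, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+    }
+
+    return MPI_Finalize();
+}
+```
+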
::::callout @@ -111,10 +111,11 @@ need to change the type, you would only have to do it in one place, e.g.: // use them as you would normally INT_TYPE my_int = 1; ``` + :::: -Derived data types, on the other hand, are very similar to C structures which we define by using the basic MPI data types. -They are often useful for grouping similar data in communications or when sending a structure from one rank to another. +Derived data types, on the other hand, are very similar to C structures which we define by using the basic MPI data types. +They are often useful for grouping similar data in communications or when sending a structure from one rank to another. This is covered in more detail in the optional [Advanced Communication Techniques](11_advanced_communication.md) episode. :::::challenge{id=what-type, title="What Type Should You Use?"} @@ -131,6 +132,7 @@ The fact that `a[]` is an array does not matter, because all the elements in `a[ 1. `MPI_INT` 2. `MPI_DOUBLE` - `MPI_FLOAT` would not be correct as `float`'s contain 32 bits of data whereas `double`s are 64 bit. 3. `MPI_BYTE` or `MPI_CHAR` - you may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent. + :::: ::::: @@ -147,15 +149,14 @@ int my_rank; MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); // MPI_COMM_WORLD is the communicator the rank belongs to ``` -In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ranks into them. Messages can only be sent and received to and from the same communicator, effectively isolating messages to a communicator. For most applications, we usually don't need anything other than `MPI_COMM_WORLD`. But organising ranks into communicators can be helpful in some circumstances, as you can create small "work units" of multiple ranks to dynamically schedule the workload, or to help compartmentalise the problem into smaller chunks by using a +In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ranks into them. Messages can only be sent and received to and from the same communicator, effectively isolating messages to a communicator. For most applications, we usually don't need anything other than `MPI_COMM_WORLD`. But organising ranks into communicators can be helpful in some circumstances, as you can create small "work units" of multiple ranks to dynamically schedule the workload, or to help compartmentalise the problem into smaller chunks by using a [virtual cartesian topology](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node192.htm#Node192). Throughout this course, we will stick to using `MPI_COMM_WORLD`. ## Communication modes -When sending data between ranks, MPI will use one of four communication modes: synchronous, buffered, ready or standard. When a communication function is called, it takes control of program execution until the send-buffer is safe to be re-used again. What this means is that it's safe to re-use the memory/variable you passed without affecting the data that is still being sent. If MPI didn't have this concept of safety, then you could quite easily overwrite or destroy any data before it is transferred fully! This would lead to some very strange behaviour which would be hard to debug. The difference between the communication mode is when the buffer becomes safe to re-use. MPI won't guess at which mode *should* be used. +When sending data between ranks, MPI will use one of four communication modes: synchronous, buffered, ready or standard. 
When a communication function is called, it takes control of program execution until the send-buffer is safe to be re-used again. What this means is that it's safe to re-use the memory/variable you passed without affecting the data that is still being sent. If MPI didn't have this concept of safety, then you could quite easily overwrite or destroy any data before it is transferred fully! This would lead to some very strange behaviour which would be hard to debug. The difference between the communication mode is when the buffer becomes safe to re-use. MPI won't guess at which mode *should* be used. That is up to the programmer. Therefore, each mode has an associated communication function: - | Mode | Blocking function | |-------------|-------------------| | Synchronous | `MPI_SSend()` | @@ -177,7 +178,7 @@ you and the person have both picked up the phone, had your conversation and hung Synchronous communication is typically used when you need to guarantee synchronisation, such as in iterative methods or time dependent simulations where it is vital to ensure consistency. It's also the easiest communication mode to develop -and debug with because of its predictable behaviour. +and debug with because of its predictable behaviour. ### Buffered sends @@ -189,7 +190,7 @@ the postbox. You are blocked from doing other tasks whilst you write and send th postbox, you carry on with other tasks and don't wait for the letter to be delivered! Buffered sends are good for large messages and for improving the performance of your communication patterns by taking -advantage of the asynchronous nature of the data transfer. +advantage of the asynchronous nature of the data transfer. ### Ready sends @@ -206,20 +207,20 @@ then you may have to repeat yourself to make sure your transmit the information ### Standard sends -The standard send mode is the most commonly used type of send, as it provides a balance between ease of use and performance. -Under the hood, the standard send is either a buffered or a synchronous send, depending on the availability of system resources (e.g. the size of the internal buffer) and which mode MPI has determined to be the most efficient. +The standard send mode is the most commonly used type of send, as it provides a balance between ease of use and performance. +Under the hood, the standard send is either a buffered or a synchronous send, depending on the availability of system resources (e.g. the size of the internal buffer) and which mode MPI has determined to be the most efficient. ::::callout ## Which mode should I use? -Each communication mode has its own use cases where it excels. However, it is often easiest, at first, to use +Each communication mode has its own use cases where it excels. However, it is often easiest, at first, to use the standard send, `MPI_Send()`, and optimise later. If the standard send doesn't meet your requirements, or if you need more control over communication, then pick which communication mode suits your requirements best. You'll probably need to experiment to find the best! :::: ::::callout{variant="note"} -## Communication mode summary: +## Communication mode summary | Mode | Description | Analogy | MPI Function | |-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------| @@ -227,6 +228,7 @@ the standard send, `MPI_Send()`, and optimise later. 
If the standard send doesn' | Buffered | Returns control immediately after copying the message to a buffer, regardless of whether the receive has happened or not. | Sending a letter or e-mail | `MPI_Bsend()` | | Ready | Returns control immediately, assuming the matching receive has already been posted. Can lead to errors if the receive is not ready. | Talking to someone you think/hope is listening | `MPI_Rsend()` | | Standard | Returns control when it's safe to reuse the send buffer. May or may not wait for the matching receive (synchronous mode), depending on MPI implementation and message size. | Phone call or letter | `MPI_Send()` | + :::: ### Blocking vs. non-blocking communication @@ -249,11 +251,11 @@ communication is done -- and to not modify/use the data/variable/memory before t ## Is `MPI_Bsend()` non-blocking? -The buffered communication mode is a type of asynchronous communication, because the function returns before the data has been received by another rank. But, it's not a non-blocking call **unless** you use the non-blocking version +The buffered communication mode is a type of asynchronous communication, because the function returns before the data has been received by another rank. But, it's not a non-blocking call **unless** you use the non-blocking version `MPI_Ibsend()` (more on this later). Even though the data transfer happens in the background, allocating and copying data to the send buffer happens in the foreground, blocking execution of our program. On the other hand, `MPI_Ibsend()` is "fully" asynchronous because even allocating and copying data to the send buffer happens in the background. ::: -One downside to blocking communication is that if rank B is never listening for messages, rank A will become *deadlocked*. A deadlock +One downside to blocking communication is that if rank B is never listening for messages, rank A will become *deadlocked*. A deadlock happens when your program hangs indefinitely because the **send** (or **receive**) operation is unable to complete. Deadlocks can happen for countless number of reasons. For example, we might forget to write the corresponding **receive** function when sending data. Or a function may return earlier due to an error which isn't handled properly, or a @@ -264,12 +266,12 @@ impossible, but this does not stop any attempts to send data to crashed rank. ## Avoiding communication deadlocks -A common piece of advice in C is that when allocating memory using `malloc()`, always write the accompanying call to +A common piece of advice in C is that when allocating memory using `malloc()`, always write the accompanying call to `free()` to help avoid memory leaks by forgetting to deallocate the memory later. You can apply the same mantra to communication in MPI. When you send data, always write the code to receive the data as you may forget to later and accidentally cause a deadlock. :::: -Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. +Blocking communication works best when the work is balanced across ranks, so that each rank has an equal amount of things to do. A common pattern in scientific computing is to split a calculation across a grid and then to share the results between all ranks before moving onto the next calculation. 
If the workload is well-balanced, each rank will finish at roughly the same time and be ready to transfer data at the same time. But, as shown in the diagram below, if the workload is unbalanced, some ranks will finish their calculations earlier and begin to send their data to the other ranks before they are ready to receive data. This means some ranks will be sitting around doing nothing whilst they wait for the other ranks to become ready to receive data, wasting computation time. ![Blocking communication](fig/blocking-wait.png) @@ -305,4 +307,4 @@ Until the other person responds, we are stuck waiting for the response. Sending e-mails or letters in the post is a form of non-blocking communication we're all familiar with. When we send an e-mail, or a letter, we don't wait around to hear back for a response. We instead go back to our lives and start doing tasks instead. We can periodically check our e-mail for the response, and either keep doing other tasks or continue our previous task once we've received a response back from our e-mail. :::: -::::: \ No newline at end of file +::::: diff --git a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md index 4e111696..605285c0 100644 --- a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md +++ b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md @@ -14,7 +14,7 @@ learningOutcomes: --- In the previous episode we introduced the various types of communication in MPI. -In this section we will use the MPI library functions `MPI_Send()` and `MPI_Recv()`, which employ point-to-point communication, +In this section we will use the MPI library functions `MPI_Send()` and `MPI_Recv()`, which employ point-to-point communication, to send data from one rank to another. ![Sending data from one rank to another using MPI_SSend and MPI_Recv()](fig/send-recv.png) @@ -44,6 +44,7 @@ int MPI_Send( MPI_Comm communicator ) ``` + | | | |-----------------|---------------------------------------------------------------------------------------------------------------------------------------------| | `*data`: | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | @@ -53,7 +54,6 @@ int MPI_Send( | `tag`: | An message tag (integer), which is used to differentiate types of messages. We can specify `0` if we don't need different types of messages | | `communicator`: | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | - For example, if we wanted to send a message that contains `"Hello, world!\n"` from rank 0 to rank 1, we could state (assuming we were rank 0): @@ -62,8 +62,8 @@ char *message = "Hello, world!\n"; MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); ``` -So we are sending 14 elements of `MPI_CHAR()` one time, and specified `0` for our message tag since we don't anticipate -having to send more than one type of message. This call is synchronous, and will block until the corresponding `MPI_Recv()` +So we are sending 14 elements of `MPI_CHAR()` one time, and specified `0` for our message tag since we don't anticipate +having to send more than one type of message. This call is synchronous, and will block until the corresponding `MPI_Recv()` operation receives and acknowledges receipt of the message. 
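+
+If we did need to send more than one kind of message to the same rank, we could give each kind its own tag so the receiver can tell them apart. A small sketch of the sending side (variable names and tag values assumed):
+
+```c
+int temperature = 20;
+int pressure = 101;
+
+/* same destination rank, but different tags distinguish the two messages */
+MPI_Send(&temperature, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
+MPI_Send(&pressure, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
+```
+
+The matching receives would use the same tag values to pick out each message, as we will see with `MPI_Recv()` below.
+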
::::callout @@ -117,10 +117,11 @@ MPI_Status status; MPI_Recv(message, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status); message[14] = '\0'; ``` -Here, we create a buffer `message` to store the received data and initialise it to zeros (`{0}`) to prevent -any garbage content. We then call `MPI_Recv()` to receive the message, specifying the source rank (`0`), the message tag (`0`), -and the communicator (`MPI_COMM_WORLD`). The status object is passed to capture details about the received message, such as -the actual source rank or tag, though it is not used in this example. To ensure safe string handling, + +Here, we create a buffer `message` to store the received data and initialise it to zeros (`{0}`) to prevent +any garbage content. We then call `MPI_Recv()` to receive the message, specifying the source rank (`0`), the message tag (`0`), +and the communicator (`MPI_COMM_WORLD`). The status object is passed to capture details about the received message, such as +the actual source rank or tag, though it is not used in this example. To ensure safe string handling, we explicitly null-terminate the received message by setting `message[14] = '\0'`. Let's put this together with what we've learned so far. @@ -164,9 +165,10 @@ int main(int argc, char** argv) { return MPI_Finalize(); } ``` + :::callout{variant='warning'} -When using `MPI_Recv()` to receive string data, ensure that your buffer is large enough to hold the message and -includes space for a null terminator. Explicitly initialising the buffer and adding the null terminator avoids +When using `MPI_Recv()` to receive string data, ensure that your buffer is large enough to hold the message and +includes space for a null terminator. Explicitly initialising the buffer and adding the null terminator avoids undefined behavior or garbage output. ::: @@ -174,7 +176,7 @@ undefined behavior or garbage output. ## MPI Data Types in C -In the above example we send a string of characters and therefore specify the type `MPI_CHAR`. For a complete list of types, +In the above example we send a string of characters and therefore specify the type `MPI_CHAR`. For a complete list of types, see [the MPICH documentation](https://www.mpich.org/static/docs/v3.3/www3/Constants.html). :::: @@ -307,8 +309,10 @@ int main(int argc, char **argv) { return MPI_Finalize(); } ``` -Note: In MPI programs, every rank runs the same code. To make ranks behave differently, you must + +Note: In MPI programs, every rank runs the same code. To make ranks behave differently, you must explicitly program that behavior based on their rank ID. For example: + - Use conditionals like `if (rank == 0)` to define specific actions for rank 0. - All other ranks can perform different actions in an `else` block. @@ -351,14 +355,15 @@ int main(int argc, char **argv) { return MPI_Finalize(); } ``` -Here rank 0 calls `MPI_Recv` to gather messages, while other ranks call `MPI_Send` to send their messages. + +Here rank 0 calls `MPI_Recv` to gather messages, while other ranks call `MPI_Send` to send their messages. Without this differentiation, ranks will attempt the same actions, potentially causing errors or deadlocks. :::: ::::: :::callout{variant='note'} -If you don't require the additional information provided by `MPI_Status`, such as source or tag, -you can use `MPI_STATUS_IGNORE` in `MPI_Recv` calls. 
This simplifies your code by removing the need to declare and manage an +If you don't require the additional information provided by `MPI_Status`, such as source or tag, +you can use `MPI_STATUS_IGNORE` in `MPI_Recv` calls. This simplifies your code by removing the need to declare and manage an `MPI_Status` object. This is particularly useful in straightforward message-passing scenarios. ::: @@ -399,7 +404,7 @@ int main(int argc, char **argv) { ::::solution `MPI_Send()` will block execution until the receiving process has called `MPI_Recv()`. This prevents the sender from unintentionally modifying the message buffer before the message is actually sent. Above, both ranks call `MPI_Send()` and just wait for the other to respond. The solution is to have one of the ranks receive its message before sending. -Sometimes `MPI_Send()` will actually make a copy of the buffer and return immediately. This generally happens only for short messages. Even when this happens, the actual transfer will not start before the receive is posted. +Sometimes `MPI_Send()` will actually make a copy of the buffer and return immediately. This generally happens only for short messages. Even when this happens, the actual transfer will not start before the receive is posted. For this example, let’s have rank 0 send first, and rank 1 receive first. So all we need to do to fix this is to swap the send and receive for rank 1: @@ -413,6 +418,7 @@ if (rank == 0) { MPI_Ssend(&numbers, ARRAY_SIZE, MPI_INT, 0, comm_tag, MPI_COMM_WORLD); } ``` + :::: ::::: @@ -490,5 +496,6 @@ int main(int argc, char** argv) { return MPI_Finalize(); } ``` + :::: -::::: \ No newline at end of file +::::: diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index b70bce81..ef5a7a52 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -13,10 +13,10 @@ learningOutcomes: - Learn how to use collective communication functions. --- -The previous episode showed how to send data from one rank to another using point-to-point communication. -If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple -ranks, we have to manually loop over each rank to communicate the data. This type of communication, where multiple ranks talk to -one another known as called collective communication. In the code example below, point-to-point communication is used to +The previous episode showed how to send data from one rank to another using point-to-point communication. +If we wanted to send data from multiple ranks to a single rank to, for example, add up the value of a variable across multiple +ranks, we have to manually loop over each rank to communicate the data. This type of communication, where multiple ranks talk to +one another known as called collective communication. In the code example below, point-to-point communication is used to calculate the sum of the rank numbers - feel free to try it out! ```c @@ -61,15 +61,15 @@ int main(int argc, char **argv) { } ``` -For its use case, the code above works perfectly fine. However, it isn't very efficient when you need to communicate large -amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). It's also a lot of code -to do not much, which makes it easy to introduce mistakes in our code. 
A common mistake in this example would be to start the -loop over ranks from 0, which would cause a deadlock! It's actually quite a common mistake for new MPI users to write something +For its use case, the code above works perfectly fine. However, it isn't very efficient when you need to communicate large +amounts of data, have lots of ranks, or when the workload is uneven (due to the blocking communication). It's also a lot of code +to do not much, which makes it easy to introduce mistakes in our code. A common mistake in this example would be to start the +loop over ranks from 0, which would cause a deadlock! It's actually quite a common mistake for new MPI users to write something like the above. -We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has -access to collective communication functions to abstract all of this code for us. The above code can be replaced by a single -collective communication function. Collection operations are also implemented far more efficiently in the MPI library than we +We don't need to write code like this (unless we want *complete* control over the data communication), because MPI has +access to collective communication functions to abstract all of this code for us. The above code can be replaced by a single +collective communication function. Collection operations are also implemented far more efficiently in the MPI library than we could ever write using point-to-point communications. There are several collective operations that are implemented in the MPI standard. The most commonly-used are: @@ -92,14 +92,15 @@ int MPI_Barrier( MPI_Comm comm ); ``` + | | | |-------|--------------------------------------| | comm: | The communicator to add a barrier to | -When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. +When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. -In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, +In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. However, one use case is when multiple ranks need to write *sequentially* to the same file. The code example below shows how you may handle this by using a barrier. ```c @@ -115,8 +116,8 @@ for (int i = 0; i < num_ranks; ++i) { ### Broadcast -We'll often find that we need to data from one rank to all the other ranks. 
One approach, which is not very efficient, -is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. A far more efficient approach is to use the +We'll often find that we need to data from one rank to all the other ranks. One approach, which is not very efficient, +is to use `MPI_Send()` in a loop to send the data from rank to rank one by one. A far more efficient approach is to use the collective function `MPI_Bcast()` to *broadcast* the data from a root rank to every other rank. The `MPI_Bcast()` function has the following arguments, @@ -129,6 +130,7 @@ int MPI_Bcast( MPI_Comm comm ); ``` + | | | |-------------|-------------------------------------------------------| | `*data`: | The data to be sent to all ranks | @@ -138,18 +140,16 @@ int MPI_Bcast( | `comm`: | The communicator containing the ranks to broadcast to | `MPI_Bcast()` is similar to the `MPI_Send()` function. -The main functional difference is that `MPI_Bcast()` sends the data to all ranks (other than itself, where the data already is) +The main functional difference is that `MPI_Bcast()` sends the data to all ranks (other than itself, where the data already is) instead of a single rank, as shown in the diagram below. ![Each rank sending a piece of data to root rank](fig/broadcast.png) -Unlike `MPI_Send()` and `MPI_Recv()`, collective operations like `MPI_Bcast()` do not require explicitly matching sends and -receives in the user code. The internal implementation of collective functions ensures that all ranks correctly send and +Unlike `MPI_Send()` and `MPI_Recv()`, collective operations like `MPI_Bcast()` do not require explicitly matching sends and +receives in the user code. The internal implementation of collective functions ensures that all ranks correctly send and receive data as needed, abstracting this complexity from the programmer. This makes collective operations easier to use and less error-prone compared to point-to-point communication. - -There are lots of use cases for broadcasting data. -One common case is when data is sent back to a "root" rank to process, which then broadcasts the results back out to all the other ranks. +There are lots of use cases for broadcasting data. One common case is when data is sent back to a "root" rank to process, which then broadcasts the results back out to all the other ranks. Another example, shown in the code excerpt below, is to read data in on the root rank and to broadcast it out. This is useful pattern on some systems where there are not enough resources (filesystem bandwidth, limited concurrent I/O operations) for every rank to read the file at once. @@ -168,6 +168,7 @@ MPI_Bcast(data_from_file, NUM_POINTS, MPI_INT, 0, MPI_COMM_WORLD); ``` :::::challenge{id=sending-greetings, title="Sending Greetings"} + Send a message from rank 0 saying "Hello from rank 0" to all ranks using `MPI_Bcast()`. ::::solution @@ -197,6 +198,7 @@ int main(int argc, char **argv) { return MPI_Finalize(); } ``` + :::: ::::: @@ -237,12 +239,11 @@ int MPI_Scatter( The data to be *scattered* is split into even chunks of size `sendcount`. If `sendcount` is 2 and `sendtype` is `MPI_INT`, then each rank will receive two integers. -The values for `recvcount` and `recvtype` are the same as `sendcount` and `sendtype`. However, there are cases where `sendcount` -and `recvcount` might differ, such as when using derived types, which change how data is packed and unpacked during communication. 
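As a small illustration of the send and receive sides describing the same data differently (a sketch only: it assumes `send_buffer` holds two integers per rank on the root, and it uses `MPI_Type_contiguous()`, which is not covered in this lesson but is part of standard MPI):

```c
// The root scatters one "pair" element per rank; each receiver unpacks it as two plain ints.
MPI_Datatype pair_type;
MPI_Type_contiguous(2, MPI_INT, &pair_type);
MPI_Type_commit(&pair_type);

int recv_pair[2];
MPI_Scatter(send_buffer, 1, pair_type, recv_pair, 2, MPI_INT, 0, MPI_COMM_WORLD);

MPI_Type_free(&pair_type);
```

The type signatures still match on both sides (two integers each), which is what MPI requires.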
-For more on derived types and their impact on collective operations, see the -[Derived Data Types](07-derived-data-types.md) episode. -If the total amount of data is not evenly divisible by the number of processes, `MPI_Scatter()` will not work. -In this case, we need to use [`MPI_Scatterv()`](https://www.open-mpi.org/doc/v4.0/man3/MPI_Scatterv.3.php) instead to specify the amount of data each rank will receive. +The values for `recvcount` and `recvtype` are the same as `sendcount` and `sendtype`. However, there are cases where `sendcount` +and `recvcount` might differ, such as when using derived types, which change how data is packed and unpacked during communication. +For more on derived types and their impact on collective operations, see the +[Derived Data Types](07-derived-data-types.md) episode. If the total amount of data is not evenly divisible by the number of processes, `MPI_Scatter()` will +not work. In this case, we need to use [`MPI_Scatterv()`](https://www.open-mpi.org/doc/v4.0/man3/MPI_Scatterv.3.php) instead to specify the amount of data each rank will receive. The code example below shows `MPI_Scatter()` being used to send data which has been initialised only on the root rank. ```c @@ -282,6 +283,7 @@ int MPI_Gather( MPI_Comm comm ); ``` + | | | |--------------|--------------------------------------------------------------------------| | `*sendbuff`: | The data to send to the root rank | @@ -432,6 +434,7 @@ int MPI_Allreduce( MPI_Comm comm  ); ``` + | | | |----------------|--------------------------------------------------| | `*sendbuff`: | The data to be reduced, on all ranks | @@ -443,7 +446,7 @@ int MPI_Allreduce( ![Each rank sending a piece of data to root rank](fig/allreduce.png) -`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. +`MPI_Allreduce()` performs the same operations as `MPI_Reduce()`, but the result is sent to all ranks rather than only being available on the root rank. This means we can remove the `MPI_Bcast()` in the previous code example and remove almost all of the code in the reduction example using point-to-point communication at the beginning of the episode. This is shown in the following code example: ```c @@ -573,6 +576,7 @@ double find_maximum(double *vector, int N) { return global_max; } ``` + :::: ::::: @@ -580,6 +584,6 @@ double find_maximum(double *vector, int N) { ## More collective operations are available -The collective functions introduced in this episode do not represent an exhaustive list of *all* collective operations in MPI. There are a number which are not covered, as their usage is not as common. You can usually find a list of the collective functions available for the implementation of MPI you choose to use, e.g. +The collective functions introduced in this episode do not represent an exhaustive list of *all* collective operations in MPI. There are a number which are not covered, as their usage is not as common. You can usually find a list of the collective functions available for the implementation of MPI you choose to use, e.g. [Microsoft MPI documentation](https://learn.microsoft.com/en-us/message-passing-interface/mpi-collective-functions). 
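The `MPI_Scatterv()` routine mentioned earlier for unevenly divisible data is not shown in this episode; a minimal sketch, assuming `my_rank` and `num_ranks` are already set and `send_buffer` holds `total` integers on rank 0, might look like this:

```c
int total = 10;
int *sendcounts = malloc(num_ranks * sizeof(int));
int *displs = malloc(num_ranks * sizeof(int));
for (int i = 0, offset = 0; i < num_ranks; ++i) {
    sendcounts[i] = total / num_ranks + (i < total % num_ranks ? 1 : 0); // spread the remainder
    displs[i] = offset;          // where this rank's chunk starts in send_buffer
    offset += sendcounts[i];
}

int *recv_buffer = malloc(sendcounts[my_rank] * sizeof(int));
MPI_Scatterv(send_buffer, sendcounts, displs, MPI_INT,
             recv_buffer, sendcounts[my_rank], MPI_INT, 0, MPI_COMM_WORLD);
```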
-:::: \ No newline at end of file +:::: diff --git a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md index a2a7bce9..afdec2ab 100644 --- a/high_performance_computing/hpc_mpi/06_non_blocking_communication.md +++ b/high_performance_computing/hpc_mpi/06_non_blocking_communication.md @@ -36,9 +36,9 @@ But, this isn't the complete picture. As we'll see later, we need to do some add non-blocking communications. :::: -By effectively utilising non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. -Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, -to ensure ranks are synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication, and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear-cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. +By effectively utilising non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. +Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, +to ensure ranks are synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication, and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear-cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. Therefore, unless our code is structured to take advantage of being able to overlap communication with computation, non-blocking communication adds complexity to our code for no gain. ![Non-blocking communication with data dependency](fig/non-blocking-wait-data.png) @@ -59,7 +59,9 @@ On the other hand, some disadvantages are: - It is more difficult to use non-blocking communication. 
Not only does it result in more, and more complex, lines of code, we also have to worry about rank synchronisation and data dependency. + - Whilst typically using non-blocking communication, where appropriate, improves performance, it's not always clear-cut or predictable if non-blocking will result in sufficient performance gains to justify the increased complexity. + ::: :::: @@ -84,8 +86,8 @@ int MPI_Isend( |-------------|-----------------------------------------------------| | `*buf`: | The data to be sent | | `count`: | The number of elements of data | -| `datatype`: | The data types of the data | -| `dest`: | The rank to send data to | +| `datatype`: | The data types of the data | +| `dest`: | The rank to send data to | | `tag`: | The communication tag | | `comm`: | The communicator | | `*request`: | The request handle, used to track the communication | @@ -93,16 +95,17 @@ int MPI_Isend( The arguments are identical to `MPI_Send()`, other than the addition of the `*request` argument. This argument is known as a *handle* (because it "handles" a communication request) which is used to track the progress of a (non-blocking) communication. -When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise -the program and make sure `*buf` is ready to be re-used. This is incredibly important to do. +When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise +the program and make sure `*buf` is ready to be re-used. This is incredibly important to do. Suppose we are sending an array of integers, ```c MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); some_ints[1] = 5; /* !!! don't do this !!! */ ``` -Modifying `some_ints` before the **send** has completed is undefined behaviour, and can result in breaking results! For -example, if `MPI_Isend()` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to + +Modifying `some_ints` before the **send** has completed is undefined behaviour, and can result in breaking results! For +example, if `MPI_Isend()` decides to use its buffered mode then modifying `some_ints` before it's finished being copied to the send buffer means the wrong data is sent. Every non-blocking communication has to have a corresponding `MPI_Wait()`, to wait and synchronise the program to ensure that the data being sent is ready to be modified again. `MPI_Wait()` is a blocking function which will only return when a communication has finished. ```c @@ -111,6 +114,7 @@ int MPI_Wait( MPI_Status *status ); ``` + | | | |-------------|------------------------------------------| | `*request`: | The request handle for the communication | @@ -141,8 +145,8 @@ int MPI_Irecv( | `comm`: | The communicator | | `*request`: | The request handle for the receive | - :::::challenge{id=true-or-false, title="True or False?"} + Is the following statement true or false? Non-blocking communication guarantees immediate completion of data transfer. ::::solution @@ -235,8 +239,8 @@ MPI_Request requests[2] = { send_req, recv_req }; MPI_Waitall(2, requests, statuses); // Wait for both requests in one function ``` -This version of the code will not deadlock, because the non-blocking functions return immediately. So even though rank -0 and 1 one both send, meaning there is no corresponding **receive**, the immediate return from send means the +This version of the code will not deadlock, because the non-blocking functions return immediately. 
So even though rank +0 and 1 one both send, meaning there is no corresponding **receive**, the immediate return from send means the receive function is still called. Thus, a deadlock cannot happen. However, it is still possible to create a deadlock using `MPI_Wait()`. If `MPI_Wait()` is waiting to for `MPI_Irecv()` to get some data, but there is no matching send operation (so no data has been sent), then `MPI_Wait()` can never return resulting in a deadlock. In the example code below, rank 0 becomes deadlocked. @@ -255,6 +259,7 @@ if (my_rank == 0) { MPI_Wait(&send_req, &status); MPI_Wait(&recv_req, &status); // Wait for both requests in one function ``` + :::: ::::: @@ -269,6 +274,7 @@ int MPI_Test( MPI_Status *status, ); ``` + | | | |-------------|-------------------------------------------------------| | `*request`: | The request handle for the communication | @@ -596,4 +602,4 @@ int main(int argc, char **argv) ``` :::: -::::: \ No newline at end of file +::::: diff --git a/high_performance_computing/hpc_mpi/07-derived-data-types.md b/high_performance_computing/hpc_mpi/07-derived-data-types.md index b5e907b4..761b707b 100644 --- a/high_performance_computing/hpc_mpi/07-derived-data-types.md +++ b/high_performance_computing/hpc_mpi/07-derived-data-types.md @@ -19,24 +19,24 @@ To help with this, MPI provides an interface to create new types known as derive ::::callout -## Size limitations for messages +## Size limitations for messages -All throughout MPI, the argument which says how many elements of data are being communicated is an integer: int count. -In most 64-bit Linux systems, ints are usually 32-bit and so the biggest number you can pass to count is 2^31 - 1 = 2,147,483,647, which is about 2 billion. Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In older MPI versions, there are two workarounds to this limitation. The first is to communicate large arrays in smaller, more manageable chunks. The other is to use derived types, to re-shape the data. +All throughout MPI, the argument which says how many elements of data are being communicated is an integer: int count. +In most 64-bit Linux systems, ints are usually 32-bit and so the biggest number you can pass to count is 2^31 - 1 = 2,147,483,647, which is about 2 billion. Arrays which exceed this length can't be communicated easily in versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In older MPI versions, there are two workarounds to this limitation. The first is to communicate large arrays in smaller, more manageable chunks. The other is to use derived types, to re-shape the data. :::: Almost all scientific and computing problems nowadays require us to think in more than one dimension. Using multi-dimensional arrays, such for matrices or tensors, or discretising something onto a 2D or 3D grid of points are fundamental parts of a lot of software. However, the additional dimensions comes with additional complexity, -not just in the code we write, but also in how data is communicated. +not just in the code we write, but also in how data is communicated. 
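As a rough sketch of the "smaller chunks" workaround mentioned in the size-limitation callout above (the element counts and the `big_array` buffer here are illustrative assumptions, and the receiver is assumed to post matching receives of the same sizes):

```c
// Split a very long send into pieces whose element counts each fit in an int.
long long total_elements = 5000000000LL;   // more than a 32-bit count can describe
int max_chunk = 1000000000;                // each chunk's count fits in an int
for (long long sent = 0; sent < total_elements; sent += max_chunk) {
    long long remaining = total_elements - sent;
    int count = (int)(remaining < max_chunk ? remaining : max_chunk);
    MPI_Send(&big_array[sent], count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
}
```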
-To create a 2 x 3 matrix, in C, and initialize it with some values, we use the following syntax, +To create a 2 x 3 matrix, in C, and initialize it with some values, we use the following syntax, ```c int matrix[2][3] = { {1, 2, 3}, {4, 5, 6} }; // matrix[rows][cols] ``` -This creates an array with two rows and three columns. The first row contains the values {1, 2, 3} and the second row contains {4, 5, 6}. The number of rows and columns can be any value, as long as there is enough memory available. +This creates an array with two rows and three columns. The first row contains the values {1, 2, 3} and the second row contains {4, 5, 6}. The number of rows and columns can be any value, as long as there is enough memory available. ## The importance of memory contiguity @@ -56,31 +56,29 @@ followed by `x(i + 1, j)` and so on. The diagram below shows how a 4 x 4 matrix is mapped onto a linear memory space, for a row-major array. At the top of the diagram is the representation of the linear memory space, where each number is ID of the element in memory. Below that are two representations of the array in 2D: the left shows the coordinate of each element and the right shows the -ID of the element. +ID of the element. -![Column memory layout in C](fig/c_column_memory_layout.png) +![Column memory layout in C](fig/c_column_memory_layout.png)The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous in linear memory. The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). Columns in row-major arrays are contiguous. The next diagram instead shows how elements in adjacent rows are mapped in memory. -The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous in linear memory. The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). Columns in row-major arrays are contiguous. The next diagram instead shows how elements in adjacent rows are mapped in memory. +![Row memory layout in C](fig/c_row_memory_layout.png) -![Row memory layout in C](fig/c_row_memory_layout.png) - -Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, we can see that the elements are not contiguous. Element [0][1] maps to element 2 and element [1][1] maps to element 6 and so on. Elements in the same column but in a different row are separated by four other elements, in this example. In other words, elements in other rows are not contiguous. +Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, we can see that the elements are not contiguous. Element `[0][1]` maps to element 2 and element `[1][1]` maps to element 6 and so on. Elements in the same column but in a different row are separated by four other elements, in this example. In other words, elements in other rows are not contiguous. :::::challenge{id=memory-contiquity, title="Does memory contiguity affect performance?"} -Do you think memory contiguity could impact the performance of our software, in a negative way? +Do you think memory contiguity could impact the performance of our software, in a negative way? ::::solution -Yes, memory contiguity can affect how fast our programs run. When data is stored in a neat and organized way, the computer can find and use it quickly. 
But if the data is scattered around randomly (fragmented), it takes more time to locate and use it, which decreases performance. Keeping our data and data access patterns organized can make our programs faster. But we probably won't notice the difference for small arrays and data structures. +Yes, memory contiguity can affect how fast our programs run. When data is stored in a neat and organized way, the computer can find and use it quickly. But if the data is scattered around randomly (fragmented), it takes more time to locate and use it, which decreases performance. Keeping our data and data access patterns organized can make our programs faster. But we probably won't notice the difference for small arrays and data structures. :::: ::::: -::::callout +::::callout ## What about if I use `malloc()`? -More often than not we will see `malloc()` being used to allocate memory for arrays. Especially if the code is using an older standard, such as C90, which does not support +More often than not we will see `malloc()` being used to allocate memory for arrays. Especially if the code is using an older standard, such as C90, which does not support [variable length arrays](https://en.wikipedia.org/wiki/Variable-length_array). When we use `malloc()`, we get a contiguous array of elements. To create a 2D array using `malloc()`, we have to first create an array of pointers (which are contiguous) and allocate memory for each pointer: ```c @@ -98,15 +96,15 @@ for (int i = 0; i < num_rows; ++i) { } ``` -There is one problem though. `malloc()` does not guarantee that subsequently allocated memory will be contiguous. When `malloc()` requests memory, the operating system will assign whatever memory is free. This is not always next to the block of memory from the previous allocation. This makes life tricky, since data *has* to be contiguous for MPI communication. But there are workarounds. One is to only use 1D arrays (with the same number of elements as the higher dimension array) and to map the n-dimensional coordinates into a linear coordinate system. For example, the element -`[2][4]` in a 3 x 5 matrix would be accessed as, +There is one problem though. `malloc()` does not guarantee that subsequently allocated memory will be contiguous. When `malloc()` requests memory, the operating system will assign whatever memory is free. This is not always next to the block of memory from the previous allocation. This makes life tricky, since data *has* to be contiguous for MPI communication. But there are workarounds. One is to only use 1D arrays (with the same number of elements as the higher dimension array) and to map the n-dimensional coordinates into a linear coordinate system. For example, the element +`[2][4]` in a 3 x 5 matrix would be accessed as, ```c int index_for_2_4 = matrix1d[5 * 2 + 4]; // num_cols * row + col ``` Another solution is to move memory around so that it is contiguous, such as in [this example](code/examples/07-malloc-trick.c) or by using a more sophisticated function such as [`arralloc()` function](code/arralloc.c) (not part of the standard library) which can allocate arbitrary n-dimensional arrays into a contiguous block. -:::: +:::: For a row-major array, we can send the elements of a single row (for a 4 x 4 matrix) easily, @@ -114,12 +112,12 @@ For a row-major array, we can send the elements of a single row (for a 4 x 4 mat MPI_Send(&matrix[1][0], 4, MPI_INT ...); ``` -The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. 
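Returning to the `malloc()` callout above, one common way to get a genuinely contiguous 2D allocation (a sketch, not the `arralloc()` helper linked there) is to allocate a single block and point each row into it:

```c
// One block of num_rows * num_cols doubles, plus row pointers into that block
// so that matrix[i][j] indexing still works.
double **matrix = malloc(num_rows * sizeof(double *));
double *block = malloc(num_rows * num_cols * sizeof(double));
for (int i = 0; i < num_rows; ++i) {
    matrix[i] = &block[i * num_cols];
}
// block (equivalently &matrix[0][0]) can now be used directly as an MPI buffer
```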
As the columns are four elements long, we have specified to only send four integers. Even though we're working here with a 2D array, sending a single row of the matrix is the same as sending a 1D array. Instead of using a pointer to the start of the array, an address to the first element of the row (`&matrix[1][0]`) is used instead. It's not possible to do the same for a column of the matrix, because the elements down the column are not contiguous. +The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. As the columns are four elements long, we have specified to only send four integers. Even though we're working here with a 2D array, sending a single row of the matrix is the same as sending a 1D array. Instead of using a pointer to the start of the array, an address to the first element of the row (`&matrix[1][0]`) is used instead. It's not possible to do the same for a column of the matrix, because the elements down the column are not contiguous. ## Using vectors to send slices of an array To send a column of a matrix or array, we have to use a *vector*. A vector is a derived data type that represents multiple (or one) contiguous sequences of elements, which have a regular spacing between them. By using vectors, we can create data types for column vectors, row vectors or sub-arrays, similar to how we can -[create slices for Numpy arrays in Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, communication. +[create slices for Numpy arrays in Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, communication. To create a vector, we create a new data type using `MPI_Type_vector()`, ```c @@ -131,6 +129,7 @@ int MPI_Type_vector( MPI_Datatype *newtype ); ``` + | | | |----------------|----------------------------------------------------------------------| | `count`: | The number of "blocks" which make up the vector | @@ -140,39 +139,38 @@ int MPI_Type_vector( | `newtype`: | The newly created data type to represent the vector | To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix -with a row in between (rows 2 and 4), - -![How a vector is laid out in memory"](fig/vector_linear_memory.png) +with a row in between (rows 2 and 4), +![How a vector is laid out in memory"](fig/vector_linear_memory.png) A *block* refers to a sequence of contiguous elements. In the diagrams above, each sequence of contiguous purple or orange elements represents a block. The *block length* is the number of elements within a block; in the above this is four. The *stride* is the distance between the start of each block, which is eight in the example. The count is the number of blocks we want. When we create a vector, we're creating a new derived data type which includes one or more -blocks of contiguous elements. +blocks of contiguous elements. ::::callout -## Why is this functionality useful? +## Why is this functionality useful? -The advantage of using derived types to send vectors is to streamline and simplify communication of complex and non-contiguous data. They are most commonly used where there are boundary regions between MPI ranks, such as in simulations using domain decomposition (see the optional Common Communication Patterns episode for more detail), irregular meshes or composite data structures (covered in the optional Advanced Data Communication episode). 
-:::: +The advantage of using derived types to send vectors is to streamline and simplify communication of complex and non-contiguous data. They are most commonly used where there are boundary regions between MPI ranks, such as in simulations using domain decomposition (see the optional Common Communication Patterns episode for more detail), irregular meshes or composite data structures (covered in the optional Advanced Data Communication episode). +:::: -Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. This finalises the creation of a derived type. Forgetting to do this step leads to unexpected behaviour, and potentially disastrous consequences! +Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. This finalises the creation of a derived type. Forgetting to do this step leads to unexpected behaviour, and potentially disastrous consequences! ```c int MPI_Type_commit( MPI_Datatype *datatype // The data type to commit - note that this is a pointer ); -``` +``` -When a data type is committed, resources which store information on how to handle it are internally allocated. This contains data structures such as memory buffers as well as data used for bookkeeping. Failing to free those resources after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. To free up the resources, we use `MPI_Type_free()`, +When a data type is committed, resources which store information on how to handle it are internally allocated. This contains data structures such as memory buffers as well as data used for bookkeeping. Failing to free those resources after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. To free up the resources, we use `MPI_Type_free()`, ```c int MPI_Type_free ( MPI_Datatype *datatype // The data type to clean up -- note this is a pointer ); -``` +``` The following example code uses a vector to send two rows from a 4 x 4 matrix, as in the example diagram above. @@ -215,9 +213,9 @@ MPI_Type_free(&rows_type); There are two things above, which look quite innocent, but are important to understand. First of all, the send buffer in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. In `MPI_Send()`, the send buffer is a pointer to the memory location where the start of the data is stored. In the above example, the intention is to only send the second and forth rows, so the start location of the data to send is the address for element `[1][0]`. If we used `matrix`, the first and third rows would be sent instead. -The other thing to notice, which is not immediately clear why it's done this way, is that the receive data type is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This is because when a rank receives data, the data is contiguous array. We don't need to use a vector to describe the layout of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. +The other thing to notice, which is not immediately clear why it's done this way, is that the receive data type is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This is because when a rank receives data, the data is contiguous array. We don't need to use a vector to describe the layout of contiguous memory. 
We are just receiving a contiguous array of `num_elements = count * blocklength` integers. -::::challenge{id=sending-columns, title="Sending columns from an array"} +::::challenge{id=sending-columns title="Sending columns from an array"} Create a vector type to send a column in the following 2 x 3 array: @@ -226,17 +224,17 @@ int matrix[2][3] = { {1, 2, 3}, {4, 5, 6}, }; -``` +``` -With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 to rank 1 and print the results. You may want to use [this code](code/solutions/skeleton-example.c) as your starting point. +With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 to rank 1 and print the results. You may want to use [this code](code/solutions/skeleton-example.c) as your starting point. :::solution -If your solution is correct you should see 2 and 5 printed to the screen. In the solution below, to send a 2 x 1 column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. To send the correct column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. To see why the stride is 3, take a look at the diagram below, +If your solution is correct you should see 2 and 5 printed to the screen. In the solution below, to send a 2 x 1 column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. To send the correct column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. To see why the stride is 3, take a look at the diagram below, -![Stride example for question](fig/stride_example_2x3.png) +![Stride example for question](fig/stride_example_2x3.png) -You can see that there are *three* contiguous elements between the start of each block of 1. +You can see that there are *three* contiguous elements between the start of each block of 1. ```c #include @@ -285,14 +283,15 @@ int main(int argc, char **argv) return MPI_Finalize(); } -``` +``` + ::: -:::: +:::: -::::challenge{id=sending-subarrays, title="Sending sub-arrays of an array"} +::::challenge{id=sending-subarrays title="Sending sub-arrays of an array"} -By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank -1, +By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank +1, ```c int matrix[4][4] = { @@ -302,16 +301,17 @@ int matrix[4][4] = { {13, 14, 15, 16} }; ``` -You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with the 4 x 4 matrix above and modifying the vector type and communication functions as required. -:::solution +You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with the 4 x 4 matrix above and modifying the vector type and communication functions as required. -The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. In the solution below, we have created a vector with a count and block length of 2 and with a stride of 4. The first two arguments means two vectors of block length 2 will be sent. The stride of 4 results from that there are 4 elements between the start of each distinct block as shown in the image below, +:::solution + +The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. 
In the solution below, we have created a vector with a count and block length of 2 and with a stride of 4. The first two arguments means two vectors of block length 2 will be sent. The stride of 4 results from that there are 4 elements between the start of each distinct block as shown in the image below, -![Stride example for subarray question](fig/stride_example_4x4.png) +![Stride example for subarray question](fig/stride_example_4x4.png) -You must always remember to send the address for the starting point of the *first* block as the send buffer, which -is why `&array[1][1]` is the first argument in `MPI_Send()`. +You must always remember to send the address for the starting point of the *first* block as the send buffer, which +is why `&array[1][1]` is the first argument in `MPI_Send()`. ```c #include @@ -363,5 +363,6 @@ int main(int argc, char **argv) return MPI_Finalize(); } ``` + ::: -:::: \ No newline at end of file +:::: diff --git a/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md index 2dc1075a..e853451f 100644 --- a/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md +++ b/high_performance_computing/hpc_mpi/08_porting_serial_to_mpi.md @@ -289,7 +289,7 @@ Then at the very end of `main()` let's complete our use of MPI: ### `main()`: Initialising the Simulation and Printing the Result -Since we're not initialising the entire stick (`GRIDSIZE`), but only the section apportioned to our rank (`rank_gridsize`), +Since we're not initialising the entire stick (`GRIDSIZE`), but only the section apportioned to our rank (`rank_gridsize`), we need to adjust the loop that initialises `u` and `rho` accordingly. The revised loop as follows: ```c @@ -376,7 +376,7 @@ Insert the following into the `poisson_step()` function, putting the declaration MPI_Allreduce(&unorm, &global_unorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); ``` -So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value +So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: ```c @@ -463,7 +463,6 @@ Once complete across all ranks, every rank will then have the slice boundary dat You can obtain a full version of the parallelised Poisson code from [here](code/examples/poisson/poisson_mpi.c). Now we have the parallelised code in place, we can compile and run it, e.g.: - ```bash mpicc poisson_mpi.c -o poisson_mpi mpirun -n 2 poisson_mpi @@ -513,4 +512,4 @@ At initialisation, instead of setting it to zero we could do: ``` :::: -::::: \ No newline at end of file +::::: diff --git a/high_performance_computing/hpc_mpi/09_optimising_mpi.md b/high_performance_computing/hpc_mpi/09_optimising_mpi.md index 3fbaec79..4e47c0c8 100644 --- a/high_performance_computing/hpc_mpi/09_optimising_mpi.md +++ b/high_performance_computing/hpc_mpi/09_optimising_mpi.md @@ -28,8 +28,8 @@ Therefore, it's really helpful to understand how well our code *scales* in perfo ## Prerequisite: [Intro to High Performance Computing](../hpc_intro/01_hpc_intro) -Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools that are only available on -an HPC cluster. 
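As a worked illustration of the strong-scaling limit discussed below: if a fraction $p$ of the runtime can be parallelised, Amdahl's law bounds the speedup on $N$ ranks,

$$ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} $$

For example, with $p = 0.95$ and $N = 32$ this gives a speedup of roughly 12.5, and however many ranks we add the speedup can never exceed $1 / (1 - p) = 20$.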
+Whilst the previous episodes can be done on a laptop or desktop, this episode covers how to profile your code using tools +that are only available on an HPC cluster. :::: ## Characterising the Scalability of Code @@ -49,10 +49,9 @@ Ideally, we would like software to have a linear speedup that is equal to the nu (speedup = N), as that would mean that every processor would be contributing 100% of its computational power. Unfortunately, this is a very challenging goal for real applications to attain, since there is always an overhead to making parallel use of greater resources. -In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation routines, I/O operations and -inter-communication) which cannot be parallelised. -This limits how much a program can be sped up, -as the program will always take at least the length of the serial portion. +In addition, in a program there is always some portion of it which must be executed in serial (such as initialisation +routines, I/O operations and inter-communication) which cannot be parallelised. +This limits how much a program can be sped up, as the program will always take at least the length of the serial portion. ### Amdahl's Law and Strong Scaling @@ -136,6 +135,7 @@ The 32 ranks don't fit on one CPU and communicating between the two CPUs takes m The communication could be made more efficient. We could use non-blocking communication and do some of the computation while communication is happening. + :::: ::::: @@ -186,7 +186,9 @@ The increase in runtime is probably due to the memory bandwidth of the node bein The other significant factors in the speed of a parallel program are communication speed and latency. -Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying hardware for the communication. Latency consists of the software latency (how long the operating system needs in order to prepare for a communication), and the hardware latency (how long the hardware takes to send/receive even a small bit of data). +Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying +hardware for the communication. Latency consists of the software latency (how long the operating system needs in order to prepare +for a communication), and the hardware latency (how long the hardware takes to send/receive even a small bit of data). For a fixed-size problem, the time spent in communication is not significant when the number of ranks is small and the execution of parallel regions gets faster with the number of ranks. But if we keep increasing the number of ranks, the time spent in communication grows when multiple cores are involved with communication. @@ -284,7 +286,7 @@ terminal and one `.html` file which can be opened in a browser cat poisson_mpi_4p_1n_2024-01-30_15-38.txt ``` -``` +```text Command: mpirun -n 4 poisson_mpi Resources: 1 node (28 physical, 56 logical cores per node) Memory: 503 GiB per node @@ -319,8 +321,8 @@ spent in the actual compute sections of the code. :::::challenge{id=profile-poisson, title="Profile Your Poisson Code"} Compile, run and analyse your own MPI version of the poisson code. -How closely does it match the performance above? What are the main differences? -Try reducing the number of processes used, rerun and investigate the profile. Is it still MPI-bound? +How closely does it match the performance above? What are the main differences? 
+Try reducing the number of processes used, rerun and investigate the profile. Is it still MPI-bound? Increase the problem size, recompile, rerun and investigate the profile. What has changed now? @@ -343,4 +345,4 @@ A general workflow for optimising a code, whether parallel or serial, is as foll Then we can repeat this process as needed. But note that eventually this process will yield diminishing returns, and over-optimisation is a trap to avoid - hence the importance of continuing to measure efficiency as you progress. -:::: \ No newline at end of file +:::: diff --git a/high_performance_computing/hpc_mpi/10_communication_patterns.md b/high_performance_computing/hpc_mpi/10_communication_patterns.md index f1e2be2e..4ee6d804 100644 --- a/high_performance_computing/hpc_mpi/10_communication_patterns.md +++ b/high_performance_computing/hpc_mpi/10_communication_patterns.md @@ -89,7 +89,7 @@ In the next code example, a Monte Carlo algorithm is implemented which estimates To do this, a billion random points are generated and checked to see if they fall inside or outside a circle. The ratio of points inside the circle to the total number of points is proportional to the value of $\pi$. -Since each point generated and its position within the circle is completely independent to the other points, +Since each point generated and its position within the circle is completely independent to the other points, the communication pattern is simple (this is also an example of an embarrassingly parallel problem) as we only need one reduction. To parallelise the problem, each rank generates a sub-set of the total number of points and a reduction is done at the end, to calculate the total number of points within the circle from the entire sample. @@ -242,12 +242,14 @@ void scatter_sub_arrays_to_other_ranks(double *image, double *rank_image, MPI_Da MPI_Recv(rank_image, num_elements_per_rank, MPI_DOUBLE, ROOT_RANK, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } } + ``` + :::: The function [`MPI_Dims_create()`](https://www.open-mpi.org/doc/v4.1/man3/MPI_Dims_create.3.php) is a useful utility function in MPI which is used to determine the dimensions of a Cartesian grid of ranks. In the above example, it's used to determine the number of rows and columns in each sub-array, given the number of ranks in the row and column directions of the grid of ranks from `MPI_Dims_create()`. -In addition to the code above, you may also want to create a +In addition to the code above, you may also want to create a [*virtual Cartesian communicator topology*](https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node187.htm#Node187) to reflect the decomposed geometry in the communicator as well, as this give access to a number of other utility functions which makes communicating data easier. ### Halo exchange @@ -312,6 +314,7 @@ int MPI_Sendrecv( | `recvtag`: | The communication tag for the receive | | `comm`: | The communicator | | `*status`: | The status handle for the receive | + :::: ```c @@ -362,4 +365,4 @@ To communicate the halos, we need to: To re-build the sub-domains into one domain, we can do the reverse of the hidden code exert of the function `scatter_sub_arrays_to_other ranks`. Instead of the root rank sending data, it instead receives data from the other ranks using the same `sub_array_t` derived type. 
:::: -::::: \ No newline at end of file +::::: diff --git a/high_performance_computing/hpc_mpi/11_advanced_communication.md b/high_performance_computing/hpc_mpi/11_advanced_communication.md index 5f4f8190..157ee740 100644 --- a/high_performance_computing/hpc_mpi/11_advanced_communication.md +++ b/high_performance_computing/hpc_mpi/11_advanced_communication.md @@ -13,8 +13,11 @@ learningOutcomes: - Know how to define and use derived datatypes. --- -In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types that allow us to handle other complex data structures efficiently. In this episode, we will see how to create structure derived types. Additionally, we will also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous into a single contiguous buffer, when other methods of communication are too complicated or inefficient. - +In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, +which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types that allow us to +handle other complex data structures efficiently. In this episode, we will see how to create structure derived types. Additionally, we will +also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous into a single contiguous +buffer, when other methods of communication are too complicated or inefficient. ## Structures in MPI @@ -47,12 +50,12 @@ int MPI_Type_create_struct( The main difference between vector and struct derived types is that the arguments for structs expect arrays, since structs are made up of multiple variables. Most of these arguments are straightforward, given what we've just seen for defining vectors. But `array_of_displacements` is new and unique. -When a struct is created, it occupies a single contiguous block of memory. But there is a catch. For performance reasons, compilers insert arbitrary "padding" between each member for performance reasons. This padding, known as +When a struct is created, it occupies a single contiguous block of memory. But there is a catch. For performance reasons, compilers insert arbitrary "padding" between each member for performance reasons. This padding, known as [data structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment), optimises both the layout of the memory and the access of it. As a result, the memory layout of a struct may look like this instead: -![Memory layout for a struct](fig/struct_memory_layout.png) +![Memory layout for a struct](fig/struct_memory_layout.png) -Although the memory used for padding and the struct's data exists in a contiguous block, +Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about is no longer contiguous. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practise, it serves a similar purpose of the stride in vectors. To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. 
To do this, we can use the function `MPI_Get_address()`, @@ -198,6 +201,7 @@ int main(int argc, char **argv) return MPI_Finalize(); } ``` + :::: ::::: @@ -327,7 +331,9 @@ int MPI_Pack_size( MPI_Comm comm, int *size ); + ``` + | | | |-------------|----------------------------------------------------------| | `incount`: | The number of data elements | @@ -457,6 +463,7 @@ MPI_Get_count(&status, MPI_PACKED, &buffer_size); // MPI_PACKED represents an element of a "byte stream." So, buffer_size is the size of the buffer to allocate char *buffer = malloc(buffer_size); ``` + :::: :::::challenge{id=heterogeneous-data, title="Sending Heterogeneous Data in a Single Communication"} @@ -582,6 +589,8 @@ int main(int argc, char **argv) return MPI_Finalize(); } + ``` + :::: -::::: \ No newline at end of file +::::: From 734748425575e5f3fe0b79efa312394130fcc49e Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 10 Dec 2024 09:23:42 +0000 Subject: [PATCH 22/34] Fix markdown linting issues in hpc_parallel_intro --- .../hpc_parallel_intro/01_introduction.md | 44 ++++++++----------- 1 file changed, 19 insertions(+), 25 deletions(-) diff --git a/high_performance_computing/hpc_parallel_intro/01_introduction.md b/high_performance_computing/hpc_parallel_intro/01_introduction.md index 36c885ec..b3cf9f9b 100644 --- a/high_performance_computing/hpc_parallel_intro/01_introduction.md +++ b/high_performance_computing/hpc_parallel_intro/01_introduction.md @@ -35,26 +35,23 @@ The “_processing units_” might include central processing units (**CPU**s), Typical programming assumes that computers execute one operation at a time in the sequence specified by your program code. At any time step, the computer’s CPU core will be working on one particular operation from the sequence. In other words, a problem is broken into discrete series of instructions that are executed one for another. -Therefore only one instruction can execute at any moment in time. We will call this traditional style of sequential computing. +Therefore, only one instruction can execute at any moment in time. We will call this traditional style of sequential computing. In contrast, with parallel computing we will now be dealing with multiple CPU cores that each are independently and simultaneously working on a series of instructions. This can allow us to do much more at once, and therefore get results more quickly than if only running an equivalent sequential program. The act of changing sequential code to parallel code is called parallelisation. -| **Sequential Computing** | **Parallel Computing** | -| --- | --- | +| **Sequential Computing** | **Parallel Computing** | +|-------------------------------------------|----------------------------------------------| | ![Serial Computing](fig/serial2_prog.png) | ![Parallel Computing](fig/parallel_prog.png) | - -::::callout - -## Analogy +::::callout{variant="tip"} The basic concept of parallel computing is simple to understand: we divide our job in tasks that can be executed at the same time so that we finish the job in a fraction of the time that it would have taken if the tasks are executed one by one. Suppose that we want to paint the four walls in a room. This is our **problem**. We can divide our **problem** into 4 different **tasks**: paint each of the walls. -In principle, our 4 tasks are independent from each other in the sense that we don't need to finish one to start another. +In principle, our 4 tasks are independent of each other in the sense that we don't need to finish one to start another. 
However, this does not mean that the tasks can be executed simultaneously or in parallel. -It all depends on on the amount of resources that we have for the tasks. +It all depends on the amount of resources that we have for the tasks. If there is only one painter, they could work for a while in one wall, then start painting another one, then work a little bit on the third one, and so on. The tasks are being executed concurrently **but not in parallel** and only one task is being performed at a time. @@ -124,9 +121,7 @@ While they provide an effective means of utilizing multiple CPU cores on a singl ![Threads](fig/multithreading.svg) -::::callout - -## Analogy +::::callout{variant="tip"} Let's go back to our painting 4 walls analogy. Our example painters have two arms, and could potentially paint with both arms at the same time. @@ -160,9 +155,7 @@ Distributed memory programming models, such as MPI, facilitate communication and - **Programming Complexity:** Shared memory programming models offer simpler constructs and require less explicit communication compared to distributed memory models. Distributed memory programming involves explicit data communication and synchronization, adding complexity to the programming process. -::::callout - -## Analogy +::::callout{variant="tip"} Imagine that all workers have to obtain their paint form a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. @@ -177,7 +170,7 @@ Suppose that worker A, for some reason, needs a colour that is only available in ::::callout -## Key Idea +### Key Idea In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous. @@ -188,13 +181,13 @@ For the different dispensers case for your workers, however, think of the memory ## MPI vs OpenMP: What is the difference? 
-| MPI | OpenMP | -|---------|-----------| -|Defines an API, vendors provide an optimized (usually binary) library implementation that is linked using your choice of compiler.|OpenMP is integrated into the compiler (e.g., gcc) and does not offer much flexibility in terms of changing compilers or operating systems unless there is an OpenMP compiler available for the specific platform.| -|Offers support for C, Fortran, and other languages, making it relatively easy to port code by developing a wrapper API interface for a pre-compiled MPI implementation in a different language.|Primarily supports C, C++, and Fortran, with limited options for other programming languages.| -|Suitable for both distributed memory and shared memory (e.g., SMP) systems, allowing for parallelization across multiple nodes.|Designed for shared memory systems and cannot be used for parallelization across multiple computers.| -|Enables parallelism through both processes and threads, providing flexibility for different parallel programming approaches.|Focuses solely on thread-based parallelism, limiting its scope to shared memory environments.| -|Creation of process/thread instances and communication can result in higher costs and overhead.|Offers lower overhead, as inter-process communication is handled through shared memory, reducing the need for expensive process/thread creation.| +| MPI | OpenMP | +|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Defines an API, vendors provide an optimized (usually binary) library implementation that is linked using your choice of compiler. | OpenMP is integrated into the compiler (e.g., gcc) and does not offer much flexibility in terms of changing compilers or operating systems unless there is an OpenMP compiler available for the specific platform. | +| Offers support for C, Fortran, and other languages, making it relatively easy to port code by developing a wrapper API interface for a pre-compiled MPI implementation in a different language. | Primarily supports C, C++, and Fortran, with limited options for other programming languages. | +| Suitable for both distributed memory and shared memory (e.g., SMP) systems, allowing for parallelization across multiple nodes. | Designed for shared memory systems and cannot be used for parallelization across multiple computers. | +| Enables parallelism through both processes and threads, providing flexibility for different parallel programming approaches. | Focuses solely on thread-based parallelism, limiting its scope to shared memory environments. | +| Creation of process/thread instances and communication can result in higher costs and overhead. | Offers lower overhead, as inter-process communication is handled through shared memory, reducing the need for expensive process/thread creation. 
| :::: @@ -261,7 +254,7 @@ for(i=0; i Date: Wed, 8 Jan 2025 01:28:20 +0000 Subject: [PATCH 23/34] Update 04_synchronisation.md Fixed typo --- high_performance_computing/hpc_openmp/04_synchronisation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/high_performance_computing/hpc_openmp/04_synchronisation.md b/high_performance_computing/hpc_openmp/04_synchronisation.md index 21d3aa41..e62c21e5 100644 --- a/high_performance_computing/hpc_openmp/04_synchronisation.md +++ b/high_performance_computing/hpc_openmp/04_synchronisation.md @@ -140,7 +140,7 @@ copied back into `current_matrix`. Without this barrier, threads might read out Here, the number of rows (`nx`) is dynamically determined at runtime using `omp_get_max_threads()`. This function provides the maximum number of threads OpenMP can use in a parallel region, based on the system's resources and runtime configuration. Using this value, we define the number of rows in the matrix, with each row corresponding to a potential thread. This setup -ensures that both the `current_matrix` and `next_matrix provide` rows for the maximum number of threads allocated during parallel execution. +ensures that both the `current_matrix` and `next_matrix` provide rows for the maximum number of threads allocated during parallel execution. ```c ...... From dca45c6779f249cecbaa83d95eb126932427cfe0 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Mon, 20 Jan 2025 14:22:21 +0000 Subject: [PATCH 24/34] Changed variable names for the data type example & updated answer for data type challenge --- .../hpc_mpi/03_communicating_data.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/high_performance_computing/hpc_mpi/03_communicating_data.md b/high_performance_computing/hpc_mpi/03_communicating_data.md index bfe7e258..198fa2f4 100644 --- a/high_performance_computing/hpc_mpi/03_communicating_data.md +++ b/high_performance_computing/hpc_mpi/03_communicating_data.md @@ -88,7 +88,7 @@ types are in the table below: Remember, these constants aren't the same as the primitive types in C, so we can't use them to create variables, e.g., ```c -MPI_INT my_int = 1; +MPI_INT my_data = 1; ``` is not valid code because, under the hood, these constants are actually special data structures used by MPI. @@ -106,10 +106,10 @@ need to change the type, you would only have to do it in one place, e.g.: ```c // define constants for your data types -#define MPI_INT_TYPE MPI_INT -#define INT_TYPE int +#define MPI_DATA_TYPE MPI_INT +#define DATA_TYPE int // use them as you would normally -INT_TYPE my_int = 1; +INT_TYPE my_data = 1; ``` :::: @@ -129,9 +129,9 @@ For the following pieces of data, what MPI data types should you use? The fact that `a[]` is an array does not matter, because all the elements in `a[]` will be the same data type. In MPI, as we'll see in the next episode, we can either send a single value or multiple values (in an array). -1. `MPI_INT` -2. `MPI_DOUBLE` - `MPI_FLOAT` would not be correct as `float`'s contain 32 bits of data whereas `double`s are 64 bit. -3. `MPI_BYTE` or `MPI_CHAR` - you may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent. +1. Use `MPI_INT` or `MPI_LONG`, depending on the type of the array. +2. The array contains floating-point values. Use `MPI_DOUBLE` if the array type is `double`, or `MPI_FLOAT` if it's declared as `float`. +3. Use `MPI_BYTE` or `MPI_CHAR` for character arrays. 
You may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent. :::: ::::: From 4d863dd168b74bf3f78921e79348a89815bc3d75 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 11:53:12 +0000 Subject: [PATCH 25/34] Fixed monospace issue and updated explanation for MPI_Send --- .../hpc_mpi/04_point_to_point_communication.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md index 605285c0..0a0fa04e 100644 --- a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md +++ b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md @@ -27,8 +27,8 @@ Let's look at how `MPI_Send()` and `MPI_Recv()`are typically used: - Rank B must know that it is about to receive a message and acknowledge this by calling `MPI_Recv()`. This sets up a buffer for writing the incoming data when it arrives and instructs the communication device to listen for the message. -As mentioned in the previous episode, `MPI_Send()` and `MPI_Recv()` are *synchronous* operations, -and will not return until the communication on both sides is complete. +Note that `MPI_Send` and `MPI_Recv()` are often used in a synchronous manner, meaning they will not return until communication is complete on both sides. +However, as mentioned in the previous episode, `MPI_Send()` may return before the communication is complete, depending on the implementation and message size. ## Sending a Message: MPI_Send() @@ -49,10 +49,10 @@ int MPI_Send( |-----------------|---------------------------------------------------------------------------------------------------------------------------------------------| | `*data`: | Pointer to the start of the data being sent. We would not expect this to change, hence it's defined as `const` | | `count`: | Number of elements to send | -| `datatype`: | The type of the element data being sent, e.g. MPI_INTEGER, MPI_CHAR, MPI_FLOAT, MPI_DOUBLE, ... | +| `datatype`: | The type of the element data being sent, e.g. `MPI_INTEGER`, `MPI_CHAR`, `MPI_FLOAT`, `MPI_DOUBLE`, ... | | `destination`: | The rank number of the rank the data will be sent to | | `tag`: | An message tag (integer), which is used to differentiate types of messages. We can specify `0` if we don't need different types of messages | -| `communicator`: | The communicator, e.g. MPI_COMM_WORLD as seen in previous episodes | +| `communicator`: | The communicator, e.g. `MPI_COMM_WORLD` as seen in previous episodes | For example, if we wanted to send a message that contains `"Hello, world!\n"` from rank 0 to rank 1, we could state (assuming we were rank 0): @@ -149,7 +149,7 @@ int main(int argc, char** argv) { MPI_Comm_rank(MPI_COMM_WORLD,&rank); if( rank == 0 ){ - constant char *message = "Hello, world!\n"; + const char *message = "Hello, world!\n"; MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); } From 83d49cf27d54b94d18b26a667a50396daa721945 Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Tue, 21 Jan 2025 12:21:06 +0000 Subject: [PATCH 26/34] Updated collective operations table --- .../hpc_mpi/05_collective_communication.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index ef5a7a52..a049a45c 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -74,12 +74,14 @@ could ever write using point-to-point communications. There are several collective operations that are implemented in the MPI standard. The most commonly-used are: -| Type | Description | -|-----------------|----------------------------------------------------------------------| -| Synchronisation | Wait until all processes have reached the same point in the program. | -| One-To-All | One rank sends the same message to all other ranks. | -| All-to-One | All ranks send data to a single rank. | -| All-to-All | All ranks have data and all ranks receive data. | +| Type | Example MPI Function | Description | +|-----------------|----------------------|----------------------------------------------------------------| +| Synchronization | Barrier | Wait until all ranks reach the same point in the program. | +| One-To-All | Broadcast | One rank sends the same message to all other ranks. | +| All-To-One | Reduce | All ranks send data to a single rank. | +| One-To-Many | Scatter | A single rank sends different parts of data to multiple ranks. | +| Many-To-One | Gather | Multiple ranks send data to a single rank. | +| All-To-All | Allreduce | All ranks have data and all ranks receive data. | ## Synchronisation From 5dfb4d76d5c264e068c4bad0b39e82787b380d61 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 12:22:48 +0000 Subject: [PATCH 27/34] Fixed typo --- .../hpc_mpi/05_collective_communication.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index a049a45c..42158008 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -102,7 +102,7 @@ int MPI_Barrier( When a rank reaches a barrier, it will pause and wait for all the other ranks to catch up and reach the barrier as well. As ranks waiting at a barrier aren't doing anything, barriers should be used sparingly to avoid large synchronisation overheads, which affects the scalability of our program. We should also avoid using barriers in parts of our program has have complicated branches, as we may introduce a deadlock by having a barrier in only one branch. -In practise, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, +In practice, there are not that many practical use cases for a barrier in an MPI application. In a shared-memory environment, synchronisation is important to ensure consistent and controlled access to shared data. But in MPI, where each rank has its own private memory space and often resources, it's rare that we need to care about ranks becoming out-of-sync. However, one use case is when multiple ranks need to write *sequentially* to the same file. 
The code example below shows how you may handle this by using a barrier. ```c From 9d5c40c4bc2a449a4165b5eba793cf35a878518a Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 12:51:04 +0000 Subject: [PATCH 28/34] Fixed spelling issue --- .../hpc_mpi/05_collective_communication.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index 42158008..6aa7973d 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -217,7 +217,7 @@ We can use `MPI_Scatter()` to split the data into *equal* sized chunks and commu ```c int MPI_Scatter( - void *sendbuff, + void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuffer, @@ -230,7 +230,7 @@ int MPI_Scatter( | | | |----------------|----------------------------------------------------------------------------------------| -| `*sendbuff`: | The data to be scattered across ranks (only important for the root rank) | +| `*sendbuf`: | The data to be scattered across ranks (only important for the root rank) | | `sendcount`: | The number of elements of data to send to each rank (only important for the root rank) | | `sendtype`: | The data type of the data being sent (only important for the root rank) | | `*recvbuffer`: | A buffer to receive data into, including the root rank | @@ -275,7 +275,7 @@ We can do this by using the collection function `MPI_Gather()`, which has these ```c int MPI_Gather( - void *sendbuff, + void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuff, @@ -288,7 +288,7 @@ int MPI_Gather( | | | |--------------|--------------------------------------------------------------------------| -| `*sendbuff`: | The data to send to the root rank | +| `*sendbuf`: | The data to send to the root rank | | `sendcount`: | The number of elements of data to send | | `sendtype`: | The data type of the data being sent | | `*recvbuff`: | The buffer to put gathered data into (only important for the root rank) | @@ -375,7 +375,7 @@ Reduction operations can be done using the collection function `MPI_Reduce()`, w ```c int MPI_Reduce( - void *sendbuff, + void *sendbuf, void *recvbuffer, int count, MPI_Datatype datatype, @@ -387,7 +387,7 @@ int MPI_Reduce( | | | |----------------|-------------------------------------------------| -| `*sendbuff`: | The data to be reduced by the root rank | +| `*sendbuf`: | The data to be reduced by the root rank | | `*recvbuffer`: | A buffer to contain the reduction output | | `count`: | The number of elements of data to be reduced | | `datatype`: | The data type of the data | @@ -428,7 +428,7 @@ In the code example just above, after the reduction operation we used `MPI_Bcast ```c int MPI_Allreduce( - void *sendbuff, + void *sendbuf, void *recvbuffer, int count, MPI_Datatype datatype, @@ -439,7 +439,7 @@ int MPI_Allreduce( | | | |----------------|--------------------------------------------------| -| `*sendbuff`: | The data to be reduced, on all ranks | +| `*sendbuf`: | The data to be reduced, on all ranks | | `*recvbuffer`: | A buffer which will contain the reduction output | | `count`: | The number of elements of data to be reduced | | `datatype`: | The data type of the data | From f32c3f7b3bc74a15b25e5a4ef06b5562de11102d Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Tue, 21 Jan 2025 12:54:17 +0000 Subject: [PATCH 29/34] Fixed tyo --- .../hpc_mpi/05_collective_communication.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/high_performance_computing/hpc_mpi/05_collective_communication.md b/high_performance_computing/hpc_mpi/05_collective_communication.md index 6aa7973d..99165a33 100644 --- a/high_performance_computing/hpc_mpi/05_collective_communication.md +++ b/high_performance_computing/hpc_mpi/05_collective_communication.md @@ -406,7 +406,7 @@ The `op` argument controls which reduction operation is carried out, from the fo | `MPI_MAXLOC` | Return the maximum value and the number of the rank that sent the maximum value. | | `MPI_MINLOC` | Return the minimum value of the number of the rank that sent the minimum value. | -In a reduction operation, each ranks sends a piece of data to the root rank, which are combined, depending on the choice of operation, into a single value on the root rank, as shown in the diagram below. Since the data is sent and operation done on the root rank, it means the reduced value is only available on the root rank. +In a reduction operation, each rank sends a piece of data to the root rank, which are combined, depending on the choice of operation, into a single value on the root rank, as shown in the diagram below. Since the data is sent and operation done on the root rank, it means the reduced value is only available on the root rank. ![Each rank sending a piece of data to root rank](fig/reduction.png) From 0347cdb09f7ffc1dd0b538441c6e389ac1edb9d6 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 13:00:57 +0000 Subject: [PATCH 30/34] Fixed typos --- high_performance_computing/hpc_mpi/07-derived-data-types.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/high_performance_computing/hpc_mpi/07-derived-data-types.md b/high_performance_computing/hpc_mpi/07-derived-data-types.md index 761b707b..ed27ad7d 100644 --- a/high_performance_computing/hpc_mpi/07-derived-data-types.md +++ b/high_performance_computing/hpc_mpi/07-derived-data-types.md @@ -13,7 +13,7 @@ learningOutcomes: - Learn how to define and use derived data types. --- -We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're now dangerous enough to write a simple and successful MPI application. We've worked, so far, with simple data structures, such as single variables or small 1D arrays. In reality, any useful software we write will use more complex data structures, such as n-dimensional arrays, structures and other complex types. Working with these in MPI require a bit more work to communicate them correctly and efficiently. +We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're ready enough to write a simple and successful MPI application. We've worked, so far, with simple data structures, such as single variables or small 1D arrays. In reality, any useful software we write will use more complex data structures, such as n-dimensional arrays, structures and other complex types. Working with these in MPI require a bit more work to communicate them correctly and efficiently. To help with this, MPI provides an interface to create new types known as derived data types. A derived type acts as a way to enable the translation of complex data structures into instructions which MPI uses for efficient data access communication. 
In this episode, we will learn how to use derived data types to send array vectors and sub-arrays. @@ -213,7 +213,7 @@ MPI_Type_free(&rows_type); There are two things above, which look quite innocent, but are important to understand. First of all, the send buffer in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. In `MPI_Send()`, the send buffer is a pointer to the memory location where the start of the data is stored. In the above example, the intention is to only send the second and forth rows, so the start location of the data to send is the address for element `[1][0]`. If we used `matrix`, the first and third rows would be sent instead. -The other thing to notice, which is not immediately clear why it's done this way, is that the receive data type is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This is because when a rank receives data, the data is contiguous array. We don't need to use a vector to describe the layout of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. +The other thing to notice, which is not immediately clear why it's done this way, is that the receive data type is `MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This is because when a rank receives data, the data is a contiguous array. We don't need to use a vector to describe the layout of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. ::::challenge{id=sending-columns title="Sending columns from an array"} From d577b2a71f44ab15df47320fe5011ac1840bfb8f Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 13:05:57 +0000 Subject: [PATCH 31/34] Updated phrasing --- high_performance_computing/hpc_mpi/11_advanced_communication.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/high_performance_computing/hpc_mpi/11_advanced_communication.md b/high_performance_computing/hpc_mpi/11_advanced_communication.md index 157ee740..b481d084 100644 --- a/high_performance_computing/hpc_mpi/11_advanced_communication.md +++ b/high_performance_computing/hpc_mpi/11_advanced_communication.md @@ -16,7 +16,7 @@ learningOutcomes: In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types that allow us to handle other complex data structures efficiently. In this episode, we will see how to create structure derived types. Additionally, we will -also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous into a single contiguous +also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous data into a single contiguous buffer, when other methods of communication are too complicated or inefficient. ## Structures in MPI From bf17230525643a42c6e6ac557f1d34695755e1ad Mon Sep 17 00:00:00 2001 From: "Mehtap O. 
Arabaci" Date: Tue, 21 Jan 2025 13:13:37 +0000 Subject: [PATCH 32/34] Corrected spelling --- .../hpc_mpi/11_advanced_communication.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/high_performance_computing/hpc_mpi/11_advanced_communication.md b/high_performance_computing/hpc_mpi/11_advanced_communication.md index b481d084..270032f1 100644 --- a/high_performance_computing/hpc_mpi/11_advanced_communication.md +++ b/high_performance_computing/hpc_mpi/11_advanced_communication.md @@ -56,7 +56,7 @@ When a struct is created, it occupies a single contiguous block of memory. But t ![Memory layout for a struct](fig/struct_memory_layout.png) Although the memory used for padding and the struct's data exists in a contiguous block, -the actual data we care about is no longer contiguous. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practise, it serves a similar purpose of the stride in vectors. +the actual data we care about is no longer contiguous. This is why we need the `array_of_displacements` argument, which specifies the distance, in bytes, between each struct member relative to the start of the struct. In practice, it serves a similar purpose of the stride in vectors. To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. To do this, we can use the function `MPI_Get_address()`, @@ -266,7 +266,7 @@ array using `MPI_Pack()`. The coloured boxes in both memory representations (memory and packed) are the same chunks of data. The green boxes containing only a single number are used to document the number of elements in the block of elements they are adjacent -to, in the contiguous buffer. This is optional to do, but is generally good practise to include to create a +to, in the contiguous buffer. This is optional to do, but is generally good practice to include to create a self-documenting message. From the diagram we can see that we have "packed" non-contiguous blocks of memory into a single contiguous block. We can do this using `MPI_Pack()`. To reverse this action, and "unpack" the buffer, we use `MPI_Unpack()`. As you might expect, `MPI_Unpack()` takes a buffer, created by `MPI_Pack()` and unpacks the data back From f9151a616b15baa479e8ed8c7f67cbf54effff39 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 13:40:49 +0000 Subject: [PATCH 33/34] Fixed linting issues --- high_performance_computing/hpc_mpi/03_communicating_data.md | 2 +- .../hpc_mpi/04_point_to_point_communication.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/high_performance_computing/hpc_mpi/03_communicating_data.md b/high_performance_computing/hpc_mpi/03_communicating_data.md index 198fa2f4..4342af41 100644 --- a/high_performance_computing/hpc_mpi/03_communicating_data.md +++ b/high_performance_computing/hpc_mpi/03_communicating_data.md @@ -129,7 +129,7 @@ For the following pieces of data, what MPI data types should you use? The fact that `a[]` is an array does not matter, because all the elements in `a[]` will be the same data type. In MPI, as we'll see in the next episode, we can either send a single value or multiple values (in an array). -1. Use `MPI_INT` or `MPI_LONG`, depending on the type of the array. +1. Use `MPI_INT` or `MPI_LONG`, depending on the type of the array. 2. The array contains floating-point values. 
Use `MPI_DOUBLE` if the array type is `double`, or `MPI_FLOAT` if it's declared as `float`. 3. Use `MPI_BYTE` or `MPI_CHAR` for character arrays. You may want to use [strlen](https://man7.org/linux/man-pages/man3/strlen.3.html) to calculate how many elements of `MPI_CHAR` being sent. diff --git a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md index 0a0fa04e..e705095a 100644 --- a/high_performance_computing/hpc_mpi/04_point_to_point_communication.md +++ b/high_performance_computing/hpc_mpi/04_point_to_point_communication.md @@ -27,7 +27,7 @@ Let's look at how `MPI_Send()` and `MPI_Recv()`are typically used: - Rank B must know that it is about to receive a message and acknowledge this by calling `MPI_Recv()`. This sets up a buffer for writing the incoming data when it arrives and instructs the communication device to listen for the message. -Note that `MPI_Send` and `MPI_Recv()` are often used in a synchronous manner, meaning they will not return until communication is complete on both sides. +Note that `MPI_Send` and `MPI_Recv()` are often used in a synchronous manner, meaning they will not return until communication is complete on both sides. However, as mentioned in the previous episode, `MPI_Send()` may return before the communication is complete, depending on the implementation and message size. ## Sending a Message: MPI_Send() From 9d0f4ac1bf9150ff87bb14a7c9d0b791abe4ac22 Mon Sep 17 00:00:00 2001 From: "Mehtap O. Arabaci" Date: Tue, 21 Jan 2025 14:30:05 +0000 Subject: [PATCH 34/34] Fixed linting issues --- .../functional/higher_order_functions_cpp.md | 2 +- .../functional/higher_order_functions_python.md | 2 +- .../functional/side_effects_python.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/software_architecture_and_design/functional/higher_order_functions_cpp.md b/software_architecture_and_design/functional/higher_order_functions_cpp.md index 214b20aa..694f39a0 100644 --- a/software_architecture_and_design/functional/higher_order_functions_cpp.md +++ b/software_architecture_and_design/functional/higher_order_functions_cpp.md @@ -148,7 +148,7 @@ std::cout << result << std::endl; // prints 6 ## Higher-Order Functions Higher-order functions are functions that take another function as an argument -or that return a function. One of the main uses of lambda functions is to create +or that return a function. One of the main uses of lambda functions is to create temporary functions to pass into higher-order functions. To illustrate the benefits of higher-order functions, let us define two diff --git a/software_architecture_and_design/functional/higher_order_functions_python.md b/software_architecture_and_design/functional/higher_order_functions_python.md index b055241c..4c609303 100644 --- a/software_architecture_and_design/functional/higher_order_functions_python.md +++ b/software_architecture_and_design/functional/higher_order_functions_python.md @@ -70,7 +70,7 @@ Due to their simplicity, it can be useful to have a lamdba function as the inner ## Higher-Order Functions Higher-order functions are functions that take another function as an argument -or that return a function. One of the main uses of lambda functions is to create +or that return a function. One of the main uses of lambda functions is to create temporary functions to pass into higher-order functions. 
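The same idea can be shown without lambdas at all. In plain C, the "function passed in" is a function pointer, and the standard library's `qsort()` is a familiar higher-order function because it takes the comparison function as one of its arguments. The example below is a self-contained C illustration, used here only to make the mechanism explicit; it is not part of the lesson code, which uses C++ lambdas and Python.

```c
#include <stdio.h>
#include <stdlib.h>

/* The comparator is an ordinary function handed to qsort() */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int values[] = {5, 2, 9, 1, 7};
    size_t n = sizeof(values) / sizeof(values[0]);

    qsort(values, n, sizeof(int), compare_ints);  /* higher-order: takes compare_ints */

    for (size_t i = 0; i < n; ++i) {
        printf("%d ", values[i]);  /* prints 1 2 5 7 9 */
    }
    printf("\n");
    return 0;
}
```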
To illustrate the benefits of higher-order functions, let us define two diff --git a/software_architecture_and_design/functional/side_effects_python.md b/software_architecture_and_design/functional/side_effects_python.md index f0275124..d2c2251b 100644 --- a/software_architecture_and_design/functional/side_effects_python.md +++ b/software_architecture_and_design/functional/side_effects_python.md @@ -86,7 +86,7 @@ line = myfile.readline() # Same call to readline, but result is different! The main downside of having a state that is constantly updated is that it makes it harder for us to _reason_ about our code, to work out what it is doing. However, the upside is that we can use state to store temporary data and make -calculations more efficient. For example an iteration loop that keeps track of +calculations more efficient. For example an iteration loop that keeps track of a running total is a common pattern in procedural programming: ```python nolint