
Commit

[Grammar] 5-6 Roofline.md
dendibakh authored Aug 9, 2024
1 parent 6fe67dd commit 3de56ca
Showing 1 changed file with 8 additions and 10 deletions.
18 changes: 8 additions & 10 deletions chapters/5-Performance-Analysis-Approaches/5-6 Roofline.md
@@ -1,14 +1,12 @@


## The Roofline Performance Model {#sec:roofline}

-The Roofline performance model is a throughput-oriented performance model that is heavily used in the HPC world. It was developed at the University of California, Berkeley, in 2009. The "roofline" in this model expresses the fact that the performance of an application cannot exceed the machine's capabilities. Every function and every loop in a program is limited by either compute or memory capacity of a machine. This concept is represented in Figure @fig:RooflineIntro. The performance of an application will always be limited by a certain "roofline" function.
+The Roofline performance model is a throughput-oriented performance model that is heavily used in the HPC world. It was developed at the University of California, Berkeley, in 2009. The "roofline" in this model expresses the fact that the performance of an application cannot exceed the machine's capabilities. Every function and every loop in a program is limited by either the computing or memory bandwidth capacity of a machine. This concept is represented in Figure @fig:RooflineIntro. The performance of an application will always be limited by a certain "roofline" function.

![The Roofline Performance Model. The maximum performance of an application is limited by the minimum between peak FLOPS (horizontal line) and the platform bandwidth multiplied by arithmetic intensity (diagonal line). *© Image taken from [NERSC Documentation](https://docs.nersc.gov/development/performance-debugging-tools/roofline/#arithmetic-intensity-ai-and-achieved-performance-flops-for-application-characterization).*](../../img/perf-analysis/Roofline-intro.png){#fig:RooflineIntro width=80%}

Hardware has two main limitations: how fast it can make calculations (peak compute performance, FLOPS) and how fast it can move the data (peak memory bandwidth, GB/s). The maximum performance of an application is limited by the minimum between peak FLOPS (horizontal line) and the platform bandwidth multiplied by arithmetic intensity (diagonal line). The roofline chart in Figure @fig:RooflineIntro plots the performance of two applications `A` and `B` against hardware limitations. Application `A` has lower arithmetic intensity and its performance is bound by the memory bandwidth, while application `B` is more compute intensive and doesn't suffer as much from memory bottlenecks. Similar to this, `A` and `B` could represent two different functions within a program and have different performance characteristics. The Roofline performance model accounts for that and can display multiple functions and loops of an application on the same chart.
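
The bound described above (the minimum of the horizontal and diagonal lines) can be written as a one-line function. This is only an illustrative sketch: the name `roofline_bound` and the numbers in the usage note are hypothetical and do not describe any particular machine.

```c
#include <assert.h>

/* Attainable performance (GFLOP/s) under the Roofline model:
 * min(peak compute, memory bandwidth * arithmetic intensity).
 *   peak_gflops   - peak compute rate, GFLOP/s (horizontal line)
 *   bandwidth_gbs - peak memory bandwidth, GB/s
 *   intensity     - arithmetic intensity, FLOP/byte */
double roofline_bound(double peak_gflops, double bandwidth_gbs,
                      double intensity) {
  assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity > 0.0);
  double memory_roof = bandwidth_gbs * intensity; /* diagonal line */
  return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

For a hypothetical machine with a 100 GFLOP/s compute peak and 10 GB/s of memory bandwidth, `roofline_bound(100.0, 10.0, 0.125)` yields 1.25 GFLOP/s (the memory roof applies), while an intensity of 20 FLOP/byte is capped at the 100 GFLOP/s compute roof.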

-Arithmetic Intensity is a ratio between Floating-point operations (FLOPs) and bytes, and it can be calculated for every loop in a program. Let's calculate the arithmetic intensity of code in [@lst:BasicMatMul]. In the innermost loop body, we have a floating-point addition and a multiplication; thus, we have 2 FLOPs. Also, we have three read operations and one write operation; thus, we transfer `4 ops * 4 bytes = 16` bytes. Arithmetic intensity of that code is `2 / 16 = 0.125`. Arithmetic intensity is the X-axis on the Roofline chart, while the Y-axis measures performance of a given program.
+Arithmetic Intensity is a ratio between Floating-point operations (FLOPs) and bytes, and it can be calculated for every loop in a program. Let's calculate the arithmetic intensity of code in [@lst:BasicMatMul]. In the innermost loop body, we have a floating-point addition and a multiplication; thus, we have 2 FLOPs. Also, we have three read operations and one write operation; thus, we transfer `4 ops * 4 bytes = 16` bytes. The arithmetic intensity of that code is `2 / 16 = 0.125`. Arithmetic intensity is the X-axis on the Roofline chart, while the Y-axis measures the performance of a given program.
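
The same arithmetic can be expressed as a small helper. This is a minimal sketch; the function name `arithmetic_intensity` and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic intensity (FLOP/byte) of one loop iteration:
 * floating-point operations divided by bytes moved to and from memory. */
double arithmetic_intensity(unsigned flops, unsigned mem_ops,
                            size_t bytes_per_op) {
  assert(mem_ops > 0 && bytes_per_op > 0);
  return (double)flops / ((double)mem_ops * (double)bytes_per_op);
}
```

With 2 FLOPs and 4 memory operations of 4 bytes each, `arithmetic_intensity(2, 4, 4)` returns 0.125, matching the calculation above.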

Listing: Naive parallel matrix multiplication.

@@ -25,11 +23,11 @@ void matmul(int N, float a[][2048], float b[][2048], float c[][2048]) {
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The traditional way to speed up an application is to fully utilize the SIMD and multicore capabilities of a machine. Often times, we need to optimize for many aspects: vectorization, memory, threading. Roofline methodology can assist in assessing these characteristics of your application. On a roofline chart, we can plot theoretical maximums for scalar single-core, SIMD single-core, and SIMD multicore performance (see Figure @fig:RooflineIntro2). This will give us an understanding of the scope for improving the performance of an application. If we found that our application is compute-bound (i.e., has high arithmetic intensity) and is below the peak scalar single-core performance, we should consider forcing vectorization (see [@sec:Vectorization]) and distributing the work among multiple threads. Conversely, if an application has low arithmetic intensity, we should seek ways to improve memory accesses (see [@sec:MemBound]). The ultimate goal of optimizing performance using the Roofline model is to move the points up. Vectorization and threading move the dot up while optimizing memory accesses by increasing arithmetic intensity will move the dot to the right and also likely improve performance.
+The traditional way to speed up an application is to fully utilize the SIMD and multicore capabilities of a machine. Usually, we need to optimize for many aspects: vectorization, memory, and threading. Roofline methodology can assist in assessing these characteristics of your application. On a roofline chart, we can plot theoretical maximums for scalar single-core, SIMD single-core, and SIMD multicore performance (see Figure @fig:RooflineIntro2). This will give us an understanding of the scope for improving the performance of an application. If we find that our application is compute-bound (i.e., has high arithmetic intensity) and is below the peak scalar single-core performance, we should consider forcing vectorization (see [@sec:Vectorization]) and distributing the work among multiple threads. Conversely, if an application has low arithmetic intensity, we should seek ways to improve memory accesses (see [@sec:MemBound]). The ultimate goal of optimizing performance using the Roofline model is to move the points up. Vectorization and threading move the dot up while optimizing memory accesses by increasing arithmetic intensity will move the dot to the right and also likely improve performance.
![Roofline analysis of a program and potential ways to improve its performance.](../../img/perf-analysis/Roofline-intro2.jpg){#fig:RooflineIntro2 width=70%}
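
One way to make the compute-bound versus memory-bound distinction concrete is the ridge point: the arithmetic intensity at which the diagonal (memory) roof meets the horizontal (compute) roof, i.e. peak FLOPS divided by peak bandwidth. The sketch below assumes the simple two-roof model; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Arithmetic intensity (FLOP/byte) where the memory roof
 * reaches the compute roof. */
double ridge_point(double peak_gflops, double bandwidth_gbs) {
  assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
  return peak_gflops / bandwidth_gbs;
}

/* A loop left of the ridge point is limited by the memory roof;
 * at or right of it, by the compute roof. */
bool is_memory_bound(double intensity, double peak_gflops,
                     double bandwidth_gbs) {
  return intensity < ridge_point(peak_gflops, bandwidth_gbs);
}
```

For a hypothetical 100 GFLOP/s, 10 GB/s machine the ridge point is 10 FLOP/byte, so a loop with an intensity of 0.125 would be classified as memory-bound and a candidate for memory access optimizations rather than vectorization.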
-Theoretical maximums (roof lines) are often presented in a device specification and can be easily looked up. Also, theoretical maximums can be calculated based on the characteristics of the machine you are using. Usually, it is not hard to do once you know the parameters of your machine. For Intel Core i5-8259U processor, the maximum number of FLOPS (single-precision floats) with AVX2 and 2 Fused Multiply Add (FMA) units can be calculated as:
+Theoretical maximums (roof lines) are often presented in a device specification and can be easily looked up. Also, theoretical maximums can be calculated based on the characteristics of the machine you are using. Usually, it is not hard to do once you know the parameters of your machine. For the Intel Core i5-8259U processor, the maximum number of FLOPS (single-precision floats) with AVX2 and 2 Fused Multiply Add (FMA) units can be calculated as:
$$
\begin{aligned}
\textrm{Peak FLOPS} =& \textrm{ 8 (number of logical cores)}~\times~\frac{\textrm{256 (AVX bit width)}}{\textrm{32 bit (size of float)}} ~ \times ~ \\
@@ -46,11 +44,11 @@ $$
\end{aligned}
$$
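
A peak-FLOPS product of this kind can also be computed programmatically. The following is a generic sketch, not the calculation for the i5-8259U: the function name, the `flops_per_lane_per_cycle` parameter (which the caller must derive from the number and type of execution units), and the sample values in the usage note are all assumptions made for illustration.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s:
 * cores * SIMD lanes * FLOPs per lane per cycle * frequency (GHz).
 *   simd_bits / elem_bits gives the number of SIMD lanes,
 *   e.g. 256-bit vectors of 32-bit floats have 8 lanes. */
double peak_gflops(unsigned cores, unsigned simd_bits, unsigned elem_bits,
                   unsigned flops_per_lane_per_cycle, double freq_ghz) {
  assert(elem_bits > 0 && simd_bits % elem_bits == 0);
  unsigned lanes = simd_bits / elem_bits;
  return (double)cores * lanes * flops_per_lane_per_cycle * freq_ghz;
}
```

For a hypothetical 4-core machine with 256-bit vectors, 32-bit floats, 4 FLOPs per lane per cycle, and a 2.0 GHz clock, `peak_gflops(4, 256, 32, 4, 2.0)` gives 256 GFLOP/s.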
-Automated tools like [Empirical Roofline Tool](https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/)[^2] and [Intel Advisor](https://software.intel.com/content/www/us/en/develop/tools/advisor.html)[^3] are capable of empirically determining theoretical maximums by running a set of prepared benchmarks. If a calculation can reuse the data in cache, much higher FLOP rates are possible. Roofline can account for that by introducing a dedicated roofline for each level of the memory hierarchy (see Figure @fig:RooflineMatrix).
+Automated tools like [Empirical Roofline Tool](https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/)[^2] and [Intel Advisor](https://software.intel.com/content/www/us/en/develop/tools/advisor.html)[^3] are capable of empirically determining theoretical maximums by running a set of prepared benchmarks. If a calculation can reuse the data in the cache, much higher FLOP rates are possible. Roofline can account for that by introducing a dedicated roofline for each level of the memory hierarchy (see Figure @fig:RooflineMatrix).
-After hardware limitations are determined, we can start assessing the performance of an application against the roofline. The two most frequently used methods for automated collection of Roofline data are sampling (used by [likwid](https://github.com/RRZE-HPC/likwid)[^4] tool) and binary instrumentation, which is used by Intel Software Development Emulator ([SDE](https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html)[^5]). Sampling incurs lower overhead of collecting data, while binary instrumentation gives more accurate results.[^6] Intel Advisor automatically builds a Roofline chart and provides hints for performance optimization of a given loop. An example of a Roofline chart generated by Intel Advisor is presented in Figure @fig:RooflineMatrix. Notice, Roofline charts have logarithmic scales.
+After hardware limitations are determined, we can start assessing the performance of an application against the roofline. The two most frequently used methods for automated collection of Roofline data are sampling (used by [likwid](https://github.com/RRZE-HPC/likwid)[^4] tool) and binary instrumentation, which is used by Intel Software Development Emulator ([SDE](https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html)[^5]). Sampling incurs lower overhead of collecting data, while binary instrumentation gives more accurate results.[^6] Intel Advisor automatically builds a Roofline chart and provides hints for performance optimization of a given loop. An example of a Roofline chart generated by Intel Advisor is presented in Figure @fig:RooflineMatrix. Notice that Roofline charts have logarithmic scales.
-Roofline methodology enables tracking optimization progress by printing "before" and "after" points on the same chart. So, it is an iterative process that guides developers to help their applications to fully utilize hardware capabilities. Figure @fig:RooflineMatrix shows performance gains from making the following two changes to the code shown earlier in [@lst:BasicMatMul]:
+Roofline methodology enables tracking optimization progress by printing "before" and "after" points on the same chart. So, it is an iterative process that guides developers to help their applications fully utilize hardware capabilities. Figure @fig:RooflineMatrix shows performance gains from making the following two changes to the code shown earlier in [@lst:BasicMatMul]:
* Interchange the two innermost loops (swap lines 4 and 5). This enables cache-friendly memory accesses (see [@sec:MemBound]).
* Enable autovectorization of the innermost loop using AVX2 instructions.
@@ -68,7 +66,7 @@ In summary, the Roofline performance model can help to:
* NERSC Documentation, URL: [https://docs.nersc.gov/development/performance-debugging-tools/roofline/](https://docs.nersc.gov/development/performance-debugging-tools/roofline/).
* Lawrence Berkeley National Laboratory research, URL: [https://crd.lbl.gov/departments/computer-science/par/research/roofline/](https://crd.lbl.gov/departments/computer-science/par/research/roofline/)
-* A collection of video presentations about Roofline model and Intel Advisor, URL: [https://techdecoded.intel.io/](https://techdecoded.intel.io/) (search "Roofline").
+* A collection of video presentations about the Roofline model and Intel Advisor, URL: [https://techdecoded.intel.io/](https://techdecoded.intel.io/) (search "Roofline").
* `Perfplot` is a collection of scripts and tools that enable a user to instrument performance counters on a recent Intel platform, measure them, and use the results to generate roofline and performance plots. URL: [https://github.com/GeorgOfenbeck/perfplot](https://github.com/GeorgOfenbeck/perfplot)
[^2]: Empirical Roofline Tool - [https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/](https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/).