diff --git a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md
index cf5805d1f8..fb250d48da 100644
--- a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md
+++ b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md
@@ -1,13 +1,13 @@
## Case Study: Analyzing Performance Metrics of Four Benchmarks {#sec:PerfMetricsCaseStudy}

-Putting together everything we discussed so far in this chapter, we run four benchmarks from different domains and calculated their performance metrics. First of all, let's introduce the benchmarks.
+To put together everything we discussed so far in this chapter, let's look at some real-world examples. We ran four benchmarks from different domains and calculated their performance metrics. First of all, let's introduce the benchmarks.

1. Blender 3.4 - an open-source 3D creation and modeling software project. This test is of Blender's Cycles performance with the BMW27 blend file. All hardware threads are used. URL: [https://download.blender.org/release](https://download.blender.org/release). Command line: `./blender -b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x 1 -F JPEG -f 1`.
2. Stockfish 15 - an advanced open-source chess engine. This test is a stockfish built-in benchmark. A single hardware thread is used. URL: [https://stockfishchess.org](https://stockfishchess.org). Command line: `./stockfish bench 128 1 24 default depth`.
-3. Clang 15 selfbuild - this test uses clang 15 to build clang 15 compiler from sources. All hardware threads are used. URL: [https://www.llvm.org](https://www.llvm.org). Command line: `ninja -j16 clang`.
-4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All hardware threads are used. This test uses clover_bm.in input file (Problem 5). URL: [http://uk-mac.github.io/CloverLeaf](http://uk-mac.github.io/CloverLeaf). Command line: `./clover_leaf`.
+3. Clang 15 self-build - this test uses Clang 15 to build the Clang 15 compiler from sources. All hardware threads are used. URL: [https://www.llvm.org](https://www.llvm.org). Command line: `ninja -j16 clang`.
+4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All hardware threads are used. This test uses the clover_bm.in input file (Problem 5). URL: [http://uk-mac.github.io/CloverLeaf](http://uk-mac.github.io/CloverLeaf). Command line: `./clover_leaf`.

-For the purpose of this exercise, we run all four benchmarks on the machine with the following characteristics:
+For this exercise, we ran all four benchmarks on a machine with the following characteristics:

* 12th Gen Alderlake Intel(R) Core(TM) i7-1260P CPU @ 2.10GHz (4.70GHz Turbo), 4P+8E cores, 18MB L3-cache
* 16 GB RAM, DDR4 @ 2400 MT/s
@@ -21,88 +21,88 @@ To collect performance metrics, we use `toplev.py` script that is a part of [pmu
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v --
```

-Table {@tbl:perf_metrics_case_study} provides a side-by-side comparison of performance metrics for our four benchmarks. There is a lot we can learn about the nature of those workloads just by looking at the metrics. Here are the hypothesis we can make about the benchmarks before collecting performance profiles and diving deeper into the code of those applications.
+Table {@tbl:perf_metrics_case_study} provides a side-by-side comparison of performance metrics for our four benchmarks.
There is a lot we can learn about the nature of those workloads just by looking at the metrics. Here are the hypotheses we can make about the benchmarks before collecting performance profiles and diving deeper into the code of those applications. -* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low, it's only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute. +* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low, it's only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute. -* __Stockfish__. We ran it using only one hardware thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. The branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequently. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth is not an issue for Stockfish. Going further, we see that there is almost no FP operations in the workload. 
Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions. +* __Stockfish__. We ran it using only one hardware thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. The branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequent. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth is not an issue for Stockfish. Going further, we see that there are almost no FP operations in the workload. Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions. -* __Clang 15 selfbuild__. Compilation of C++ code is one of the tasks which has a very flat performance profile, i.e., there are no big hotspots. Usually you will see that the running time is attributed to many different functions. First thing we spot is that P-cores are doing 68% more work than E-cores and have 42% better IPC. But both P- and E-cores have low IPC. The L*MPKI metrics don't look troubling at first glance; however, in combination with the load miss real latency (`LdMissLat`, in core clocks), we can see that the average cost of a cache miss is quite high (~77 cycles). Now, when we look at the `*STLB_MPKI` metrics, we notice substantial differences with any other benchmark we test. This is due to another aspect of the Clang compiler (and other compilers as well): the size of the binary is relatively big (more than 100 MB). The code constantly jumps to distant places causing high pressure on the TLB subsystem. As you can see the problem exists both for instructions (see `Code stlb MPKI`) and data (see `Ld stlb MPKI`). Let's proceed with our analysis. DRAM bandwidth use is higher than for the two previous benchmarks, but still is not reaching even half of the maximum memory bandwidth on our platform (which is ~25 GB/s). Another concern for us is the very small number of instructions per call (`IpCall`): only ~41 instruction per function call. This is unfortunately the nature of the compilation codebase: it has thousands of small functions. The compiler needs to be more aggressive with inlining all those functions and wrappers. Yet, we suspect that the performance overhead associated with making a function call remains an issue for the Clang compiler. Also, one can spot the high `ipBranch` and `IpMispredict` metric. For Clang compilation, every fifth instruction is a branch and one of every ~35 branches gets mispredicted. There are almost no FP or vector instructions, but this is not surprising. Conclusion: Clang has a large codebase, flat profile, many small functions, and "branchy" code; performance is affected by data cache and TLB misses, and branch mispredictions. +* __Clang 15 selfbuild__. Compilation of C++ code is one of the tasks that has a very flat performance profile, i.e., there are no big hotspots. Usually, you will see that the running time is attributed to many different functions. The first thing we spot is that P-cores are doing 68% more work than E-cores and have 42% better IPC. But both P- and E-cores have low IPC. The L*MPKI metrics don't look troubling at first glance; however, in combination with the load miss real latency (`LdMissLat`, in core clocks), we can see that the average cost of a cache miss is quite high (~77 cycles). 
Now, when we look at the `*STLB_MPKI` metrics, we notice substantial differences with any other benchmark we test. This is due to another aspect of the Clang compiler (and other compilers as well): the size of the binary is relatively big (more than 100 MB). The code constantly jumps to distant places causing high pressure on the TLB subsystem. As you can see the problem exists both for instructions (see `Code stlb MPKI`) and data (see `Ld stlb MPKI`). Let's proceed with our analysis. DRAM bandwidth use is higher than for the two previous benchmarks, but still is not reaching even half of the maximum memory bandwidth on our platform (which is ~25 GB/s). Another concern for us is the very small number of instructions per call (`IpCall`): only ~41 instructions per function call. This is unfortunately the nature of the compilation codebase: it has thousands of small functions. The compiler needs to be more aggressive with inlining all those functions and wrappers. Yet, we suspect that the performance overhead associated with making a function call remains an issue for the Clang compiler. Also, one can spot the high `ipBranch` and `IpMispredict` metrics. For Clang compilation, every fifth instruction is a branch and one of every ~35 branches gets mispredicted. There are almost no FP or vector instructions, but this is not surprising. Conclusion: Clang has a large codebase, flat profile, many small functions, and "branchy" code; performance is affected by data cache and TLB misses, and branch mispredictions. -* __CloverLeaf__. As before, we start with analyzing instructions and core cycles. The amount of work done by P- and E-cores is roughly the same, but it takes P-cores more time to do this work, resulting in a lower IPC of one logical thread on P-core compared to one physical E-core.[^2] The `L*MPKI` metrics is high, especially the number of L3 misses per kilo instructions. The load miss latency (`LdMissLat`) is off the charts, suggesting an extremely high price of the average cache miss. Next, we take a look at the `DRAM BW use` metric and see that memory bandwidth is fully saturated. That's the problem: all the cores in the system share the same memory bus, so they compete for access to main memory, which effectively stalls the execution. CPUs are undersupplied with the data that they demand. Going further, we can see that CloverLeaf does not suffer from mispredictions or function call overhead. The instruction mix is dominated by FP double-precision scalar operations with some parts of the code being vectorized. Conclusion: multi-threaded CloverLeaf is bound by memory bandwidth. +* __CloverLeaf__. As before, we start with analyzing instructions and core cycles. The amount of work done by P- and E-cores is roughly the same, but it takes P-cores more time to do this work, resulting in a lower IPC of one logical thread on P-core compared to one physical E-core.[^2] The `L*MPKI` metrics are high, especially the number of L3 misses per kilo instructions. The load miss latency (`LdMissLat`) is off the charts, suggesting an extremely high price of the average cache miss. Next, we take a look at the `DRAM BW use` metric and see that memory bandwidth is fully saturated. That's the problem: all the cores in the system share the same memory bus, so they compete for access to the main memory, which effectively stalls the execution. CPUs are undersupplied with the data that they demand. Going further, we can see that CloverLeaf does not suffer from mispredictions or function call overhead. 
The instruction mix is dominated by FP double-precision scalar operations with some parts of the code being vectorized. Conclusion: multi-threaded CloverLeaf is bound by memory bandwidth. -------------------------------------------------------------------------- -Metric Core Blender Stockfish Clang15- CloverLeaf -Name Type selfbuild +Metric           Core        Blender     Stockfish   Clang15-   CloverLeaf +Name             Type                                selfbuild ---------------- ----------- ----------- ----------- ---------- ---------- -Instructions P-core 6.02E+12 6.59E+11 2.40E+13 1.06E+12 +Instructions     P-core      6.02E+12    6.59E+11    2.40E+13   1.06E+12 -Core Cycles P-core 4.31E+12 3.65E+11 3.78E+13 5.25E+12 +Core Cycles      P-core      4.31E+12    3.65E+11    3.78E+13   5.25E+12 -IPC P-core 1.40 1.80 0.64 0.20 +IPC              P-core      1.40        1.80        0.64       0.20 -CPI P-core 0.72 0.55 1.57 4.96 +CPI              P-core      0.72        0.55        1.57       4.96 -Instructions E-core 4.97E+12 0 1.43E+13 1.11E+12 +Instructions     E-core      4.97E+12    0           1.43E+13   1.11E+12 -Core Cycles E-core 3.73E+12 0 3.19E+13 4.28E+12 +Core Cycles      E-core      3.73E+12    0           3.19E+13   4.28E+12 -IPC E-core 1.33 0 0.45 0.26 +IPC              E-core      1.33        0           0.45       0.26 -CPI E-core 0.75 0 2.23 3.85 +CPI              E-core      0.75        0           2.23       3.85 -L1MPKI P-core 3.88 21.38 6.01 13.44 +L1MPKI           P-core      3.88        21.38       6.01       13.44 -L2MPKI P-core 0.15 1.67 1.09 3.58 +L2MPKI           P-core      0.15        1.67        1.09       3.58 -L3MPKI P-core 0.04 0.14 0.56 3.43 +L3MPKI           P-core      0.04        0.14        0.56       3.43 -Br. Misp. Ratio E-core 0.02 0.08 0.03 0.01 +Br. Misp. 
Ratio  E-core      0.02        0.08        0.03       0.01 -Code stlb MPKI P-core 0 0.01 0.35 0.01 +Code stlb MPKI   P-core      0           0.01        0.35       0.01 -Ld stlb MPKI P-core 0.08 0.04 0.51 0.03 +Ld stlb MPKI     P-core      0.08        0.04        0.51       0.03 -St stlb MPKI P-core 0 0.01 0.06 0.1 - -LdMissLat (Clk) P-core 12.92 10.37 76.7 253.89 +St stlb MPKI     P-core      0           0.01        0.06       0.1 -ILP P-core 3.67 3.65 2.93 2.53 +LdMissLat (Clk)  P-core      12.92       10.37       76.7       253.89 -MLP P-core 1.61 2.62 1.57 2.78 +ILP              P-core      3.67        3.65        2.93       2.53 -DRAM BW (GB/s) All 1.58 1.42 10.67 24.57 - -IpCall All 176.8 153.5 40.9 2,729 +MLP              P-core      1.61        2.62        1.57       2.78 -IpBranch All 9.8 10.1 5.1 18.8 +DRAM BW (GB/s)   All         1.58        1.42        10.67      24.57 -IpLoad All 3.2 3.3 3.6 2.7 +IpCall           All         176.8       153.5       40.9       2,729 -IpStore All 7.2 7.7 5.9 22.0 +IpBranch         All         9.8         10.1        5.1        18.8 -IpMispredict All 610.4 214.7 177.7 2,416 +IpLoad           All         3.2         3.3         3.6        2.7 -IpFLOP All 1.1 1.82E+06 286,348 1.8 +IpStore          All         7.2         7.7         5.9        22.0 -IpArith All 4.5 7.96E+06 268,637 2.1 +IpMispredict     All         610.4       214.7       177.7      2,416 -IpArith Scal SP All 22.9 4.07E+09 280,583 2.60E+09 +IpFLOP           All         1.1         1.82E+06    286,348    1.8 -IpArith Scal DP All 438.2 1.22E+07 4.65E+06 2.2 +IpArith          All         4.5         7.96E+06    268,637    2.1 -IpArith AVX128 All 6.9 0.0 1.09E+10 1.62E+09 +IpArith Scal SP  All         22.9        4.07E+09    280,583    2.60E+09 -IpArith AVX256 All 30.3 0.0 0.0 39.6 +IpArith Scal DP  All         438.2       1.22E+07    4.65E+06   2.2 -IpSWPF All 90.2 2,565 105,933 172,348 +IpArith AVX128   All         6.9         0.0         1.09E+10   1.62E+09 + +IpArith AVX256   All         30.3        0.0         0.0        39.6 + +IpSWPF           All         90.2        2,565       105,933    172,348 -------------------------------------------------------------------------- Table: Performance Metrics of Four Benchmarks. {#tbl:perf_metrics_case_study} -As you can see from this study, there is a lot one can learn about behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that you will need to collect performance profile, which we will introduce in later chapters. In Part 2 of this book we will discuss how to mitigate the performance issues we suspect take place in the four benchmarks that we have analyzed. +As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect take place in the four benchmarks that we have analyzed. -Keep in mind that the summary of performance metrics in Table {@tbl:perf_metrics_case_study} only tells you about the *average* behavior of a program. For example, we might be looking at CloverLeaf's IPC of `0.2`, while in reality it may never run with such an IPC, instead it may have 2 phases of equal duration, one running with IPC of `0.1`, and the second with IPC of `0.3`. 
Performance tools tackle this by reporting statistical data for each metric along with the average value. Usually, having min, max, 95th percentile, and variation (stdev/avg) is enough to understand the distribution. Also, some tools allow plotting the data, so you can see how the value for a certain metric changed during the program running time. As an example, Figure @fig:CloverMetricCharts shows the dynamics of IPC, L*MPKI, DRAM BW and average frequency for the CloverLeaf benchmark. The `pmu-tools` package can automatically build those charts once you add `--xlsx` and `--xchart` options. The `-I 10000` option aggregates collected samples with 10 second intervals.
+Keep in mind that the summary of performance metrics in Table {@tbl:perf_metrics_case_study} only tells you about the *average* behavior of a program. For example, we might be looking at CloverLeaf's IPC of `0.2`, while in reality it may never run with such an IPC; instead, it may have two phases of equal duration, one running with an IPC of `0.1` and the second with an IPC of `0.3`. Performance tools tackle this by reporting statistical data for each metric along with the average value. Usually, having min, max, 95th percentile, and variation (stdev/avg) is enough to understand the distribution. Also, some tools allow plotting the data, so you can see how the value for a certain metric changed during the program running time. As an example, Figure @fig:CloverMetricCharts shows the dynamics of IPC, L*MPKI, DRAM BW, and average frequency for the CloverLeaf benchmark. The `pmu-tools` package can automatically build those charts once you add the `--xlsx` and `--xchart` options. The `-I 10000` option aggregates collected samples with 10-second intervals.

```bash
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v --xlsx workload.xlsx --xchart -I 10000 -- ./clover_leaf
```
@@ -110,9 +110,9 @@ $ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v --xlsx workload.xlsx
```

-![Performance metrics charts for the CloverLeaf benchmark with 10 second intervals.](../../img/terms-and-metrics/CloverMetricCharts2.png){#fig:CloverMetricCharts width=100% }

-Even though the deviation from the average values reported in the summary is not very big, we can see that the workload is not stable. After looking at the IPC chart for P-core we can hypothesize that there are no distinct phases in the workload and the variation is caused by multiplexing between performance events (discussed in [@sec:counting]). Yet, this is only a hypothesis that needs to be confirmed or disproved. Possible ways to proceed would be to collect more data points by running collection with higher granularity (in our case it was 10 seconds). The chart that plots L*MPKI suggests that all three metrics hover around their average numbers without much deviation. The DRAM bandwidth utilization chart indicates that there are periods with varying pressure on the main memory. The last chart shows the average frequency of all CPU cores. As you may observe on this chart, throttling starts after the first 10 seconds. We recommend to be careful when drawing conclusions just from looking at the aggregate numbers since they may not be a good representation of the workload behavior.
+Even though the deviation from the average values reported in the summary is not very big, we can see that the workload is not stable.
After looking at the IPC chart for P-core we can hypothesize that there are no distinct phases in the workload and the variation is caused by multiplexing between performance events (discussed in [@sec:counting]). Yet, this is only a hypothesis that needs to be confirmed or disproved. Possible ways to proceed would be to collect more data points by running collection with higher granularity (in our case it was 10 seconds). The chart that plots L*MPKI suggests that all three metrics hover around their average numbers without much deviation. The DRAM bandwidth utilization chart indicates that there are periods with varying pressure on the main memory. The last chart shows the average frequency of all CPU cores. As you may observe on this chart, throttling starts after the first 10 seconds. We recommend being careful when drawing conclusions just from looking at the aggregate numbers since they may not be a good representation of the workload behavior. -In summary, performance metrics help you build a right mental model about what is and what is *not* happening in a program. Going further into analysis, this data will serve you well. +In summary, performance metrics help you build the right mental model about what is and what is *not* happening in a program. Going further into analysis, this data will serve you well. [^1]: pmu-tools - [https://github.com/andikleen/pmu-tools](https://github.com/andikleen/pmu-tools) -[^2]: A possible explanation to that is because CloverLeaf is very memory-bandwidth bound. All P- and E-cores are equally stalled waiting on memory. Because P-cores have higher frequency, they waste more CPU clocks than E-cores. \ No newline at end of file +[^2]: A possible explanation for that is because CloverLeaf is very memory-bandwidth bound. All P- and E-cores are equally stalled waiting on memory. Because P-cores have a higher frequency, they waste more CPU clocks than E-cores. \ No newline at end of file diff --git a/chapters/5-Performance-Analysis-Approaches/5-0 Performance analysis approaches.md b/chapters/5-Performance-Analysis-Approaches/5-0 Performance analysis approaches.md index b269039b13..3a6459077a 100644 --- a/chapters/5-Performance-Analysis-Approaches/5-0 Performance analysis approaches.md +++ b/chapters/5-Performance-Analysis-Approaches/5-0 Performance analysis approaches.md @@ -1,15 +1,13 @@ - - # Performance Analysis Approaches {#sec:sec_PerfApproaches} When you're working on a high-level optimization, e.g., integrating a better algorithm into an application, it is usually easy to tell whether the performance improves or not since the benchmarking results are pronounced well. Big speedups, like 2x, 3x, etc., are relatively easy from performance analysis perspective. When you eliminate an extensive computation from a program, you expect to see a visible difference in the running time. -But also, there are situations when you see a small change in the execution time, say 5%, and you have no clue where it's coming from. Timing or throughput measurements alone do not provide any explanation on why performance goes up or down. In this case, we need more insights about how a program executes. That is the situation when we need to do performance analysis to understand the underlying nature of the slowdown or speedup that we observe. +But also, there are situations when you see a small change in the execution time, say 5%, and you have no clue where it's coming from. 
Timing or throughput measurements alone do not provide any explanation for why performance goes up or down. In this case, we need more insights about how a program executes. That is the situation when we need to do performance analysis to understand the underlying nature of the slowdown or speedup that we observe. Performance analysis is akin to detective work. To solve a performance mystery, you need to gather all the data that you can and try to form a hypothesis. Once a hypothesis is made, you design an experiment that will either prove or disprove it. It can go back and forth a few times before you find a clue. And just like a good detective, you try to collect as many pieces of evidence as possible to confirm or refute your hypothesis. Once you have enough clues, you make a compelling explanation for the behavior you're observing. -When you just start working on a performance issue, you probably only have measurements, e.g., before and after the code change. Based on that measurements you conclude that the program became slower by `X` percent. If you know that the slowdown occurred right after a certain commit, that may already give you enough information to fix the problem. But if you don't have good reference points, then the set of possible reasons for the slowdown is endless, and you need to gather more data. One of the most popular approaches for collecting such data is to profile an application and look at the hotspots. This chapter introduces this and several other approaches for gathering data that have proven to be useful in performance engineering. +When you just start working on a performance issue, you probably only have measurements, e.g., before and after the code change. Based on those measurements you conclude that the program became slower by `X` percent. If you know that the slowdown occurred right after a certain commit, that may already give you enough information to fix the problem. But if you don't have good reference points, then the set of possible reasons for the slowdown is endless, and you need to gather more data. One of the most popular approaches for collecting such data is to profile an application and look at the hotspots. This chapter introduces this and several other approaches for gathering data that have proven to be useful in performance engineering. -The next question comes: "What performance data is available and how to collect it?" Both hardware and software layers of the stack have facilities to track performance events and record them while a program is running. In this context, by hardware, we mean the CPU, which executes the program, and by software, we mean the OS, libraries, the application itself, and other tools used for the analysis. Typically, the software stack provides high-level metrics like time, number of context switches, and page-faults, while CPU monitors cache misses, branch mispredictions, and other CPU-related events. Depending on the problem you are trying to solve, some metrics are more useful than others. So, it doesn't mean that hardware metrics will always give us a more precise overview of the program execution. Some metrics, like the number of context-switches, for instance, cannot be provided by a CPU. Performance analysis tools, like Linux Perf, can consume data from both the OS and the CPU. +The next question comes: "What performance data is available and how to collect it?" Both hardware and software layers of the stack have facilities to track performance events and record them while a program is running. 
In this context, by hardware, we mean the CPU, which executes the program, and by software, we mean the OS, libraries, the application itself, and other tools used for the analysis. Typically, the software stack provides high-level metrics like time, number of context switches, and page faults, while the CPU monitors cache misses, branch mispredictions, and other CPU-related events. Depending on the problem you are trying to solve, some metrics are more useful than others. So, it doesn't mean that hardware metrics will always give us a more precise overview of the program execution. Some metrics, like the number of context switches, for instance, cannot be provided by a CPU. Performance analysis tools, like Linux Perf, can consume data from both the OS and the CPU. As you have probably guessed, there are hundreds of data sources that a performance engineer may use. Since this book is about CPU low-level performance, we will focus on collecting hardware-level information. We will introduce some of the most popular performance analysis techniques: code instrumentation, tracing, Characterization, sampling, and the Roofline model. We also discuss static performance analysis techniques and compiler optimization reports that do not involve running the actual application. \ No newline at end of file diff --git a/chapters/5-Performance-Analysis-Approaches/5-1 Code instrumentation.md b/chapters/5-Performance-Analysis-Approaches/5-1 Code instrumentation.md index 87c9935387..0114426b1d 100644 --- a/chapters/5-Performance-Analysis-Approaches/5-1 Code instrumentation.md +++ b/chapters/5-Performance-Analysis-Approaches/5-1 Code instrumentation.md @@ -1,23 +1,21 @@ - - ## Code Instrumentation {#sec:secInstrumentation} -Probably the first approach for doing performance analysis ever invented is code *instrumentation*. It is a technique that inserts extra code into a program to collect specific runtime information. [@lst:CodeInstrumentation] shows the simplest example of inserting a `printf` statement at the beginning of a function to indicate when this function is called. After that you run the program and count the number of times you see "foo is called" in the output. Perhaps, every programmer in the world did this at some point of their career at least once. +Probably the first approach for doing performance analysis ever invented is code *instrumentation*. It is a technique that inserts extra code into a program to collect specific runtime information. [@lst:CodeInstrumentation] shows the simplest example of inserting a `printf` statement at the beginning of a function to indicate when this function is called. After that, you run the program and count the number of times you see "foo is called" in the output. Perhaps, every programmer in the world did this at some point in their career at least once. Listing: Code Instrumentation ~~~~ {#lst:CodeInstrumentation .cpp} int foo(int x) { + printf("foo is called\n"); - // function body... + // function body... } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The plus sign at the beginning of a line means that this line was added and is not present in the original code. In general, instrumentation code is not meant to be pushed into the codebase, it's rather for collecting the needed data and then can be thrown away. -A slightly more interesting example of code instrumentation is presented in [@lst:CodeInstrumentationHistogram]. 
In this made-up code example, the function `findObject` searches for the coordinates of an object with some properties `p` on a map. The function `findObj` returns the confidence level of locating the right object with the current coordinates `c`. If it is an exact match, we stop the search loop and return the coordinates. If the confidence is above the `threshold`, we choose to `zoomIn` to find more precise location of the object. Otherwise, we get the new coordinates within the `searchRadius` to try our search next time. +A slightly more interesting example of code instrumentation is presented in [@lst:CodeInstrumentationHistogram]. In this made-up code example, the function `findObject` searches for the coordinates of an object with some properties `p` on a map. The function `findObj` returns the confidence level of locating the right object with the current coordinates `c`. If it is an exact match, we stop the search loop and return the coordinates. If the confidence is above the `threshold`, we choose to `zoomIn` to find a more precise location of the object. Otherwise, we get the new coordinates within the `searchRadius` to try our search next time. -Instrumentation code consists of two classes: `histogram` and `incrementor`. The former keeps track of whatever variable values we are interested in and frequencies of their occurrence and then prints the histogram *after* the program finishes. The latter is just a helper class for pushing values into the `histogram` object. It is simple enough and can be adjusted to your specific needs quickly. I have a slightly more advanced version of this code which I usually copy-paste into whatever project I'm working on and then delete. +The instrumentation code consists of two classes: `histogram` and `incrementor`. The former keeps track of whatever variable values we are interested in and frequencies of their occurrence and then prints the histogram *after* the program finishes. The latter is just a helper class for pushing values into the `histogram` object. It is simple enough and can be adjusted to your specific needs quickly. I have a slightly more advanced version of this code which I usually copy-paste into whatever project I'm working on and then delete. Listing: Code Instrumentation @@ -59,49 +57,49 @@ Coords findObject(const ObjParams& p, Coords c, float searchRadius) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In this hypothetical scenario, we added instrumentation to know how frequently we `zoomIn` before we find an object. The variable `inc.tripCount` counts the number of iterations the loop runs before it exits, and the variable `inc.zoomCount` counts how many times we reduce the search radius. We always expect `inc.zoomCount` to be less or equal `inc.tripCount`. Here is the output you may observe after running the instrumented program: +In this hypothetical scenario, we added instrumentation to know how frequently we `zoomIn` before we find an object. The variable `inc.tripCount` counts the number of iterations the loop runs before it exits, and the variable `inc.zoomCount` counts how many times we reduce the search radius. We always expect `inc.zoomCount` to be less or equal to `inc.tripCount`. 
Here is the output you may observe after running the instrumented program:

```
-[7][6]: 2
-[7][5]: 6
-[7][4]: 20
-[7][3]: 156
-[7][2]: 967
-[7][1]: 3685
-[7][0]: 251004
-[6][5]: 2
-[6][4]: 7
-[6][3]: 39
-[6][2]: 300
-[6][1]: 1235
-[6][0]: 91731
-[5][4]: 9
-[5][3]: 32
-[5][2]: 160
-[5][1]: 764
-[5][0]: 34142
+[7][6]:  2
+[7][5]:  6
+[7][4]:  20
+[7][3]:  156
+[7][2]:  967
+[7][1]:  3685
+[7][0]:  251004
+[6][5]:  2
+[6][4]:  7
+[6][3]:  39
+[6][2]:  300
+[6][1]:  1235
+[6][0]:  91731
+[5][4]:  9
+[5][3]:  32
+[5][2]:  160
+[5][1]:  764
+[5][0]:  34142
...
```

-The first number in the square bracket is the trip count of the loop, and the second is the number of `zoomIn`s we made within the same loop. The number after the column sign is the number of occurrences of that particular combination of the numbers. For example, two times we observed 7 loop iterations and 6 `zoomIn`s, 251004 times the loop ran 7 iterations and no `zoomIn`s, and so on. You can then plot the data for better visualization, employ some other statistical methods, but the main point we can make is that `zoomIn`s are not frequent. There were a total of 10k `zoomIn` calls for the 400k times that `findObject` was called.
+The first number in square brackets is the trip count of the loop, and the second is the number of `zoomIn`s we made within the same loop. The number after the colon is the number of occurrences of that particular combination of numbers. For example, two times we observed 7 loop iterations and 6 `zoomIn`s; 251004 times the loop ran 7 iterations and no `zoomIn`s; and so on. You can then plot the data for better visualization, or employ some other statistical methods, but the main point we can make is that `zoomIn`s are not frequent. There were a total of 10k `zoomIn` calls for the 400k times that `findObject` was called.

-Later chapters of this book contain many examples of how such information can be used for data-driven optimizations. In our case, we conclude that `findObj` often fails to find the object. It means that the next iteration of the loop will try to find the object using new coordinates but still within the same search radius. Knowing that, we could attempt a number of optimizations: 1) run multiple searches in parallel, and synchronize if any of them succeeded; 2) precompute certain things for the current search region, thus eliminating repetitive work inside `findObj`; 3) write a software pipeline that calls `getNewCoords` to generate the next set of required coordinates and prefetch the corresponding map locations from memory. Part 2 of this book looks deeper into some of this techniques.
+Later chapters of this book contain many examples of how such information can be used for data-driven optimizations. In our case, we conclude that `findObj` often fails to find the object. It means that the next iteration of the loop will try to find the object using new coordinates but still within the same search radius. Knowing that, we could attempt a number of optimizations: 1) run multiple searches in parallel, and synchronize if any of them succeeded; 2) precompute certain things for the current search region, thus eliminating repetitive work inside `findObj`; 3) write a software pipeline that calls `getNewCoords` to generate the next set of required coordinates and prefetch the corresponding map locations from memory. Part 2 of this book looks deeper into some of these techniques.
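
For completeness, here is one way the `histogram` and `incrementor` helpers used in this example could look. It is only a minimal sketch: the names `histogram`, `incrementor`, `tripCount`, and `zoomCount` come from the example above, while the `std::map` storage, the `add` method, and the destructor-based bookkeeping are our assumptions, not necessarily how the original helpers are written.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>

// Accumulates how many times each (tripCount, zoomCount) pair was observed
// and prints the result after the program finishes.
class histogram {
public:
  void add(uint32_t tripCount, uint32_t zoomCount) {
    counts_[{tripCount, zoomCount}]++;
  }
  ~histogram() {
    for (const auto &[key, count] : counts_)
      printf("[%u][%u]: %llu\n", key.first, key.second,
             static_cast<unsigned long long>(count));
  }
private:
  std::map<std::pair<uint32_t, uint32_t>, uint64_t> counts_;
};

histogram hist; // file-scope object; its destructor prints the histogram at exit

// One instance per findObject call; the collected values are pushed into the
// histogram when the object goes out of scope.
struct incrementor {
  uint32_t tripCount = 0;
  uint32_t zoomCount = 0;
  ~incrementor() { hist.add(tripCount, zoomCount); }
};

// Minimal usage example standing in for calls to findObject.
int main() {
  for (int i = 0; i < 2; i++) {
    incrementor inc;
    inc.tripCount = 7; // pretend the search loop ran 7 times
    inc.zoomCount = 0; // and never zoomed in
  }
  return 0; // prints "[7][0]: 2"
}
```

Flushing the counters from the `incrementor` destructor keeps the instrumentation inside `findObject` down to a declaration and a couple of increments, which makes it easy to delete once the data is collected.
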
-Code instrumentation provides very detailed information when you need specific knowledge about the execution of the program. It allows us to track any information about every variable in the program. Using such a method often yields the best insight when optimizing big pieces of code because you can use a top-down approach (instrumenting the main function then drilling down to its callees) of locating performance issues. While code instrumentation is not very helpful in the case of small programs, it gives the most value and insight by letting developers observe the architecture and flow of an application. This technique is especially helpful for someone working with an unfamiliar codebase. +Code instrumentation provides very detailed information when you need specific knowledge about the execution of the program. It allows us to track any information about every variable in the program. Using such a method often yields the best insight when optimizing big pieces of code because you can use a top-down approach (instrumenting the main function and then drilling down to its callees) to locate performance issues. While code instrumentation is not very helpful in the case of small programs, it gives the most value and insight by letting developers observe the architecture and flow of an application. This technique is especially helpful for someone working with an unfamiliar codebase. -It's also worth mentioning that code instrumentation shines in complex systems with many different components that react differently based on inputs or over time. For example, in games, usually, there is a renderer thread, a physics thread, an animations thread, etc. Instrumenting such big modules helps to reasonably quickly understand what module is the source of issues. As sometimes, optimizing is not only a matter of optimizing code but also data. For example, rendering might be too slow because of uncompressed mesh, or physics might be too slow because of too many objects in a scene. +It's also worth mentioning that code instrumentation shines in complex systems with many different components that react differently based on inputs or over time. For example, in games, usually, there is a renderer thread, a physics thread, an animations thread, etc. Instrumenting such big modules helps to reasonably quickly understand what module is the source of issues. Sometimes, optimizing is not only a matter of optimizing code but also data. For example, rendering might be too slow because of uncompressed mesh. Or physics simulation might be too slow because of too many objects in a scene. -The instrumentation technique is heavily used in performance analysis of real-time scenarios, such as video games and embedded development. Some profilers combine instrumentation with other techniques such as tracing or sampling. We will look at one such hybrid profilers called Tracy in [@sec:Tracy]. +The instrumentation technique is heavily used in performance analysis of real-time scenarios, such as video games and embedded development. Some profilers combine instrumentation with other techniques such as tracing or sampling. We will look at one such hybrid profiler called Tracy in [@sec:Tracy]. While code instrumentation is powerful in many cases, it does not provide any information about how code executes from the OS or CPU perspective. For example, it can't give you information about how often the process was scheduled in and out of execution (known by the OS) or how many branch mispredictions occurred (known by the CPU). 
Instrumented code is a part of an application and has the same privileges as the application itself. It runs in userspace and doesn't have access to the kernel. -A more important downside of this technique is that every time something new needs to be instrumented, say another variable, recompilation is required. This can become a burden and increase analysis time. Unfortunately, there are additional downsides. Since usually, you care about hot paths in the application, you're instrumenting the things that reside in the performance-critical part of the code. Injecting instrumentation code in a hot path might easily result in a 2x slowdown of the overall benchmark. Remember not to benchmark an instrumented program. Because by instrumenting the code, you change the behavior of the program, so you might not see the same effects you saw earlier. +A more important downside of this technique is that every time something new needs to be instrumented, say another variable, recompilation is required. This can become a burden and increase analysis time. Unfortunately, there are additional downsides. Since usually, you care about hot paths in the application, you're instrumenting the things that reside in the performance-critical part of the code. Injecting instrumentation code in a hot path might easily result in a 2x slowdown of the overall benchmark. Remember not to benchmark an instrumented program. By instrumenting the code, you change the behavior of the program, so you might not see the same effects you saw earlier. -All of the above increases time between experiments and consumes more development time, which is why engineers don't manually instrument their code very often these days. However, automated code instrumentation is still widely used by compilers. Compilers are capable of automatically instrumenting an entire program (except third-party libraries) to collect interesting statistics about the execution. The most widely known use cases for automated instrumentation are code coverage analysis and Profile Guided Optimizations (see [@sec:secPGO]). +All of the above increases the time between experiments and consumes more development time, which is why engineers don't manually instrument their code very often these days. However, automated code instrumentation is still widely used by compilers. Compilers are capable of automatically instrumenting an entire program (except third-party libraries) to collect interesting statistics about the execution. The most widely known use cases for automated instrumentation are code coverage analysis and Profile-Guided Optimization (see [@sec:secPGO]). When talking about instrumentation, it's important to mention binary instrumentation techniques. The idea behind binary instrumentation is similar but it is done on an already-built executable file rather than on source code. There are two types of binary instrumentation: static (done ahead of time) and dynamic (instrumented code is inserted on-demand as a program executes). The main advantage of dynamic binary instrumentation is that it does not require program recompilation and relinking. Also, with dynamic instrumentation, one can limit the amount of instrumentation to only interesting code regions, instead of instrumenting the entire program. -Binary instrumentation is very useful in performance analysis and debugging. One of the most popular tools for binary instrumentation is the Intel Pin[^1] tool. 
Pin intercepts the execution of a program at the occurrence of an interesting event and generates new instrumented code starting at this point in the program. This enables the collecting of various runtime information. One of the most popular tools that is built on top of Pin is Intel SDE (Software Development Emulator),[^2] that enables collecting: +Binary instrumentation is very useful in performance analysis and debugging. One of the most popular tools for binary instrumentation is the Intel Pin[^1] tool. Pin intercepts the execution of a program at the occurrence of an interesting event and generates new instrumented code starting at this point in the program. This enables the collection of various runtime information. One of the most popular tools that is built on top of Pin is Intel SDE (Software Development Emulator),[^2] that enables collecting: * instruction count and function call counts, * instruction mix analysis, diff --git a/chapters/5-Performance-Analysis-Approaches/5-2 Tracing.md b/chapters/5-Performance-Analysis-Approaches/5-2 Tracing.md index d7b9a9f47f..1dba262361 100644 --- a/chapters/5-Performance-Analysis-Approaches/5-2 Tracing.md +++ b/chapters/5-Performance-Analysis-Approaches/5-2 Tracing.md @@ -1,5 +1,3 @@ - - ## Tracing Tracing is conceptually very similar to instrumentation yet is slightly different. Code instrumentation assumes that the user has full access to the source code of their application. On the other hand, tracing relies on the existing instrumentation. For example, the `strace` tool enables us to trace system calls and can be thought of as the instrumentation of the Linux kernel. Intel Processor Traces (see Appendix D) enables you to log instructions executed by a program and can be thought of as instrumentation of a CPU. Traces can be obtained from components that were appropriately instrumented in advance and are not subject to change. Tracing is often used as a black-box approach, where a user cannot modify the code of an application, yet they want to get insights into what the program is doing behind the scenes. @@ -7,7 +5,7 @@ Tracing is conceptually very similar to instrumentation yet is slightly differen An example of tracing system calls with the Linux `strace` tool is provided in [@lst:strace], which shows the first several lines of output when running the `git status` command. By tracing system calls with `strace` it's possible to know the timestamp for each system call (the leftmost column), its exit status, and the duration of each system call (in the angle brackets). Listing: Tracing system calls with strace. - + ~~~~ {#lst:strace .bash} $ strace -tt -T -- git status 17:46:16.798861 execve("/usr/bin/git", ["git", "status"], 0x7ffe705dcd78 @@ -27,12 +25,11 @@ $ strace -tt -T -- git status ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The overhead of tracing very much depends on what exactly we try to trace. For example, if we trace a program that almost never makes system calls, the overhead of running it under `strace` will be close to zero. On the other hand, if we trace a program that heavily relies on system calls, the overhead could be very large, e.g., 100x.[^1] Also, tracing can generate a massive amount of data since it doesn't skip any sample. To compensate for this, tracing tools -provide filters that enable you to restrict data collection to a specific time slice or for a specific section of code. +The overhead of tracing very much depends on what exactly we try to trace. 
For example, if we trace a program that rarely makes system calls, the overhead of running it under `strace` will be close to zero. On the other hand, if we trace a program that heavily relies on system calls, the overhead could be very large, e.g., 100x.[^1] Also, tracing can generate a massive amount of data since it doesn't skip any sample. To compensate for this, tracing tools provide filters that enable you to restrict data collection to a specific time slice or for a specific section of code. -Similar to instrumentation, tracing can be used for exploring anomalies in a system. For example, you may want to determine what was going on in an application during a 10s period of unresponsiveness. As you will see later, sampling methods are not designed for this, but with tracing, you can see what lead to the program being unresponsive. For example, with Intel PT, you can reconstruct the control flow of the program and know exactly what instructions were executed. +Similar to instrumentation, tracing can be used for exploring anomalies in a system. For example, you may want to determine what was going on in an application during a 10s period of unresponsiveness. As you will see later, sampling methods are not designed for this, but with tracing, you can see what leads to the program being unresponsive. For example, with Intel PT, you can reconstruct the control flow of the program and know exactly what instructions were executed. -Tracing is also very useful for debugging. Its underlying nature enables "record and replay" use cases based on recorded traces. One such tool is the Mozilla [rr](https://rr-project.org/)[^2] debugger, which performs record and replay of processes, supports backwards single stepping and much more. Most tracing tools are capable of decorating events with timestamps, which enables us to find correlations with external events that were happening during that time. That is, when we observe a glitch in a program, we can take a look at the traces of our application and correlate this glitch with what was happening in the whole system during that time. +Tracing is also very useful for debugging. Its underlying nature enables "record and replay" use cases based on recorded traces. One such tool is the Mozilla [rr](https://rr-project.org/)[^2] debugger, which performs record and replay of processes, supports backward single stepping, and much more. Most tracing tools are capable of decorating events with timestamps, which enables us to find correlations with external events that were happening during that time. That is, when we observe a glitch in a program, we can take a look at the traces of our application and correlate this glitch with what was happening in the whole system during that time. [^1]: An article about `strace` by B. Gregg - [http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html](http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html) diff --git a/chapters/5-Performance-Analysis-Approaches/5-3 Characterization.md b/chapters/5-Performance-Analysis-Approaches/5-3 Characterization.md index ab015c9fa1..06e9d4c414 100644 --- a/chapters/5-Performance-Analysis-Approaches/5-3 Characterization.md +++ b/chapters/5-Performance-Analysis-Approaches/5-3 Characterization.md @@ -1,12 +1,10 @@ - - ## Workload Characterization {#sec:counting} -Workload characterization is a process of describing a workload by means of quantitative parameters and functions. 
In simple words, it means counting total number of certain performance monitoring events. The goal of characterization is to define the behavior of the workload and extract its most important features. At a high level, an application can belong to one or many types: interactive, database, real-time, network-based, massively parallel, etc. Different workloads can be characterized using different metrics and parameters to address a particular application domain.
+Workload characterization is a process of describing a workload by means of quantitative parameters and functions. In simple words, it means counting the total number of certain performance monitoring events. The goal of characterization is to define the behavior of the workload and extract its most important features. At a high level, an application can belong to one or many types: interactive, database, real-time, network-based, massively parallel, etc. Different workloads can be characterized using different metrics and parameters to address a particular application domain.

-This is a book about low-level performance, remember? So, we will focus on extracting features related to the CPU and memory performance. The best example of a workload characterization from a CPU perspective is Top-down Microarchitecture Analysis (TMA) methodology, which we will closely look at in [@sec:TMA]. TMA attempts to characterize an application by putting it into one of 4 buckets: CPU Front End, CPU Back End, Retiring, and Bad Speculation, depending on what is causing the most significant performance bottleneck. TMA uses Performance Monitoring Counters (PMCs) to collect the needed information and identify the inefficient use of CPU microarchitecture.
+This is a book about low-level performance, remember? So, we will focus on extracting features related to the CPU and memory performance. The best example of a workload characterization from a CPU perspective is the Top-down Microarchitecture Analysis (TMA) methodology, which we will closely look at in [@sec:TMA]. TMA attempts to characterize an application by putting it into one of 4 buckets: CPU Front End, CPU Back End, Retiring, and Bad Speculation, depending on what is causing the most significant performance bottleneck. TMA uses Performance Monitoring Counters (PMCs) to collect the needed information and identify the inefficient use of CPU microarchitecture.

-But even without a fully fledged characterization methodology, collecting total number of certain performance events can be helpful. We hope that the case studies in the previous chapter that examined performance metrics of four different benchmarks, proved that. PMCs are a very important instrument of low-level performance analysis. They can provide unique information about the execution of a program. PMCs are generally used in two modes: "Counting" and "Sampling". Counting mode is used for workload characterization, while sampling mode is used for finding hotspots, which we will discuss soon.
+But even without a fully-fledged characterization methodology, collecting the total number of certain performance events can be helpful. We hope that the case study in the previous chapter, which examined the performance metrics of four different benchmarks, proved that. PMCs are a very important instrument of low-level performance analysis. They can provide unique information about the execution of a program. PMCs are generally used in two modes: "Counting" and "Sampling".
The counting mode is used for workload characterization, while the sampling mode is used for finding hotspots, which we will discuss soon. ### Counting Performance Monitoring Events @@ -14,7 +12,7 @@ The idea behind counting is very simple: we want to count the total number of ce ![Counting performance events.](../../img/perf-analysis/CountingFlow.png){#fig:Counting width=80%} -The steps outlined in Figure @fig:Counting roughly represent what a typical analysis tool will do to count performance events. This process is implemented in the `perf stat` tool, which can be used to count various hardware events, like the number of instructions, cycles, cache-misses, etc. Below is an example of the output from `perf stat`: +The steps outlined in Figure @fig:Counting roughly represent what a typical analysis tool will do to count performance events. This process is implemented in the `perf stat` tool, which can be used to count various hardware events, like the number of instructions, cycles, cache misses, etc. Below is an example of the output from `perf stat`: ```bash $ perf stat -- ./my_program.exe @@ -24,11 +22,11 @@ $ perf stat -- ./my_program.exe 239298395 branch-misses # 7,96% of all branches ``` -It is very informative to know this data. First of all, it enables us to quickly spot some anomalies, such as a high branch misprediction rate or low IPC. In addition, it might come in handy when you've made a code change and you want to verify that the change has improved performance. Looking at relevant numbers might help you justify or reject the code change. The `perf stat` utility can be used as a lightweight benchmark wrapper. Since the overhead of counting events is minimal, almost all benchmarks can be automatically ran under `perf stat`. It serves as a first step in performance investigation. Sometimes anomalies can be spotted right away, which can save you some analysis time. +It is very informative to know this data. First of all, it enables us to quickly spot some anomalies, such as a high branch misprediction rate or low IPC. In addition, it might come in handy when you've made a code change and you want to verify that the change has improved performance. Looking at relevant numbers might help you justify or reject the code change. The `perf stat` utility can be used as a lightweight benchmark wrapper. Since the overhead of counting events is minimal, almost all benchmarks can be automatically run under `perf stat`. It serves as a first step in performance investigation. Sometimes anomalies can be spotted right away, which can save you some analysis time. Modern CPUs have hundreds of observable performance events. It's very hard to remember all of them and their meanings. Understanding when to use a particular event is even harder. That is why generally, we don't recommend manually collecting a specific event unless you really know what you are doing. Instead, we recommend using tools like Intel VTune Profiler that automate this process. Nevertheless, there are situations when you are interested in collecting a set of specific performance events. -A complete list of performance events for all Intel CPU generations can be found in [@IntelOptimizationManual, Volume 3B, Chapter 19]. A description is also available online at [perfmon-events.intel.com](https://perfmon-events.intel.com/). Every event is encoded with `Event` and `Umask` hexadecimal values. Sometimes performance events can also be encoded with additional parameters, like `Cmask`, `Inv` and others. 
An example of encoding two performance events for the Intel Skylake microarchitecture is shown in Table {@tbl:perf_count}. +A complete list of performance events for all Intel CPU generations can be found in [@IntelOptimizationManual, Volume 3B, Chapter 19]. A description is also available online at [perfmon-events.intel.com](https://perfmon-events.intel.com/). Every event is encoded with `Event` and `Umask` hexadecimal values. Sometimes performance events can also be encoded with additional parameters, like `Cmask`, `Inv`, and others. An example of encoding two performance events for the Intel Skylake microarchitecture is shown in Table {@tbl:perf_count}. -------------------------------------------------------------------------- Event Umask Event Mask Description @@ -59,7 +57,7 @@ cache: ... ``` -Linux `perf` provide mappings for the majority of performance events for every popular CPU architecture. For instance, mem_load_retired.l1_hit provides a mapping for `MEM_LOAD_RETIRED.L1_HIT`. If the PMC you are looking for doesn't have a mapping in the Linux perf list of supported events, it can be collected with the following syntax: +Linux `perf` provides mappings for the majority of performance events for every popular CPU architecture. For instance, mem_load_retired.l1_hit provides a mapping for `MEM_LOAD_RETIRED.L1_HIT`. If the PMC you are looking for doesn't have a mapping in the Linux perf list of supported events, it can be collected with the following syntax: ```bash $ perf stat -e cpu/event=0xc4,umask=0x0,name=BR_INST_RETIRED.ALL_BRANCHES/ -- ./a.exe @@ -69,7 +67,7 @@ Performance events are not available in every environment since accessing PMCs r ### Multiplexing and Scaling Events {#sec:secMultiplex} -There are situations when we want to count many different events at the same time. But with only one counter, it's possible to count only one thing at a time. That's why PMUs contain multiple counters (in Intel's recent Goldencove microarchitecture there are 12 programmable PMCs, 6 per hardware thread). Even then, the number of fixed and programmable counter is not always sufficient. Top-down Microarchitecture Analysis (TMA) methodology requires collecting up to 100 different performance events in a single execution of a program. Modern CPUs don't have that many counters, and here is when multiplexing comes into play. +There are situations when we want to count many different events at the same time. But with only one counter, it's possible to count only one thing at a time. That's why PMUs contain multiple counters (in Intel's recent Goldencove microarchitecture there are 12 programmable PMCs, 6 per hardware thread). Even then, the number of fixed and programmable counters is not always sufficient. Top-down Microarchitecture Analysis (TMA) methodology requires collecting up to 100 different performance events in a single execution of a program. Modern CPUs don't have that many counters, and here is when multiplexing comes into play. If you need to collect more events than the number of available PMCs, the analysis tool uses time multiplexing to give each event a chance to access the monitoring hardware. Figure @fig:Multiplexing1 shows an example of multiplexing between 8 performance events with only 4 counters available. @@ -81,7 +79,7 @@ If you need to collect more events than the number of available PMCs, the analys Multiplexing between 8 performance events with only 4 PMCs available. -With multiplexing, an event is not measured all the time, but rather only during a portion of time. 
At the end of the run, a profiling tool needs to scale the raw count based on total time enabled:
+With multiplexing, an event is not measured all the time, but rather only during a portion of time. At the end of the run, a profiling tool needs to scale the raw count based on the total time enabled:

$$ final~count = raw~count \times ( time~enabled / time~running ) $$

diff --git a/chapters/5-Performance-Analysis-Approaches/5-4 Marker APIs.md b/chapters/5-Performance-Analysis-Approaches/5-4 Marker APIs.md
index eccd71bdb5..22ad59cd63 100644
--- a/chapters/5-Performance-Analysis-Approaches/5-4 Marker APIs.md
+++ b/chapters/5-Performance-Analysis-Approaches/5-4 Marker APIs.md
@@ -1,14 +1,14 @@
### Using Marker APIs {#sec:MarkerAPI}

-In certain scenarios, we might be interested in analyzing performance of a specific code region, not an entire application. This can be a situation when you're developing a new piece of code and want to focus just on that code. Naturally, you would like to track optimization progress and capture additional performance data that will help you along the way. Most performance analysis tools provide specific *marker APIs* that let you do that. Here are a few examples:
+In certain scenarios, we might be interested in analyzing the performance of a specific code region, not an entire application. This can be a situation when you're developing a new piece of code and want to focus just on that code. Naturally, you would like to track optimization progress and capture additional performance data that will help you along the way. Most performance analysis tools provide specific *marker APIs* that let you do that. Here are a few examples:

* Likwid has `LIKWID_MARKER_START / LIKWID_MARKER_STOP` macros.
* Intel VTune has `__itt_task_begin / __itt_task_end` functions.
* AMD uProf has `amdProfileResume / amdProfilePause` functions.

-Such a hybrid approach combines benefits of instrumentation and performance events counting. Instead of measuring the whole program, marker APIs allow us to attribute performance statistics to code regions (loops, functions) or functional pieces (remote procedure calls (RPCs), input events, etc.). The quality of the data you get back can easily justify the effort. For example, while investigating a performance bug that happens only with a specific type of RPCs, you can enable monitoring just for that type of RPC.
+Such a hybrid approach combines the benefits of instrumentation and performance event counting. Instead of measuring the whole program, marker APIs allow us to attribute performance statistics to code regions (loops, functions) or functional pieces (remote procedure calls (RPCs), input events, etc.). The quality of the data you get back can easily justify the effort. For example, while investigating a performance bug that happens only with a specific type of RPC, you can enable monitoring just for that type of RPC.

-Below we provide a very basic example of using [libpfm4](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/),[^1] one of the popular Linux libraries for collecting performance monitoring events. It is built on top of the Linux `perf_events` subsystem, which lets you access performance event counters directly. The `perf_events` subsystem is rather low-level, so the `libfm4` package is useful here, as it adds both a discovery tool for identifying available events on your CPU, and a wrapper library around the raw `perf_event_open` system call. 
[@lst:LibpfmMarkerAPI] shows how one can use `libpfm4` to instrument the `render` function of the [C-Ray](https://openbenchmarking.org/test/pts/c-ray)[^2] benchmark. +Below we provide a very basic example of using [libpfm4](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/),[^1] one of the popular Linux libraries for collecting performance monitoring events. It is built on top of the Linux `perf_events` subsystem, which lets you access performance event counters directly. The `perf_events` subsystem is rather low-level, so the `libfm4` package is useful here, as it adds both a discovery tool for identifying available events on your CPU and a wrapper library around the raw `perf_event_open` system call. [@lst:LibpfmMarkerAPI] shows how one can use `libpfm4` to instrument the `render` function of the [C-Ray](https://openbenchmarking.org/test/pts/c-ray)[^2] benchmark. Listing: Using libpfm4 marker API on the C-Ray benchmark @@ -74,7 +74,7 @@ void render(int xsz, int ysz, uint32_t *fb, int samples) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In this code example, we first initialize the `libpfm` library and then configure performance events and the format that we will use to read them. In the C-Ray benchmark, the `render` function is only called once. In your own code, be careful about not doing `libpfm` initialization multiple times. Then, we choose the code region we want to analyze, in our case it is a loop with a `trace` function call inside. We surround this code region with two `read` system calls that will capture values of performance counters before and after the loop. Next, we save the deltas for later processing, in this case, we aggregated (code is not shown) it by calculating average, 90th percentile and maximum values. Running it on an Intel Alderlake-based machine, we've got the output shown below. Root privileges are not required, but `/proc/sys/kernel/perf_event_paranoid` should be set to less than 1. When reading counters from inside a thread, the values are for that thread alone. It can optionally include kernel code that ran and was attributed to the thread. +In this code example, we first initialize the `libpfm` library and then configure performance events and the format that we will use to read them. In the C-Ray benchmark, the `render` function is only called once. In your own code, be careful about not doing `libpfm` initialization multiple times. Then, we choose the code region we want to analyze, in our case it is a loop with a `trace` function call inside. We surround this code region with two `read` system calls that will capture values of performance counters before and after the loop. Next, we save the deltas for later processing, in this case, we aggregated (code is not shown) it by calculating average, 90th percentile, and maximum values. Running it on an Intel Alderlake-based machine, we've got the output shown below. Root privileges are not required, but `/proc/sys/kernel/perf_event_paranoid` should be set to less than 1. When reading counters from inside a thread, the values are for that thread alone. It can optionally include kernel code that ran and was attributed to the thread. ```bash $ ./c-ray-f -s 1024x768 -r 2 -i sphfract -o output.ppm @@ -90,25 +90,25 @@ branch-misses | 18 | 35 | 146 Remember, that our instrumentation measures the per-pixel ray tracing stats. Multiplying average numbers by the number of pixels (`1024x768`) should give us roughly the total stats for the program. 
A good sanity check in this case is to run `perf stat` and compare the overall C-Ray statistics for the performance events that we've collected.

-The C-ray benchmark primarily stresses the floating-point performance of a CPU core, which generally should not cause high variance in the measurements, in other words, we expect all the measurements to be very close to each other. However, we see that it's not the case, as p90 values are 1.33x average numbers and max is sometimes 5x slower than the average case. The most likely explanation here is that for some pixels the algorithm hits a corner case, executes more instructions and subsequently runs longer. But it's always good to confirm a hypothesis by studying the source code or extending the instrumentation to capture more data for the "slow" pixels.
+The C-Ray benchmark primarily stresses the floating-point performance of a CPU core, which generally should not cause high variance in the measurements; in other words, we expect all the measurements to be very close to each other. However, we see that this is not the case: p90 values are 1.33x the average, and the maximum is sometimes 5x slower than the average case. The most likely explanation is that for some pixels the algorithm hits a corner case, executes more instructions, and subsequently runs longer. But it's always good to confirm a hypothesis by studying the source code or extending the instrumentation to capture more data for the "slow" pixels.

-The additional instrumentation code showed in [@lst:LibpfmMarkerAPI] causes 17% overhead, which is OK for local experiments, but quite high to run in production. Most large distributed systems aim for less than 1% overhead, and for some up to 5% can be tolerable, but it's unlikely that users would be happy with 17% slowdown. Managing the overhead of your instrumentation is critical, especially if you choose to enable it in a production environment.
+The additional instrumentation code shown in [@lst:LibpfmMarkerAPI] causes 17% overhead, which is OK for local experiments, but quite high to run in production. Most large distributed systems aim for less than 1% overhead; for some, up to 5% can be tolerable, but it's unlikely that users would be happy with a 17% slowdown. Managing the overhead of your instrumentation is critical, especially if you choose to enable it in a production environment.

-Overhead is usefully calculated as occurrence rate per unit of time or work (RPC, database query, loop iteration, etc.). If a read system call on our system takes roughly 1.6 microseconds of CPU time, and we call it twice for each pixel (iteration of the outer loop), the overhead is 3.2 microseconds of CPU time per pixel.
+Overhead is usefully calculated as the occurrence rate per unit of time or work (RPC, database query, loop iteration, etc.). If a `read` system call on our system takes roughly 1.6 microseconds of CPU time, and we call it twice for each pixel (iteration of the outer loop), the overhead is 3.2 microseconds of CPU time per pixel.

-There are many strategies to bring the overhead down. As a general rule, your instrumentation should always have a fixed cost, e.g., a deterministic syscall, but not a list traversal or dynamic memory allocation, otherwise it will interfere with the measurements. The instumentation code has three logical parts: collecting the information, storing it, and reporting it. To lower the overhead of the first part (collection), we can decrease the sampling rate, e.g., sample each 10th RPC and skip the rest. 
For a long-running application, performance can be monitored with a relatively cheap random sampling, i.e., randomly select which events to observe for each sample. These methods sacrifice collection accuracy but still provide a good estimate of the overall performance characteristics while incurring a very low overhead.
+There are many strategies to bring the overhead down. As a general rule, your instrumentation should always have a fixed cost, e.g., a deterministic syscall, but not a list traversal or dynamic memory allocation; otherwise, it will interfere with the measurements. The instrumentation code has three logical parts: collecting the information, storing it, and reporting it. To lower the overhead of the first part (collection), we can decrease the sampling rate, e.g., sample every 10th RPC and skip the rest. For a long-running application, performance can be monitored with relatively cheap random sampling, i.e., by randomly selecting which events to observe for each sample. These methods sacrifice collection accuracy but still provide a good estimate of the overall performance characteristics while incurring a very low overhead.

-For the second and third parts (storing and aggregating), the recommendation is to collect, processes, and retain only as much data as you need to understand the performance of the system. You can avoid storing every sample in memory by using "online" algorithms for calculating mean, variance, min, max and other metrics. This will drastically reduce the memory footprint of the instrumentation. For instance, variance and standard deviation can be calculated using Knuth's online-variance algorithm. A good implementation[^3] uses less than 50 bytes of memory.
+For the second and third parts (storing and aggregating), the recommendation is to collect, process, and retain only as much data as you need to understand the performance of the system. You can avoid storing every sample in memory by using "online" algorithms for calculating the mean, variance, min, max, and other metrics. This will drastically reduce the memory footprint of the instrumentation. For instance, variance and standard deviation can be calculated using Knuth's online-variance algorithm. A good implementation[^3] uses less than 50 bytes of memory.

-For long routines, you can collect counters at the beginning, end, and some parts in the middle. Over consequtive runs, you can binary search for the part of the routine that performs poorest and optimize it. Repeat this until all the poorly performing spots are removed. If tail latency is of a primary concern, emitting log messages on a particularly slow run can provide useful insights.
+For long routines, you can collect counters at the beginning, at the end, and at some points in the middle. Over consecutive runs, you can binary-search for the part of the routine that performs worst and optimize it. Repeat this until all the poorly performing spots are removed. If tail latency is of primary concern, emitting log messages on a particularly slow run can provide useful insights.

In [@lst:LibpfmMarkerAPI], we collected 4 events simultaneously, though the CPU has 6 programmable counters. You can open up additional groups with different sets of events enabled. The kernel will select different groups to run at a time. The `time_enabled` and `time_running` fields indicate the multiplexing. They both indicate duration in nanoseconds. The `time_enabled` field indicates how many nanoseconds the event group has been enabled. The `time_running` field indicates how much of that enabled time the events have been collected. If you had two event groups enabled simultaneously that couldn't fit together on the hardware counters, you might see the running time for both groups converge to `time_running = 0.5 * time_enabled`.

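To make this bookkeeping concrete, here is a minimal, illustrative sketch that goes one level below `libpfm4` and talks to `perf_events` directly (this is not the code from [@lst:LibpfmMarkerAPI]; error handling is omitted and the event choice is arbitrary). It opens a two-event group on the calling thread, reads it atomically with a single `read`, and uses `time_enabled`/`time_running` to correct for multiplexing, following Linux perf's convention of scaling raw counts up by `time_enabled / time_running`:

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one hardware event; if group_fd == -1 it becomes the group leader.
static int open_event(uint64_t config, int group_fd) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = config;
  attr.disabled = (group_fd == -1);   // only the leader starts disabled
  attr.exclude_kernel = 1;
  attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED |
                     PERF_FORMAT_TOTAL_TIME_RUNNING;
  return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/,
                      -1 /*any cpu*/, group_fd, 0);
}

int main(void) {
  int leader = open_event(PERF_COUNT_HW_CPU_CYCLES, -1);
  open_event(PERF_COUNT_HW_INSTRUCTIONS, leader);

  ioctl(leader, PERF_EVENT_IOC_RESET,   PERF_IOC_FLAG_GROUP);
  ioctl(leader, PERF_EVENT_IOC_ENABLE,  PERF_IOC_FLAG_GROUP);
  // ... code region we want to measure ...
  ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

  // With PERF_FORMAT_GROUP the layout is: nr, time_enabled, time_running,
  // then one value per event, all returned atomically by a single read().
  struct { uint64_t nr, time_enabled, time_running, value[2]; } data;
  read(leader, &data, sizeof(data));

  // Scale raw counts up to the full enabled time, as Linux perf does.
  double scale  = data.time_running ? (double)data.time_enabled / data.time_running : 0.0;
  double cycles = (double)data.value[0] * scale;
  double instrs = (double)data.value[1] * scale;
  printf("IPC = %.2f (multiplexing scale = %.2f)\n", instrs / cycles, scale);
  return 0;
}
```

With only two hardware events the group is unlikely to be multiplexed, so the scale factor stays at 1.0; open more groups than the PMU has counters and you will see `time_running` fall below `time_enabled`.
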
-Capturing multiple events simultaneously makes it possible to calculate various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) with the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). It is possible to detect when the thread wasn't running by requesting CPU cycles consumed (`UNHALTED_CORE_CYCLES`, only counts when the thread is running) and comparing against wall-clock time. Also, we can normalize the numbers to get the event rate per second/clock/instruction. For instance, measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED` we can get the `L3MPKI` metric. As you can see, the setup is very flexible.
+Capturing multiple events simultaneously makes it possible to calculate various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) with the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). It is possible to detect when the thread wasn't running by requesting CPU cycles consumed (`UNHALTED_CORE_CYCLES`, which only counts when the thread is running) and comparing it against wall-clock time. Also, we can normalize the numbers to get the event rate per second/clock/instruction. For instance, by measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED` we can get the `L3MPKI` metric. As you can see, the setup is very flexible.

The important property of grouping events is that the counters will be available atomically under the same `read` system call. These atomic bundles are very useful. First, it allows us to correlate events within each group. For example, let's assume we measure IPC for a region of code, and found that it is very low. In this case, we can pair two events (instructions and cycles) with a third one, say L3 cache misses, to check if it contributes to the low IPC that we're dealing with. If it doesn't, we can continue factor analysis using other events. Second, event grouping helps to mitigate bias in case a workload has different phases. Since all the events within a group are measured at the same time, they always capture the same phase.

-In some scenarios, instrumentation may become a part of a functionality or a feature. For example, a developer can implement an instrumentation logic that detects decrease in IPC (e.g., when there is a busy sibling hardware thread running) or decreasing CPU frequency (e.g., system throttling due to heavy load). When such event occurs, application automatically defers low-priority work to compensate for the temporarily increased load.
+In some scenarios, instrumentation may become part of the product's functionality or a feature. For example, a developer can implement instrumentation logic that detects a decrease in IPC (e.g., when there is a busy sibling hardware thread running) or a drop in CPU frequency (e.g., system throttling due to heavy load). 
When such an event occurs, the application automatically defers low-priority work to compensate for the temporarily increased load. [^1]: libpfm4 - [https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/) diff --git a/chapters/5-Performance-Analysis-Approaches/5-5 Sampling.md b/chapters/5-Performance-Analysis-Approaches/5-5 Sampling.md index 54d27e88bc..c806cec4f6 100644 --- a/chapters/5-Performance-Analysis-Approaches/5-5 Sampling.md +++ b/chapters/5-Performance-Analysis-Approaches/5-5 Sampling.md @@ -1,12 +1,10 @@ - - ## Sampling {#sec:profiling} Sampling is the most frequently used approach for doing performance analysis. People usually associate it with finding hotspots in the program. To put it more broadly, sampling helps to find places in the code that contribute to the highest number of certain performance events. If we consider finding hotspots, the problem can be reformulated as which place in the code consumes the biggest amount of CPU cycles. People often use the term "Profiling" for what is technically called sampling. According to [Wikipedia](https://en.wikipedia.org/wiki/Profiling_(computer_programming)),[^1] profiling is a much broader term and includes a wide variety of techniques to collect data, including interrupts, code instrumentation, and PMC. It may come as a surprise, but the simplest sampling profiler one can imagine is a debugger. In fact, you can identify hotspots as follows: a) run the program under the debugger; b) pause the program every 10 seconds; and c) record the place where it stopped. If you repeat b) and c) many times, you can build a histogram from those samples. The line of code where you stopped the most will be the hottest place in the program. Of course, this is not an efficient way to find hotspots, and we don't recommend doing this. It's just to illustrate the concept. Nevertheless, this is a simplified description of how real profiling tools work. Modern profilers are capable of collecting thousands of samples per second, which gives a pretty accurate estimate of the hottest places in a program. -As in the example with a debugger, the execution of the analyzed program is interrupted every time a new sample is captured. At the time of an interrupt, the profiler collects the snapshot of the program state, which constitutes one sample. Information collected for every sample may include an instruction address that was executed at the time of the interrupt, register state, call stack (see [@sec:secCollectCallStacks]), etc. Collected samples are stored in a dump file, which can be further used to display most time-consuming parts of the program, a call graph, etc. +As in the example with a debugger, the execution of the analyzed program is interrupted every time a new sample is captured. At the time of an interrupt, the profiler collects the snapshot of the program state, which constitutes one sample. Information collected for every sample may include an instruction address that was executed at the time of the interrupt, register state, call stack (see [@sec:secCollectCallStacks]), etc. Collected samples are stored in a dump file, which can be further used to display the most time-consuming parts of the program, a call graph, etc. ### User-Mode and Hardware Event-based Sampling @@ -22,7 +20,7 @@ In this section, we will discuss the mechanics of using PMCs with EBS. 
Figure @f ![Using performance counter for sampling](../../img/perf-analysis/SamplingFlow.png){#fig:Sampling width=80%} -After we have initialized the register, we start counting and let the benchmark run. Since we have configured a PMC to count cycles, it will be incremented every cycle. Eventually, it will overflow. At the time the register overflows, hardware will raise a PMI. The profiling tool is configured to capture PMIs and has an Interrupt Service Routine (ISR) for handling them. We do multiple steps inside the ISR: first of all, we disable counting; after that, we record the instruction which was executed by the CPU at the time the counter overflowed; then, we reset the counter to `N` and resume the benchmark. +After we have initialized the register, we start counting and let the benchmark run. Since we have configured a PMC to count cycles, it will be incremented every cycle. Eventually, it will overflow. At the time the register overflows, the hardware will raise a PMI. The profiling tool is configured to capture PMIs and has an Interrupt Service Routine (ISR) for handling them. We do multiple steps inside the ISR: first of all, we disable counting; after that, we record the instruction that was executed by the CPU at the time the counter overflowed; then, we reset the counter to `N` and resume the benchmark. Now, let us go back to the value `N`. Using this value, we can control how frequently we want to get a new interrupt. Say we want a finer granularity and have one sample every 1 million instructions. To achieve this, we can set the counter to `(unsigned) -1'000'000` so that it will overflow after every 1 million instructions. This value is also referred to as the "sample after" value. @@ -50,11 +48,11 @@ $ perf report -n --stdio ... ``` -Linux perf collected `35,035` samples, which means that there were the same number of process interrupts. We also used `-F 1000` which sets the sampling rate at 1000 samples per second. This roughly matches the overall runtime of 36.2 seconds. Notice, Linux perf provided the approximate number of total cycles elapsed. If we divide it by the number of samples, we'll have `156756064947 cycles / 35035 samples = 4.5 million` cycles per sample. That means that Linux perf set the number `N` to roughly `4'500'000` to collect 1000 samples per second. The number `N` can be adjusted by the tool dynamically according to the actual CPU frequency. +Linux perf collected `35,035` samples, which means that there were the same number of process interrupts. We also used `-F 1000` which sets the sampling rate at 1000 samples per second. This roughly matches the overall runtime of 36.2 seconds. Notice, that Linux perf provided the approximate number of total cycles elapsed. If we divide it by the number of samples, we'll have `156756064947 cycles / 35035 samples = 4.5 million cycles` per sample. That means that Linux perf set the number `N` to roughly `4'500'000` to collect 1000 samples per second. The number `N` can be adjusted by the tool dynamically according to the actual CPU frequency. -And of course, most valuable for us is the list of hotspots sorted by the number of samples attributed to each function. After we know what are the hottest functions, we may want to look one level deeper: what are the hot parts of code inside every function. To see the profiling data for functions that were inlined as well as assembly code generated for a particular source code region, we need to build the application with debug information (`-g` compiler flag). 
+And of course, most valuable for us is the list of hotspots sorted by the number of samples attributed to each function. After we know what are the hottest functions, we may want to look one level deeper: what are the hot parts of code inside every function? To see the profiling data for functions that were inlined as well as assembly code generated for a particular source code region, we need to build the application with debug information (`-g` compiler flag). -There are two main uses cases for the debug information: debugging a functional issue (a bug) and performance analysis. For functional debugging we need as much information as possible, which is the default when you pass `-g` compiler flag. However, if a user doesn't need full debug experience, having line numbers is enough for performance profiling. You can reduce the amount of generated debug information to just line numbers of symbols as they appear in the source code by using the `-gline-tables-only` option.[^4] +There are two main use cases for the debug information: debugging a functional issue (a bug) and performance analysis. For functional debugging, we need as much information as possible, which is the default when you pass the `-g` compiler flag. However, if a user doesn't need full debug experience, having line numbers is enough for performance profiling. You can reduce the amount of generated debug information to just line numbers of symbols as they appear in the source code by using the `-gline-tables-only` option.[^4] Linux `perf` doesn't have rich graphic support, so viewing hot parts of source code is not very convenient, but doable. Linux `perf` intermixes source code with the generated assembly, as shown below: @@ -74,9 +72,9 @@ Percent | Source code & Disassembly of x264 for cycles:ppp ... ``` -Most profilers with a Graphical User Interface (GUI), like Intel VTune Profiler, can show source code and associated assembly side-by-side. Also, there are tools that can visualize the output of Linux `perf` raw data with a rich graphical interface similar to Intel VTune and other tools. You'll see all that in more details in [@sec:secOverviewPerfTools]. +Most profilers with a Graphical User Interface (GUI), like Intel VTune Profiler, can show source code and associated assembly side-by-side. Also, there are tools that can visualize the output of Linux `perf` raw data with a rich graphical interface similar to Intel VTune and other tools. You'll see all that in more detail in [@sec:secOverviewPerfTools]. -Sampling gives a good statistical representation of a program's execution, however, one of the downsides of this technique is that it has blind spots and is not suitable for detecting abnormal behaviors. Each sample represents an aggregated view of a portion of a program's execution. Aggregation doesn't give us enough details of what exactly happened during that time interval. We cannot zoom in to a particular time interval to learn more about execution nuances. When we squash time intervals into samples, we lose valuable information and it becomes useless for analyzing events with a very short duration. Increasing sampling interval, e.g., more than 1000 samples per second may give you a better picture, but may still not be enough. As a solution, you should use tracing. +Sampling gives a good statistical representation of a program's execution, however, one of the downsides of this technique is that it has blind spots and is not suitable for detecting abnormal behaviors. 
Each sample represents an aggregated view of a portion of a program's execution. Aggregation doesn't give us enough details of what exactly happened during that time interval. We cannot zoom in to a particular time interval to learn more about execution nuances. When we squash time intervals into samples, we lose valuable information and it becomes useless for analyzing events with a very short duration. Increasing the sampling rate, e.g., to more than 1000 samples per second, may give you a better picture, but may still not be enough. As a solution, you should use tracing.

### Collecting Call Stacks {#sec:secCollectCallStacks}

@@ -88,11 +86,11 @@ Analyzing the source code of all the callers of `foo` might be very time-consumi

Collecting call stacks in Linux `perf` is possible with three methods:

-1. Frame pointers (`perf record --call-graph fp`). It requires binary to be built with `--fnoomit-frame-pointer`. Historically, the frame pointer (`RBP` register) was used for debugging since it enables us to get the call stack without popping all the arguments from the stack (also known as stack unwinding). The frame pointer can tell the return address immediately. It enables very cheap stack unwinding, which reduces profiling overhead, however, it consumes one additional register just for this purpose. At the time when the number of architectural register was small, using frame pointers was expensive in terms of runtime performance. Nowadays, community moves back to using frame pointers, because it provides better quality call stacks and low profiling overhead.
-2. DWARF debug info (`perf record --call-graph dwarf`). It requires binary to be built with DWARF debug information `-g` (`-gline-tables-only`). It also obtains call stacks through stack unwinding procedure, but this method is more expensive than using frame pointers.
-3. Intel Last Branch Record (LBR). This makes use of a hardware feature, and is accessed with the following command: `perf record --call-graph lbr`. It obtains call stacks by parsing the LBR stack (a set of hardware registers). The resulting call graph is not as deep as those produced by the first two methods. See more information about LBR in [@sec:lbr].
+1. Frame pointers (`perf record --call-graph fp`). It requires the binary to be built with `-fno-omit-frame-pointer`. Historically, the frame pointer (`RBP` register) was used for debugging since it enables us to get the call stack without popping all the arguments from the stack (also known as stack unwinding). The frame pointer can tell the return address immediately. It enables very cheap stack unwinding, which reduces profiling overhead; however, it consumes one additional register just for this purpose. At the time when the number of architectural registers was small, using frame pointers was expensive in terms of runtime performance. Nowadays, the community is moving back to using frame pointers because they provide better-quality call stacks at a low profiling overhead (a minimal sketch of this frame-pointer walk follows the list).
+2. DWARF debug info (`perf record --call-graph dwarf`). It requires the binary to be built with DWARF debug information: `-g` (or `-gline-tables-only`). It also obtains call stacks through a stack unwinding procedure, but this method is more expensive than using frame pointers.
+3. Intel Last Branch Record (LBR). This makes use of a hardware feature, and is accessed with the following command: `perf record --call-graph lbr`. It obtains call stacks by parsing the LBR stack (a set of hardware registers). The resulting call graph is not as deep as those produced by the first two methods. See more information about LBR in [@sec:lbr].

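To illustrate why the frame-pointer method in item 1 is so cheap, here is a small conceptual sketch (x86-64-specific, not taken from the book, and assuming every function on the stack was compiled with `-fno-omit-frame-pointer`): with frame pointers preserved, each frame begins with the saved `RBP` followed by the return address, so the saved `RBP` values form a linked list that a profiler can walk with a couple of loads per frame.

```c
#include <stddef.h>

// Frame layout with -fno-omit-frame-pointer on x86-64:
//   [saved RBP][return address]   <- RBP points at the saved RBP
struct frame {
  struct frame *prev;  // caller's saved RBP
  void         *ret;   // return address into the caller
};

// Fill `out` with up to `max` return addresses of the current call chain.
// A real unwinder must also check that `fp` stays inside the thread's stack.
static int unwind_fp(void **out, int max) {
  struct frame *fp = (struct frame *)__builtin_frame_address(0);
  int n = 0;
  while (fp != NULL && n < max) {
    out[n++] = fp->ret;
    fp = fp->prev;
  }
  return n;
}
```

Two loads per frame is all it takes, which is why `--call-graph fp` has much lower overhead than DWARF-based unwinding, at the cost of dedicating `RBP` to this role.
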
-Below is an example of collecting call stacks in a program using LBR. By looking at the output, we know that 55% of the time `foo` was called from `func1`, 33% of the time from `func2` and 11% from `fun3`. We can clearly see the distribution of the overhead between callers of `foo` and can now focus our attention on the hottest edge in the CFG of the program, which is `func1 -> foo`, but we should probably also pay attention to the edge `func2 -> foo`.
+Below is an example of collecting call stacks in a program using LBR. By looking at the output, we know that 55% of the time `foo` was called from `func1`, 33% of the time from `func2`, and 11% from `fun3`. We can clearly see the distribution of the overhead between callers of `foo` and can now focus our attention on the hottest edge in the CFG of the program, which is `func1 -> foo`, but we should probably also pay attention to the edge `func2 -> foo`.

```bash
$ perf record --call-graph lbr -- ./a.out
@@ -124,6 +122,6 @@ $ perf report -n --stdio --no-children
When using Intel VTune Profiler, one can collect call stacks data by checking the corresponding "Collect stacks" box while configuring analysis. When using the command-line interface, specify the `-knob enable-stack-collection=true` option.

[^1]: Profiling(wikipedia) - [https://en.wikipedia.org/wiki/Profiling_(computer_programming)](https://en.wikipedia.org/wiki/Profiling_(computer_programming)).
-[^4]: In the past there were LLVM compiler bugs when compiling with debug info (`-g`). Code transformation passes incorrectly treated the presence of debugging intrinsics which caused different optimizations decisions. It did not affect functionality, only performance. Some of them were fixed, but it's hard to say if any of them are still there.
+[^4]: In the past there were LLVM compiler bugs when compiling with debug info (`-g`). Code transformation passes incorrectly treated the presence of debugging intrinsics which caused different optimization decisions. It did not affect functionality, only performance. Some of them were fixed, but it's hard to say if any of them are still there.
[^7]: x264 benchmark - [https://openbenchmarking.org/test/pts/x264](https://openbenchmarking.org/test/pts/x264).
[^8]: Phoronix test suite - [https://www.phoronix-test-suite.com/](https://www.phoronix-test-suite.com/).