diff --git a/chapters/9-Optimizing-Computations/9-4 Vectorization.md b/chapters/9-Optimizing-Computations/9-4 Vectorization.md index 8b9c036a0b..3ebd3f2bec 100644 --- a/chapters/9-Optimizing-Computations/9-4 Vectorization.md +++ b/chapters/9-Optimizing-Computations/9-4 Vectorization.md @@ -6,11 +6,11 @@ On modern processors, the use of SIMD instructions can result in a great speedup Often vectorization happens automatically without any user intervention, this is called autovectorization. In such a situation, a compiler automatically recognizes the opportunity to produce SIMD machine code from the source code. Autovectorization could be a convenient solution because modern compilers generate fast vectorized code for a wide variety of programs. -However, in some cases, autovectorization does not succeed without intervention by the software engineer, perhaps based on feedback[^2] they get from, say, compiler optimization reports or profiling data. In such cases, programmers need to tell the compiler that a particular code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the autovectorization process and make sure that certain parts of the code are vectorized efficiently. However, this control is limited. We will provide several examples of using compiler hints in the subsequent sections. +However, in some cases, auto-vectorization does not succeed without intervention by the software engineer, perhaps based on feedback[^2] they get from, say, compiler optimization reports or profiling data. In such cases, programmers need to tell the compiler that a particular code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the auto-vectorization process and make sure that certain parts of the code are vectorized efficiently. However, this control is limited. 
We will provide several examples of using compiler hints in the subsequent sections. -It is important to note that there is a range of problems where SIMD is important and where autovectorization just does not work and is not likely to work in the near future. One example can be found in [@Mula_Lemire_2019]. Outer loop autovectorization is not currently attempted by compilers. They are less likely to vectorize floating-point code because results will differ numerically. Code involving permutations or shuffles across vector lanes is also less likely to autovectorize, and this is likely to remain difficult for compilers. +It is important to note that there is a range of problems where SIMD is important and where auto-vectorization just does not work and is not likely to work in the near future. One example can be found in [@Mula_Lemire_2019]. Outer loop auto-vectorization is not currently attempted by compilers. They are less likely to vectorize floating-point code because results will differ numerically. Code involving permutations or shuffles across vector lanes is also less likely to auto-vectorize, and this is likely to remain difficult for compilers. -There is one more subtle problem with autovectorization. As compilers evolve, optimizations that they make are changing. The successful autovectorization of code that was done in the previous compiler version may stop working in the next version, or vice versa. Also, during code maintenance or refactoring, the structure of the code may change, such that autovectorization suddenly starts failing. This may occur long after the original software was written, so it would be more expensive to fix or redo the implementation at this point. +There is one more subtle problem with auto-vectorization. As compilers evolve, optimizations that they make are changing. The successful auto-vectorization of code that was done in the previous compiler version may stop working in the next version, or vice versa. 
Also, during code maintenance or refactoring, the structure of the code may change, such that auto-vectorization suddenly starts failing. This may occur long after the original software was written, so it would be more expensive to fix or redo the implementation at this point. When it is absolutely necessary to generate specific assembly instructions, one should not rely on compiler autovectorization. In such cases, code can instead be written using compiler intrinsics, which we will discuss in [@sec:secIntrinsics]. In most cases, compiler intrinsics provide a 1-to-1 mapping to assembly instructions. Intrinsics are somewhat easier to use than inline assembly because the compiler takes care of register allocation, and they allow the programmer to retain considerable control over code generation. However, they are still often verbose and difficult to read and subject to behavioral differences or even bugs in various compilers. @@ -20,9 +20,9 @@ Note that when using intrinsics or a wrapper library, it is still advisable to w In the remainder of this section, we will discuss several of these approaches, especially inner loop vectorization because it is the most common type of autovectorization. The other two types, outer loop vectorization, and SLP (Superword-Level Parallelism) vectorization, are mentioned in Appendix B. -### Compiler Autovectorization. +### Compiler Auto-Vectorization. -Multiple hurdles can prevent auto-vectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may overflow, and this can prevent certain loop transformations. Another example is the assumption that the C programming language makes: pointers in the program may point to overlapping memory regions, which can make the analysis of the program very difficult. Another major hurdle is the design of the processor itself. 
In some cases, processors don’t have efficient vector instructions for certain operations. For example, performing predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion between signed integers to doubles because the result operates on vector registers of different sizes. Despite all of the challenges, the software developer can work around many of the challenges and enable vectorization. Later in this section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized by the compiler. +Multiple hurdles can prevent auto-vectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may overflow, and this can prevent certain loop transformations. Another example is the assumption that the C programming language makes: pointers in the program may point to overlapping memory regions, which can make the analysis of the program very difficult. Another major hurdle is the design of the processor itself. In some cases, processors don’t have efficient vector instructions for certain operations. For example, predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion from signed integers to doubles because the result operates on vector registers of different sizes. Despite these challenges, software developers can work around many of them and enable vectorization. Later in this section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized. 
The vectorizer is usually structured in three phases: legality-check, profitability-check, and transformation itself: @@ -38,7 +38,7 @@ The vectorizer is usually structured in three phases: legality-check, profitabil Discovering opportunities for improving vectorization should start by analyzing hot loops in the program and checking what optimizations were performed by the compiler. Checking compiler vectorization remarks (see [@sec:compilerOptReports]) is the easiest way to know that. Modern compilers can report whether a certain loop was vectorized, and provide additional details, e.g., vectorization factor (VF). In the case when the compiler cannot vectorize a loop, it is also able to tell the reason why it failed. -An alternative way to use compiler optimization reports is to check assembly output. It is best to analyze the output from a profiling tool that shows the correspondence between the source code and generated assembly instructions for a given loop. That way you only focus on the code that matters, i.e., the hot code. However, understanding assembly language is much more difficult than a high-level language like C++. It may take some time to figure out the semantics of the instructions generated by the compiler. However, this skill is highly rewarding and often provides valuable insights. Experienced developers can quickly tell whether the code was vectorized or not just by looking at instruction mnemonics and the register names used by those instructions. For example, in x86 ISA, vector instructions operate on packed data (thus have `P` in their name) and use `XMM`, `YMM`, or `ZMM` registers, e.g., `VMULPS XMM1, XMM2, XMM3` multiplies four single precision floats in `XMM2` and `XMM3` and saves the result in `XMM1`. But be careful, often people conclude from seeing `XMM` register being used, that it is vector code -- not necessary. For instance, the `VMULSS` instruction will only multiply one single-precision floating-point value, not four. 
+An alternative to using compiler optimization reports is to check the assembly output. It is best to analyze the output from a profiling tool that shows the correspondence between the source code and generated assembly instructions for a given loop. That way you only focus on the code that matters, i.e., the hot code. However, understanding assembly language is much more difficult than understanding a high-level language like C++. It may take some time to figure out the semantics of the instructions generated by the compiler. Still, this skill is highly rewarding and often provides valuable insights. Experienced developers can quickly tell whether the code was vectorized or not just by looking at instruction mnemonics and the register names used by those instructions. For example, in x86 ISA, vector instructions operate on packed data (thus have `P` in their name) and use `XMM`, `YMM`, or `ZMM` registers, e.g., `VMULPS XMM1, XMM2, XMM3` multiplies four single-precision floats in `XMM2` and `XMM3` and saves the result in `XMM1`. But be careful: people often conclude from seeing an `XMM` register in use that the code is vectorized, which is not necessarily true. For instance, the `VMULSS` instruction multiplies only one single-precision floating-point value, not four. There are a few common cases that developers frequently run into when trying to accelerate vectorizable code. Below we present four typical scenarios and give general guidance on how to proceed in each case. @@ -55,7 +55,7 @@ void vectorDependence(int *A, int n) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -While some loops cannot be vectorized due to the hard limitations described above, others could be vectorized when certain constraints are relaxed. There are situations when the compiler cannot vectorize a loop because it simply cannot prove it is legal to do so. Compilers are generally very conservative and only do transformations when they are sure it doesn't break the code. 
Such soft limitations could be relaxed by providing additional hints to the compiler. For example, when transforming the code that performs floating-point arithmetic, vectorization may change the behavior of the program. The floating-point addition and multiplication are commutative, which means that you can swap the left-hand side and the right-hand side without changing the result: `(a + b == b + a)`. However, these operations are not associative, because rounding happens at different times: `((a + b) + c) != (a + (b + c))`. The code in [@lst:VectIllegal] cannot be auto vectorized by the compiler. The reason is that vectorization would change the variable sum into a vector accumulator, and this will change the order of operations and may lead to different rounding decisions and a different result. +While some loops cannot be vectorized due to the hard limitations described above, others could be vectorized when certain constraints are relaxed. There are situations when the compiler cannot vectorize a loop because it simply cannot prove it is legal to do so. Compilers are generally very conservative and only do transformations when they are sure it doesn't break the code. Such soft limitations could be relaxed by providing additional hints to the compiler. For example, when transforming the code that performs floating-point arithmetic, vectorization may change the behavior of the program. The floating-point addition and multiplication are commutative, which means that you can swap the left-hand side and the right-hand side without changing the result: `(a + b == b + a)`. However, these operations are not associative, because rounding happens at different times: `((a + b) + c) != (a + (b + c))`. The code in [@lst:VectIllegal] cannot be auto-vectorized by the compiler. 
The reason is that vectorization would change the variable sum into a vector accumulator, and this will change the order of operations and may lead to different rounding decisions and a different result. Listing: Vectorization: floating-point arithmetic. @@ -83,7 +83,7 @@ a.cpp:4:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2 ... ``` -Unfortunately this flag involves subtle and potentially dangerous behavior changes, including for Not-a-Number, signed zero, infinity and subnormals. Because third-party code may not be ready for these effects, this flag should not be enabled across large sections of code without careful validation of the results, including for edge cases. +Unfortunately, this flag involves subtle and potentially dangerous behavior changes, including for Not-a-Number, signed zero, infinity, and subnormals. Because third-party code may not be ready for these effects, this flag should not be enabled across large sections of code without careful validation of the results, including for edge cases. Let's look at another typical situation when a compiler may need support from a developer to perform vectorization. When compilers cannot prove that a loop operates on arrays with non-overlapping memory regions, they usually choose to be on the safe side. Let's revisit the example from [@lst:optReport] provided in [@sec:compilerOptReports]. When the compiler tries to vectorize the code presented in [@lst:OverlappingMemRefions], it generally cannot do this because the memory regions of arrays `a`, `b`, and `c` can overlap. 
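To make the legality concern concrete, here is a minimal sketch of a call in which the arrays really do overlap. The names `addArrays` and `overlappingCall` are hypothetical and are not taken from the chapter's listings; the kernel only assumes the same general shape (`a[i] = b[i] + c[i]`) as the loop discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: nothing in the signature tells the compiler
// that a, b, and c refer to disjoint memory regions.
void addArrays(int *a, const int *b, const int *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

// Calls the kernel with a == buf + 1 and b == buf, so every iteration
// reads the element that the previous iteration has just written.
std::vector<int> overlappingCall(std::size_t n) {
  assert(n >= 1);
  std::vector<int> buf(n + 1, 1);  // 1, 1, 1, ...
  std::vector<int> ones(n, 1);     // plays the role of c
  addArrays(buf.data() + 1, buf.data(), ones.data(), n);
  return buf;
}
```

For `n = 4`, scalar semantics require the buffer to end up as `1, 2, 3, 4, 5`. A vectorized loop that loaded four elements of `b` before storing anything to `a` would produce `1, 2, 2, 2, 2` instead, which is why the compiler has to either prove that the regions are disjoint or guard the vectorized version with a runtime check.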
@@ -102,13 +102,13 @@ Here is the optimization report (enabled with `-fopt-info`) provided by GCC 10.2 ```bash $ gcc -O3 -march=core-avx2 -fopt-info a.cpp:2:26: optimized: loop vectorized using 32 byte vectors a.cpp:2:26: optimized: loop versioned for vectorization because of possible aliasing ``` -GCC has recognized potential overlap between memory regions of arrays `a`, `b`, and `c`, and created multiple versions of the same loop. The compiler inserted runtime checks[^36] for detecting if the memory regions overlap. Based on those checks, it dispatches between vectorized and scalar[^35] versions. In this case, vectorization comes with the cost of inserting potentially expensive runtime checks. If a developer knows that memory regions of arrays `a`, `b`, and `c` do not overlap, it can insert `#pragma GCC ivdep`[^37] right before the loop or use the `__restrict__ ` keyword as shown in [@lst:optReport2]. Such compiler hints will eliminate the need for the GCC compiler to insert the runtime checks mentioned earlier. +GCC has recognized potential overlap between memory regions of arrays `a`, `b`, and `c`, and created multiple versions of the same loop. The compiler inserted runtime checks[^36] to detect if the memory regions overlap. Based on those checks, it dispatches between vectorized and scalar[^35] versions. In this case, vectorization comes with the cost of inserting potentially expensive runtime checks. If a developer knows that memory regions of arrays `a`, `b`, and `c` do not overlap, they can insert `#pragma GCC ivdep`[^37] right before the loop or use the `__restrict__` keyword as shown in [@lst:optReport2]. Such compiler hints will eliminate the need for the GCC compiler to insert the runtime checks mentioned earlier. -By their nature, compilers are static tools: they only reason based on the code they work with. 
For example, some of the dynamic tools, such as Intel Advisor, can detect if issues like cross-iteration dependence or access to arrays with overlapping memory regions actually occur in a given loop. But be aware that such tools only provide a suggestion. Carelessly inserting compiler hints can cause real problems. +By their nature, compilers are static tools: they only reason based on the code they work with. In contrast, some dynamic tools, such as Intel Advisor, can detect if issues like cross-iteration dependence or access to arrays with overlapping memory regions occur in a given loop. But be aware that such tools only provide a suggestion. Carelessly inserting compiler hints can cause real problems. #### Vectorization Is Not Beneficial. @@ -133,7 +133,7 @@ a.cpp:3:3: remark: the cost-model indicates that vectorization is not beneficial ^ ``` -Users can force the Clang compiler to vectorize the loop by using the `#pragma` hint, as shown in [@lst:VectNotProfitOverriden]. However, keep in mind that the true fact of whether vectorization is profitable or not largely depends on the runtime data, for example, the number of iterations of the loop. Compilers don't have this information available,[^1] so they often tend to be conservative. Developers can use such hints when searching for performance headrooms. +Users can force the Clang compiler to vectorize the loop by using the `#pragma` hint, as shown in [@lst:VectNotProfitOverriden]. However, keep in mind that whether vectorization is profitable largely depends on the runtime data, for example, the number of iterations of the loop. Compilers don't have this information available,[^1] so they often tend to be conservative. Developers can use such hints when searching for performance headroom. Listing: Vectorization: not beneficial. 
@@ -146,7 +146,7 @@ void stridedLoads(int *A, int *B, int n) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Developers should be aware of the hidden cost of using vectorized code. Using AVX and especially AVX-512 vector instructions could lead to frequency downclocking or startup overhead, which on certain CPUs can also affect subsequent code over a period of several microseconds. The vectorized portion of the code should be hot enough to justify using AVX-512.[^38] For example, sorting 80 KiB was found to be sufficient to amortize this overhead and make vectorization worthwhile.[^39] +Developers should be aware of the hidden cost of using vectorized code. Using AVX and especially AVX-512 vector instructions could lead to frequency downclocking or startup overhead, which on certain CPUs can also affect subsequent code for several microseconds. The vectorized portion of the code should be hot enough to justify using AVX-512.[^38] For example, sorting 80 KiB was found to be sufficient to amortize this overhead and make vectorization worthwhile.[^39] #### Loop Vectorized but Scalar Version Used. @@ -154,7 +154,7 @@ In some cases, the compiler can successfully vectorize the code, but the vectori If the generated code is not executed, one possible reason for this is that the code that the compiler has generated assumes loop trip counts that are higher than what the program uses. For example, to vectorize efficiently on a modern CPU, programmers need to vectorize and utilize AVX2 and also unroll the loop 4-5 times to generate enough work for the pipelined FMA units. This means that each loop iteration needs to process around 40 elements. Many loops may run with loop trip counts that are below this value and may fall back to using the scalar remainder loop. It is easy to detect these cases because the scalar remainder loop would light up in the profiler, and the vectorized code would remain cold. 
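The trip-count arithmetic above can be sketched with a deliberately simplified model. The helper `splitTripCount` and the struct `TripSplit` are hypothetical names introduced only for illustration. The model assumes that the vectorized main loop consumes `vf * uf` elements per iteration (vectorization factor times unroll count) and that all leftover elements go to a scalar remainder loop; real compilers may instead emit a vectorized or masked epilogue.

```cpp
#include <cassert>
#include <cstddef>

struct TripSplit {
  std::size_t vectorElems;  // elements handled by the vectorized main loop
  std::size_t scalarElems;  // elements left for the scalar remainder loop
};

// Simplified model: the main loop only runs in whole steps of vf * uf
// elements; whatever does not fill a whole step falls to the remainder.
TripSplit splitTripCount(std::size_t n, std::size_t vf, std::size_t uf) {
  assert(vf > 0 && uf > 0);
  const std::size_t step = vf * uf;
  const std::size_t vectorElems = (n / step) * step;
  return {vectorElems, n - vectorElems};
}
```

Under this model, a loop with 30 iterations, 8 floats per AVX2 vector, and an unroll count of 4 never enters the vectorized body: `splitTripCount(30, 8, 4)` leaves all 30 elements to the scalar loop. With the same vector width and no unrolling, `splitTripCount(30, 8, 1)` sends 24 elements through the vectorized body and only 6 to the remainder.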
-The solution to this problem is to force the vectorizer to use a lower vectorization factor or unroll count, to reduce the number of elements that loops process and enable more loops with lower trip counts to visit the fast vectorized loop body. Developers can achieve that with the help of `#pragma` hints. For the Clang compiler, one can use `#pragma clang loop vectorize_width(N)` as shown in the [article](https://easyperf.net/blog/2017/11/09/Multiversioning_by_trip_counts)[^30]on easyperf blog. +The solution to this problem is to force the vectorizer to use a lower vectorization factor or unroll count, to reduce the number of elements that loops process and enable more loops with lower trip counts to visit the fast vectorized loop body. Developers can achieve that with the help of `#pragma` hints. For the Clang compiler, one can use `#pragma clang loop vectorize_width(N)` as shown in the [article](https://easyperf.net/blog/2017/11/09/Multiversioning_by_trip_counts)[^30] on the Easyperf blog. #### Loop Vectorized in a Suboptimal Way. @@ -170,7 +170,7 @@ Vectorization can also be achieved by rewriting parts of a program in a programm One such parallel language is Intel® Implicit SPMD Program Compiler [(ISPC)](https://ispc.github.io/),[^33] which we will cover a bit in this section. The ISPC language is based on the C programming language and uses the LLVM compiler infrastructure to emit optimized code for many different architectures. The key feature of ISPC is the "close to the metal" programming model and performance portability across SIMD architectures. It requires a shift from the traditional thinking of writing programs but gives programmers more control over CPU resource utilization. -Another advantage of ISPC is its interoperability and ease of use. ISPC compiler generates standard object files that can be linked with the code generated by conventional C/C++ compilers. 
ISPC code can be easily plugged into any native project since functions written with ISPC can be called as if it was C code. +Another advantage of ISPC is its interoperability and ease of use. The ISPC compiler generates standard object files that can be linked with the code generated by conventional C/C++ compilers. ISPC code can be easily plugged into any native project since functions written with ISPC can be called as if they were C functions. [@lst:ISPC_code] shows a simple example of a function that we presented earlier in [@lst:VectIllegal], rewritten with ISPC. ISPC considers that the program will run in parallel instances, based on the target instruction set. For example, when using SSE with `float`s, it can compute 4 operations in parallel. Each program instance would operate on vector values of `i` being `(0,1,2,3)`, then `(4,5,6,7)`, and so on, effectively computing 4 sums at a time. As you can see, a few keywords not typical for C and C++ are used: @@ -210,7 +210,7 @@ Since the function `calcSum` must return a single value (a `uniform` variable) a [^33]: ISPC compiler: [https://ispc.github.io/](https://ispc.github.io/). [^34]: Some parts of the Unreal Engine that used SIMD intrinsics were rewritten using ISPC, which gave speedups: [https://software.intel.com/content/www/us/en/develop/articles/unreal-engines-new-chaos-physics-system-screams-with-in-depth-intel-cpu-optimizations.html](https://software.intel.com/content/www/us/en/develop/articles/unreal-engines-new-chaos-physics-system-screams-with-in-depth-intel-cpu-optimizations.html). [^35]: But the scalar version of the loop still may be unrolled. -[^36]: See example on the easyperf blog: [https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD](https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD). +[^36]: See the example on the Easyperf blog: [https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD](https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD). [^37]: It is a GCC-specific pragma. 
For other compilers, check the corresponding manuals. [^38]: For more details read this blog post: [https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html). [^39]: Study of AVX-512 downclocking: in [VQSort readme](https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking)