Compiler Flags for Different Architectures
Last updated: 3/29/19
This page lists flags that are used to compile Spatter and STREAM comparisons on different architectures.
Some general notes for STREAM can be found at this blog post:
Additionally:
- ICC generally will generate the best quality code for STREAM and Spatter on Intel architectures.
- Streaming (non-temporal) loads/stores may be needed to reach "peak" performance.
Common flags for Intel compilers with OpenMP backend:
-Ofast -qopenmp -qopenmp-link=static -fargument-noalias
TBD - when do we use `-ffreestanding`?
Note that in many cases, you can check for vectorized instructions by generating the assembly with the `-S` flag or by using `objdump -d <compiled_app>` to look at the assembly code. As mentioned in this StackOverflow post, you want to look for instructions with names like `vgatherpf0qpd`.
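The check described above can be sketched as a small filter over disassembly text. This is a minimal sketch, not part of Spatter: the function name `count_gather_scatter` and the mnemonic pattern are assumptions, and the pattern only targets x86 gather/scatter instructions, so it would need extending for other ISAs.

```shell
# Count x86 gather/scatter instructions in disassembly text read from stdin.
# The mnemonic pattern is an assumption; adjust it for your target ISA.
count_gather_scatter() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable under "set -e" while still printing the count.
    grep -E -c '[[:space:]]vp?(gather|scatter)[a-z0-9]*' || true
}

# Hypothetical usage (replace <compiled_app> with your binary):
#   objdump -d <compiled_app> | count_gather_scatter
```

A count of 0 suggests the compiler did not emit hardware gather/scatter instructions for the kernel, although the loop may still have been vectorized in another way.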
| Architecture | Short Name | Compiler | Flags | Notes |
|---|---|---|---|---|
| Sandy Bridge | SNB | icc | -march=sandybridge | |
| Broadwell | BDW | icc | -march=broadwell | |
| Skylake | SKL | icc | -march=skylake | |
| Skylake with AVX512 | SKL | icc | -march=skylake-avx512 | |
| | | cce | -hvector2 or -hvector3 | moderate or aggressive vectorization |
| | | | -hvector1 or -hscalar1/2/3 | limited automatic vectorization |
| Knights Landing with AVX512 and MCDRAM | KNL | icpc | icpc -xCOMMON-AVX512 | Compilation notes |
| Power9 | PWR9 | codexl | | Use xlc_r to create thread-safe version of Spatter |
| | | | -qtune=pwr9 | Tune for Power9 arch (auto tunes for arch where compiled) |
| | | | -qsimd=auto | Implied for -O3 or higher opt level |
| | | | -qenablevmx | Enable vector generation |
| | | | -qhot=vector | |
| ARM TX2 | TX2 | armclang | -O3 -mcpu=native | Let compiler decide based on host |
| | | | -O3 -mcpu=thunderx2t99 | |
| | | gcc | -ftree-vectorize | |
To use HBM on KNL:
#Check mem settings
numactl -H
#Run on NUMA mem region 1 (HBM)
numactl --membind 1 ./run-app
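The commands above assume the HBM is NUMA node 1. As a hedged sketch, the node can instead be looked up from the `numactl -H` output: when MCDRAM is exposed as its own NUMA node (flat mode), that node typically lists no CPUs. The function name `cpuless_nodes` is hypothetical, and the parsing assumes the usual `node N cpus: ...` line format.

```shell
# Print the IDs of NUMA nodes that have no CPUs attached, reading
# "numactl -H" output from stdin. On KNL in flat mode the MCDRAM (HBM)
# node is expected to be one of these; this is an assumption to verify.
cpuless_nodes() {
    # A line such as "node 1 cpus:" has exactly three fields when the
    # CPU list is empty.
    awk '/^node [0-9]+ cpus:/ && NF == 3 { print $2 }'
}

# Hypothetical usage:
#   numactl -H | cpuless_nodes
#   numactl --membind <node> ./run-app
```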
Returns info on which loops were vectorized and why:
-qopt-report=1 -qopt-report-phase=vec
Returns info on loops that were not vectorized and why:
-qopt-report-phase=vec,loop -qopt-report=2
You can also use the following high-level flag:
-vec-report=3
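The Intel optimization reports can get long, so a quick tally is sometimes handy. The following is a minimal sketch that assumes the report contains remarks with the text `LOOP WAS VECTORIZED` and `loop was not vectorized`; the function name `summarize_optrpt` and the report file name in the usage line are hypothetical.

```shell
# Tally vectorization remarks in an Intel optimization report read from stdin.
# Assumes remark text of the form "LOOP WAS VECTORIZED" and
# "loop was not vectorized: <reason>"; check this against your compiler version.
summarize_optrpt() {
    awk '
        /LOOP WAS VECTORIZED/     { ok++ }
        /loop was not vectorized/ { missed++ }
        END { printf "vectorized=%d not_vectorized=%d\n", ok, missed }
    '
}

# Hypothetical usage (report file name is a placeholder):
#   summarize_optrpt < stream.optrpt
```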
The `-qreport` or `-qlist` flags can be used to generate high-order transformation (HOT) reports or print an object listing of the code.
Using the Arm HPC compiler, we can also print out the vectorization report:
-Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize
As an example:
armclang -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize example.c -gline-tables-only 2> vecreport.txt
Alternatively, you will need to use the armllvm-objdump with the correct disassemble flags or you can use the -S flag to generate the assembly code during compilation.
#From the basic SVE example; ld1w and st1w are SVE instructions
$armllvm-objdump -disassemble -mattr=+sve example &> example.dis
#Sample output from example.dis - ld1w and st1w are both SVE instructions
400898: a0 42 48 a5 ld1w { z0.s }, p0/z, [x21, x8, lsl #2]
40089c: c1 42 48 a5 ld1w { z1.s }, p0/z, [x22, x8, lsl #2]
4008a0: 00 04 a1 04 sub z0.s, z0.s, z1.s
4008a4: e0 42 48 e5 st1w { z0.s }, p0, [x23, x8, lsl #2]
#Option 2 - generate assembly during compilation
$armclang -O3 -S --target=aarch64-arm-none-eabi -march=armv8-a+sve -o example.s example.c
#Sample output from example.s
.LBB1_3: // =>This Inner Loop Header: Depth=1
ld1w { z0.s }, p0/z, [x21, x8, lsl #2]
ld1w { z1.s }, p0/z, [x22, x8, lsl #2]
sub z0.s, z0.s, z1.s
st1w { z0.s }, p0, [x23, x8, lsl #2]
With GCC, use `-fopt-info-missed-vec`, or `-fopt-info-vec-missed=vec.miss` to print to a file.
-Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
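The `-Rpass` remarks above are written to stderr, one line per remark, each tagged with the flag that produced it. As a minimal sketch, assuming that tagged format, the remarks can be tallied as follows; the function name `summarize_rpass` is hypothetical.

```shell
# Tally clang/armclang loop-vectorizer remarks read from stdin.
# Counts lines by their trailing flag tag so that the extra
# -Rpass-analysis lines are not double-counted as missed loops.
summarize_rpass() {
    awk '
        /\[-Rpass=loop-vectorize\]/        { ok++ }
        /\[-Rpass-missed=loop-vectorize\]/ { missed++ }
        END { printf "vectorized=%d missed=%d\n", ok, missed }
    '
}

# Hypothetical usage, with remarks captured as in the armclang example:
#   summarize_rpass < vecreport.txt
```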