[Flang] TSVC s233: loop interchange is necessary for vectorization #110612

yus3710-fj · 2024-10-01T01:14:41Z

Flang can't vectorize the loops in s233 of TSVC, while Clang can vectorize the equivalent loop written in C.
(Clang doesn't actually vectorize the loop, because its cost model indicates that vectorizing the strided accesses is not beneficial.)

Fortran

! Fortran version
      subroutine s233 (ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)

      integer ntimes, ld, n, i, nl, j
      real a(n), b(n), c(n), d(n), e(n), aa(ld,n), bb(ld,n), cc(ld,n)

      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s233 ')
      do 10 i = 2,n
         do 20 j = 2,n
            aa(i,j) = aa(i,j-1) + cc(i,j)
  20     continue
         do 30 j = 2,n
            bb(i,j) = bb(i-1,j) + cc(i,j)
  30     continue
  10  continue
      call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
      end

$ flang-new -v -O3 -flang-experimental-integer-overflow s233.f -S -Rpass=vector -Rpass-analysis=vector -Rpass-missed=vector
flang-new version 20.0.0git (https://github.com/llvm/llvm-project.git 2c770675ce36402b51a320ae26f369690c138dc1)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /path/to/build/bin
Build config: +assertions
Found candidate GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11
Selected GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11
Candidate multilib: .;@m64
Selected multilib: .;@m64
 "/path/to/build/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -target-cpu generic -target-feature +outline-atomics -target-feature +v8a -target-feature +fp-armv8 -target-feature +neon -fversion-loops-for-stride -flang-experimental-integer-overflow -Rpass=vector -Rpass-analysis=vector -Rpass-missed=vector -resource-dir /path/to/build/lib/clang/20 -mframe-pointer=non-leaf -O3 -o /dev/null -x f95-cpp-input s233.f
path/to/s233.f:13:13: remark: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Unsafe indirect dependence. Memory location is the same as accessed at s233.f:13:13 [-Rpass-analysis=loop-vectorize]
path/to/s233.f:12:10: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
path/to/s233.f:10:13: remark: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Unsafe indirect dependence. Memory location is the same as accessed at s233.f:10:13 [-Rpass-analysis=loop-vectorize]
path/to/s233.f:9:10: remark: loop not vectorized [-Rpass-missed=loop-vectorize]

C

// C version
#define LEN 32000
#define LEN2 256
float a[LEN], b[LEN], c[LEN], d[LEN], e[LEN];
float aa[LEN2][LEN2], bb[LEN2][LEN2], cc[LEN2][LEN2];

int s233() {
  init( "s233 ");
  start_t = clock();

  for (int i = 1; i < LEN2; i++) {
    for (int j = 1; j < LEN2; j++) {
      aa[j][i] = aa[j-1][i] + cc[j][i];
    }
    for (int j = 1; j < LEN2; j++) {
      bb[j][i] = bb[j][i-1] + cc[j][i];
    }
  }
  dummy(a, b, c, d, e, aa, bb, cc, 0.);
  return 0;
}

$ clang -O3 s233.c -S -Rpass=vector -Rpass-analysis=vector -Rpass-missed=vector
s233.c:15:4: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-analysis=loop-vectorize]
   15 |                         for (int j = 1; j < LEN2; j++) {
      |                         ^
s233.c:15:4: remark: interleaved loop (interleaved count: 2) [-Rpass=loop-vectorize]
s233.c:13:16: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
   13 |                                 aa[j][i] = aa[j-1][i] + cc[j][i];
      |                                            ^
s233.c:12:4: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
   12 |                         for (int j = 1; j < LEN2; j++) {
      |                         ^
s233.c:16:16: remark: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors [-Rpass-missed=slp-vectorizer]
   16 |                                 bb[j][i] = bb[j][i-1] + cc[j][i];
      |                                            ^

One of the causes seems to be the same as in #110611. In this case, however, loop optimizations such as loop interchange can help vectorization. Address calculations are linearized in LLVM IR (see the IR for the aa(i,j-1) load below), so loop optimizations in MLIR or based on the polyhedral model might be necessary.
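As a rough illustration of the interchange mentioned above, the sketch below takes only the `aa` nest of the C version and swaps the loop order. It assumes the outer `i` loop has first been distributed into separate `aa` and `bb` nests, which looks legal here because the two nests only share the read-only `cc`. With `i` innermost, the accesses are unit-stride and the `j-1` dependence is carried by the outer loop. The size `N` and all function names are hypothetical and not part of TSVC, and this is not what Flang or LLVM currently emits.

```c
#include <assert.h>
#include <string.h>

#define N 8 /* hypothetical small size; TSVC's C version uses LEN2 = 256 */

/* Original order for the aa nest: i outer, j inner.
   The inner loop strides by one row and carries the aa[j-1] dependence. */
void aa_original(float aa[N][N], float cc[N][N]) {
  for (int i = 1; i < N; i++)
    for (int j = 1; j < N; j++)
      aa[j][i] = aa[j - 1][i] + cc[j][i];
}

/* Interchanged order: j outer, i inner.
   The inner loop is unit-stride, and its iterations are independent
   as long as aa and cc do not overlap. */
void aa_interchanged(float aa[N][N], float cc[N][N]) {
  for (int j = 1; j < N; j++)
    for (int i = 1; i < N; i++)
      aa[j][i] = aa[j - 1][i] + cc[j][i];
}

/* Fill two copies of aa identically, run one loop order on each, and
   return 1 if the results are bitwise identical. */
int interchange_matches(void) {
  float a1[N][N], a2[N][N], cc[N][N];
  for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
      a1[j][i] = a2[j][i] = 0.5f * (float)(j * N + i);
      cc[j][i] = 1.0f / (float)(i + j + 1);
    }
  }
  aa_original(a1, cc);
  aa_interchanged(a2, cc);
  return memcmp(a1, a2, sizeof a1) == 0;
}

/* Self-check: the interchange must not change any element. */
void check_interchange(void) {
  assert(interchange_matches());
}
```

Each `aa[j][i]` is still computed by the same single addition from the same operands, so the two orders should agree exactly. The `bb` nest is different: its dependence is on `i-1`, so moving `i` innermost there would put the dependence in the inner loop.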

  %82 = load i32, ptr %18, align 4 ;; i
  %83 = sext i32 %82 to i64
  %84 = load i32, ptr %17, align 4 ;; j
  %85 = sub i32 %84, 1
  %86 = sext i32 %85 to i64
  %87 = sub nsw i64 %83, 1
  %88 = mul nsw i64 %87, 1
  %89 = mul nsw i64 %88, 1
  %90 = add nsw i64 %89, 0
  %91 = mul nsw i64 1, %29 ;; ld
  %92 = sub nsw i64 %86, 1
  %93 = mul nsw i64 %92, 1
  %94 = mul nsw i64 %93, %91
  %95 = add nsw i64 %94, %90 ;; (i-1) + ((j-1)-1) * ld
  %96 = mul nsw i64 %91, %33 ;; ld*n
  %97 = getelementptr float, ptr %10, i64 %95
  %98 = load float, ptr %97, align 4 ;; aa(i,j-1)
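To make the arithmetic in the IR above easier to follow, here is a small C sketch of the same flattening. A column-major, 1-based element `x(i,j)` with leading dimension `ld` sits at offset `(i-1) + (j-1)*ld` from the base, so `aa(i,j-1)` lands at `(i-1) + ((j-1)-1)*ld`, matching the comment on `%95`. The function names are hypothetical. Because `ld` is a runtime value, the two subscripts end up folded into one integer expression, which is presumably why the per-dimension structure is harder to recover at the LLVM IR level.

```c
#include <assert.h>
#include <stddef.h>

/* Flat element offset of a Fortran-style (column-major, 1-based)
   element x(i,j) with leading dimension ld. */
ptrdiff_t column_major_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t ld) {
  return (i - 1) + (j - 1) * ld;
}

/* Offset of the aa(i,j-1) load from the IR: (i-1) + ((j-1)-1)*ld. */
ptrdiff_t aa_prev_column_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t ld) {
  return column_major_offset(i, j - 1, ld);
}

/* Self-check of the strides seen by each loop variable. */
void check_strides(void) {
  const ptrdiff_t ld = 10; /* hypothetical leading dimension */
  assert(aa_prev_column_offset(3, 5, ld) == 32);
  /* Stepping j moves by ld elements (the strided direction). */
  assert(column_major_offset(3, 5, ld) - column_major_offset(3, 4, ld) == ld);
  /* Stepping i moves by one element (the contiguous direction). */
  assert(column_major_offset(4, 5, ld) - column_major_offset(3, 5, ld) == 1);
}
```

The stride checks show why the original inner `j` loop is the strided direction in the Fortran version, while `i` is the contiguous one.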


yus3710-fj added the loopoptim and flang (Flang issues not falling into any other category) labels on Oct 1, 2024
