Numerics issue with vectorized conv2d #820

Open · newling opened this issue Oct 3, 2024 · 2 comments · May be fixed by #867

newling (Contributor) commented Oct 3, 2024

Enabling vectorization (see #789) for convolution results in numerical failure. The values are only slightly off, but they are definitely not correct: this is not a floating-point rounding issue. Experiments with the input values suggest that the input image data (i.e. not the kernel data) is being read incorrectly, with an incorrect offset inside the scf.for loop (not confirmed).

Some things I've tried:

- Setting the LLVM optimization level to 0 (the default is 2) has no effect.
- Inverting the scf.for loop order has no effect.
- Using different versions of peano has no effect.

The task is to find the source of the error and, if it lies in peano, create a reproducer for the team.

Attached are the ll and opt.ll files (vectorized and unvectorized): ll_files.zip (they're quite small; vectorized_input.opt.ll is only 94 lines).

The MLIR IR looks basically identical except for the inner loop.

With vectorization:

```mlir
scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read %reinterpret_cast_21[%c0, %arg1, %arg3, %arg2, %c0], %cst {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      %1 = vector.transfer_read %reinterpret_cast_22[%arg1, %arg2, %arg3, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x4x1x8x4xbf16>, vector<8x4xbf16>
      %2 = vector.transfer_read %reinterpret_cast[%c0, %c0, %c0, %c0, %c0], %cst_20 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : memref<1x1x4x1x4xf32>, vector<4x4xf32>
      %3 = arith.extf %0 : vector<4x8xbf16> to vector<4x8xf32>
      %4 = arith.extf %1 : vector<8x4xbf16> to vector<8x4xf32>
      %5 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %3, %4, %2 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
      vector.transfer_write %5, %reinterpret_cast[%c0, %c0, %c0, %c0, %c0] {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : vector<4x4xf32>, memref<1x1x4x1x4xf32>
    }
  }
}
```

Without vectorization:

```mlir
scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %subview = memref.subview %reinterpret_cast_20[%arg1, %arg2, %arg3, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x4x1x8x4xbf16> to memref<8x4xbf16, strided<[4, 1], offset: ?>>
      %subview_25 = memref.subview %reinterpret_cast[0, 0, 0, 0, 0] [1, 1, 4, 1, 4] [1, 1, 1, 1, 1] : memref<1x1x4x1x4xf32> to memref<4x4xf32, strided<[4, 1]>>
      %subview_26 = memref.subview %reinterpret_cast_21[0, %arg1, %arg3, %arg2, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x4x6x8xbf16> to memref<4x8xbf16, strided<[8, 1], offset: ?>>
      scf.for %arg4 = %c0 to %c4 step %c1 {
        scf.for %arg5 = %c0 to %c4 step %c1 {
          scf.for %arg6 = %c0 to %c8 step %c1 {
            %0 = memref.load %subview_26[%arg4, %arg6] : memref<4x8xbf16, strided<[8, 1], offset: ?>>
            %1 = memref.load %subview[%arg6, %arg5] : memref<8x4xbf16, strided<[4, 1], offset: ?>>
            %2 = memref.load %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
            %3 = arith.extf %0 : bf16 to f32
            %4 = arith.extf %1 : bf16 to f32
            %5 = arith.mulf %3, %4 : f32
            %6 = arith.addf %2, %5 : f32
            memref.store %6, %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
          }
        }
      }
    }
  }
}
```

In the vectorized case, the vector.contract gets lowered to an aievec.matmul, which in turn gets lowered to

```mlir
%38 = "xllvm.intr.aie2.bf.mac16.conf"(%33, %34, %37, %36) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64>
```
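For reference, the intermediate aievec step would look roughly like the sketch below. This is a hand-written approximation: the operand names are invented, and the exact aievec.matmul assembly syntax may differ from the dialect's current form.

```mlir
// Approximate intermediate form between the vector.contract and the intrinsic:
// a single 4x8 * 8x4 bf16 matmul accumulating into a 4x4 f32 vector.
%acc_next = aievec.matmul %lhs, %rhs, %acc : vector<4x8xbf16>, vector<8x4xbf16> into vector<4x4xf32>
```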
newling (Author) commented Oct 3, 2024

I've narrowed it down to the alignment of the loads: if the alignments are set/forced sufficiently low, the numerics with vectorization are fine. I am a bit perplexed why the alignment needs to be as low as it does; consider the 2 IRs in the zip file:
opts.zip

They are basically identical, except for some loads which have alignment 4 in one file and alignment 2 in the other. The case with alignment 2 gives numerically correct results; the one with alignment 4 does not. What I find confusing is that alignment 4 is surely enough here: none of the strides in any of the loads is less than 8 elements (16 bytes for bf16).
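To make the comparison concrete, the difference between the two files amounts to the align attribute on otherwise identical loads. A minimal hypothetical sketch (the value and pointer names are made up; the real IR is in opts.zip):

```llvm
; align-2 variant (numerically correct):
%v2 = load <32 x bfloat>, ptr %p, align 2

; align-4 variant (numerically incorrect), identical apart from the alignment:
%v4 = load <32 x bfloat>, ptr %p, align 4
```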

newling (Author) commented Oct 4, 2024

Running the above 2 IRs through Compiler Explorer:

http://xsjsda132:10240/z/6qc4cc

align 2: with_align_2.txt
align 4: with_align_4.txt

newling added a commit that referenced this issue Oct 9, 2024
This PR switches all numerical convolution tests to use the objectFifo
pipeline. With respect to the new tiling strategy:

1) A single **column** is currently used. Targeting multiple columns
results in `error: 'aie.memtile_dma' op could not find and assign a
valid BD id`. This will be investigated as follow-up work:
#821

2) There is no longer interleaving of compute and L2->L1 data movement,
which means #619 becomes
low priority / obsolete

3) L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.

4) Channel-first convolution is completely unsupported; we expect high-level
transforms to convert to channel-last before reaching our backend.

5) Vectorization is not currently enabled, due to issues with alignment.
See follow-up task #820.
This is functionally ok for now, as peano can scalarize code for all
data types.
newling linked a pull request Oct 31, 2024 that will close this issue