Coalescing loops #907

Open
newling opened this issue Nov 16, 2024 · 0 comments
newling commented Nov 16, 2024

For iree-amd-aie's matmul pipeline, this is the state we're in with coalescing enabled, just before the affine-to-standard pass:

#map = affine_map<()[s0, s1] -> (s0 * 256 + s1 * 32)>
#map1 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 32)>
#map2 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 16)>
...

scf.for %arg1 = %c0 to %c256 step %c1 {
  %0:3 = affine.delinearize_index %arg1 into (8, 8, 4) : index, index, index
  %1 = affine.apply #map()[%0#2, %0#0]
  %3 = affine.apply #map1()[%0#1, %0#2]
  %5 = affine.apply #map2()[%0#1, %0#0]
  %2 = vector.transfer_read %collapse_shape_85[%1], %cst {in_bounds = [true]} : memref<1024xbf16, strided<[1]>>, vector<32xbf16>
  %4 = vector.transfer_read %collapse_shape_86[%3], %cst {in_bounds = [true]} : memref<1024xbf16, strided<[1]>>, vector<32xbf16>
  %6 = vector.transfer_read %collapse_shape_87[%5], %cst_50 {in_bounds = [true]} : memref<1024xf32, strided<[1]>>, vector<16xf32>
  ...
}
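
For context, a minimal sketch (not pass output; value names are illustrative) of the arithmetic the affine.delinearize_index above has to perform once lowered: the flat induction variable is split back into three indices via division and remainder by the basis strides 32 (= 8 * 4) and 4.

%c4 = arith.constant 4 : index
%c32 = arith.constant 32 : index         // 8 * 4, stride of the outermost index
%i = arith.divui %arg1, %c32 : index     // outermost index, 0..7  (%0#0)
%r = arith.remui %arg1, %c32 : index
%j = arith.divui %r, %c4 : index         // middle index, 0..7     (%0#1)
%k = arith.remui %r, %c4 : index         // innermost index, 0..3  (%0#2)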

And if there is no coalescing, the IR looks like:

#map = affine_map<()[s0, s1] -> (s0 * 256 + s1 * 32)>
#map1 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 32)>
#map2 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 16)>
...
scf.for %arg1 = %c0 to %c8 step %c1 {
  scf.for %arg2 = %c0 to %c8 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %1 = affine.apply #map()[%arg3, %arg1]
      %3 = affine.apply #map1()[%arg2, %arg3]
      %0 = affine.apply #map2()[%arg2, %arg1]
      %2 = vector.transfer_read %collapse_shape[%1], %cst {in_bounds = [true]} : memref<1024xbf16>, vector<32xbf16>
      %4 = vector.transfer_read %collapse_shape_53[%3], %cst {in_bounds = [true]} : memref<1024xbf16>, vector<32xbf16>
      %5 = vector.transfer_read %collapse_shape_54[%0], %cst_50 {in_bounds = [true]} : memref<1024xf32>, vector<16xf32>
      ...
    }
  }
}

Question: why do we want to coalesce here in the first place? As far as I can tell, the coalesced case can never get away with as little index arithmetic as the uncoalesced case. You might be able to convert the coalesced case to something without modular arithmetic, but you'll still need some division shenanigans, which you don't need if you don't coalesce: without coalescing, %arg1, %arg2 and %arg3 are exactly the indices we want, and the 3 loops get translated to simple add-compare logic in convert-scf-to-cf. The only reason I can think of to coalesce is that after convert-scf-to-cf you then have cf.cond_br logic which is "simple": just a single variable iterating to 256 (sketched below), rather than a waterfall of 3 counters (to 8, 8, and 4 respectively). But why is this desirable, and how does it help llvm/peano?
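
For concreteness, here is a hand-written sketch (block names illustrative, not actual convert-scf-to-cf output) of the control flow the coalesced loop turns into: a single counter with one add and one compare per iteration, while the per-iteration div/rem from the delinearization stays in the body.

  cf.br ^header(%c0 : index)
^header(%iv: index):                         // single counter, 0..255
  %cond = arith.cmpi slt, %iv, %c256 : index
  cf.cond_br %cond, ^body, ^exit
^body:
  // delinearize %iv (div/rem as sketched above), then the transfer_reads ...
  %next = arith.addi %iv, %c1 : index
  cf.br ^header(%next : index)
^exit:
  ...

The uncoalesced version would instead produce three of these counter/compare structures nested inside one another, but with no div/rem in the innermost body.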

(@MaheshRavishankar @Abhishek-Varma)
