[PIPELINER] Refactor pipeliner lowering. #5989
…to IR. Improving debug dumps
…out blocked layout optimization
So, TL;DR: this is a fairly coarse-grained refactor of loop scheduling and software pipelining into clear, separable steps?
// numStages) to them, trying to populate the allowed stages. This
// step will be at some point extracted to separate pass that will be run
// only for loops missing the latency information.
assignLatencies(moduleOp, numStages);
OOoooooooohhhh. This looks super nice!
}
// Wait until there are 0 outstanding async dot ops.
builder.setInsertionPointAfter(forOp);
auto WarpGroupDotWaitAfterLoop = builder.create<ttng::WarpGroupDotWaitOp>(
the long awaited death of this file!
Yeah, with some improvements so that it can lower more or less anything you throw at it, without making assumptions about what can come out of the current scheduling.
LGTM!
@@ -0,0 +1,889 @@
// RUN: triton-opt %s -split-input-file -allow-unregistered-dialect -tritongpu-test-pipeline-lower-loop -canonicalize | FileCheck %s
really nice!
Impressive work. This is significantly cleaner and better layered!
Are there any specific changes you can call out that made the overall pipeliner more robust? I'm curious to know about them, and it's not obvious from reading the PR. :P
One thing I'm very excited about in this PR is that it significantly improves the testability of the different pieces of the pipeliner. For instance, the lowering can be tested independently, so we can test all of its corner cases in isolation.
… async cp lowering
The other gap that this PR closes is the lack of a fallback to pipelining in registers. Previously there was a handshake between scheduling and lowering: scheduling was not supposed to generate anything that lowering couldn't pipeline in shmem. The new lowering can always fall back to pipelining in registers and should be able to pipeline basically any scheduled IR that comes its way.
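To make the fallback idea concrete, here is a minimal standalone sketch. It is a toy model, not the code from this PR: `ToyScheduledLoad`, `ToyBuffering`, `chooseBuffering`, and the `canUseSharedMem` flag are all hypothetical names invented for illustration. It only shows the decision shape described above, where lowering never rejects a schedule and picks registers when shmem is not an option.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a load that the scheduler has already
// assigned to a pipeline stage. None of these names come from Triton.
struct ToyScheduledLoad {
  std::string name;
  int stage;            // stage the load itself was scheduled in
  int firstUseStage;    // stage of the first op that consumes the load
  bool canUseSharedMem; // assumption: lowering knows whether shmem works
};

enum class ToyBuffering { None, SharedMemory, Registers };

// Decide how to carry the loaded value across stages. The point of the
// sketch: there is no "unsupported" result, registers are always a
// valid fallback.
ToyBuffering chooseBuffering(const ToyScheduledLoad &load) {
  // A load used in its own stage does not need to be pipelined at all.
  if (load.firstUseStage <= load.stage)
    return ToyBuffering::None;
  // Prefer shared memory when it is available, otherwise fall back.
  return load.canUseSharedMem ? ToyBuffering::SharedMemory
                              : ToyBuffering::Registers;
}
```

With this shape, the scheduler no longer needs to know what the lowering can handle: any load whose user sits in a later stage gets some form of buffering.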
Does this allow us to completely kill
Not yet; here the lowering is still picking the shared layout. I'll look into moving layout selection into a separate pass after the pipeliner, which should remove the need for this guy.
This change reworks the pipeliner flow in Triton. It systematizes the pipeliner transformations by making all of them part of the same SoftwarePipeliner pass, while keeping them modular and defining clear IR interfaces between them.
It also introduces a new LowerLoop transformation that aims to be a more generic lowering of async operations, written with minimal assumptions about the shape of the IR coming from the pipeline scheduling sub-pass.
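The "modular steps with clear IR interfaces" structure can be sketched as a toy model in plain C++. This is an illustration only and not the PR's implementation: `ToyOp`, `assignToyLatencies`, `scheduleToyLoop`, and `runToyPipeliner` are hypothetical, the loop is assumed to be a single linear dependence chain, and the latency rule (`numStages - 1` for loads) is an assumption made for the example. The only idea taken from the PR is that each step communicates with the next purely through annotations on the ops, and that latency assignment only touches ops that lack latency information.

```cpp
#include <cassert>
#include <string>
#include <vector>

constexpr int kUnset = -1; // marks an annotation that has not been set

// Toy op carrying the annotations that act as the "IR interface"
// between steps. All names here are hypothetical.
struct ToyOp {
  std::string name;
  bool isLoad;
  int latency; // written by step 1 (or preset by the user)
  int stage;   // written by step 2
};

// Step 1: fill in latencies only where none was provided.
// Assumption for this sketch: a load's latency is numStages - 1.
void assignToyLatencies(std::vector<ToyOp> &loop, int numStages) {
  for (ToyOp &op : loop)
    if (op.isLoad && op.latency == kUnset)
      op.latency = numStages - 1;
}

// Step 2: derive stages from latencies alone, treating the loop body
// as a linear dependence chain. It never inspects how the latencies
// were produced.
void scheduleToyLoop(std::vector<ToyOp> &loop) {
  int stage = 0;
  for (ToyOp &op : loop) {
    op.stage = stage;
    if (op.latency != kUnset)
      stage += op.latency;
  }
}

// The steps run back to back inside one driver, mirroring the idea of
// a single pass built from separable sub-steps.
void runToyPipeliner(std::vector<ToyOp> &loop, int numStages) {
  assignToyLatencies(loop, numStages);
  scheduleToyLoop(loop);
}
```

Because each step only reads and writes annotations, a test can hand-annotate a loop with stages and exercise a later step directly, which is the testability benefit mentioned in the discussion.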