[PIPELINER] Refactor pipeliner lowering. #5989


pawelszczerbuk
Contributor

This change reworks the pipeliner flow in Triton. It systematizes the pipeliner transformations by making all of them part of the same SoftwarePipeliner pass, while keeping them modular and defining clear IR interfaces between them.
It also introduces a new LowerLoop transformation that aims to be a more generic lowering of async operations, written with a minimal set of assumptions about the shape of the IR coming from the pipeline scheduling sub-pass.
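To make "separate sub-steps with clear interfaces" concrete, here is a toy C++ sketch. It is not Triton's code: `ToyOp`, `assignToyLatencies`, `scheduleToyLoop`, and `countToyInFlightValues` are hypothetical names, and a plain struct stands in for the MLIR loop body. The point it illustrates is that each step reads only what the previous step wrote (latency, then stage), so any step can be run on hand-written input.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one op in a loop body; NOT Triton's IR.
struct ToyOp {
  std::string name;
  bool isAsyncCandidate = false;  // e.g. a global load we may want to prefetch
  int latency = 0;                // written by step 1, read by step 2
  int stage = 0;                  // written by step 2, read by step 3
};

// Step 1 (toy): give each async candidate a latency of numStages - 1
// iterations, i.e. issue it as early as the stage budget allows.
void assignToyLatencies(std::vector<ToyOp>& ops, int numStages) {
  for (ToyOp& op : ops)
    if (op.isAsyncCandidate) op.latency = numStages - 1;
}

// Step 2 (toy scheduling): ops with no latency run in the last stage; an op
// with latency L is placed L stages earlier. Reads only `latency`.
void scheduleToyLoop(std::vector<ToyOp>& ops, int numStages) {
  const int lastStage = numStages - 1;
  for (ToyOp& op : ops) op.stage = std::max(0, lastStage - op.latency);
}

// Step 3 (toy lowering): assuming every consumer sits in the last stage,
// count how many iterations' worth of results are in flight. Reads only `stage`.
int countToyInFlightValues(const std::vector<ToyOp>& ops, int numStages) {
  int total = 0;
  for (const ToyOp& op : ops) total += (numStages - 1) - op.stage;
  return total;
}
```

Calling `countToyInFlightValues` on a vector whose `stage` fields were set by hand, without running the first two steps, is the kind of isolated testing of the lowering that this split is meant to enable.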

Collaborator

@Mogball Mogball left a comment


So TL;DR: this seems like a fairly coarse-grained refactor of loop scheduling + software pipelining into clearly separable steps?

// numStages) to them, trying to populate the allowed stages. This
// step will be at some point extracted to separate pass that will be run
// only for loops missing the latency information.
assignLatencies(moduleOp, numStages);
Collaborator


OOoooooooohhhh. This looks super nice!

}
// Wait until there are 0 outstanding async dot ops.
builder.setInsertionPointAfter(forOp);
auto WarpGroupDotWaitAfterLoop = builder.create<ttng::WarpGroupDotWaitOp>(
Collaborator


The long-awaited death of this file!

@pawelszczerbuk
Contributor Author

Yeah, with some improvements that make it possible to lower more or less anything you throw at it, without making assumptions about what can come out of the current scheduling.

Collaborator

@ThomasRaoux ThomasRaoux left a comment


LGTM!

@@ -0,0 +1,889 @@
// RUN: triton-opt %s -split-input-file -allow-unregistered-dialect -tritongpu-test-pipeline-lower-loop -canonicalize | FileCheck %s
Collaborator


really nice!

Collaborator

@Mogball Mogball left a comment


Impressive work. This is significantly cleaner and better layered!

@Mogball
Collaborator

Mogball commented Feb 25, 2025

Are there any specific changes you can call out that made the overall pipeliner more robust? I'm curious about them, and it's not obvious from reading the PR :P

@ThomasRaoux
Collaborator

One thing I'm very excited about in this PR is that it significantly improves the testability of the different pieces of the pipeliner. For instance, the lowering can be tested on its own, and we can test all the corner cases independently.

@pawelszczerbuk
Contributor Author

The other gap this PR closes is the lack of a fallback to pipelining in registers. Previously there was an implicit handshake between scheduling and lowering: scheduling was not supposed to generate anything that lowering couldn't pipeline in shared memory (shmem). The new lowering can always fall back to pipelining in registers, so it should be able to pipeline basically any scheduled IR that comes its way.
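A toy C++ sketch of that fallback decision, not Triton's implementation: `CrossStageValue`, `canUseAsyncCopy`, and `chooseBuffering` are hypothetical names. A value that crosses stages is multi-buffered in shared memory when it can be lowered to an async copy; otherwise, instead of requiring the scheduler never to produce it, the value is kept in registers and carried across iterations. The slot count (one per stage boundary crossed) is a simplifying assumption; a real lowering may need a different count.

```cpp
#include <cassert>
#include <string>

enum class BufferKind { None, SharedMemory, Registers };

// Hypothetical description of a value produced in one stage and used in a later one.
struct CrossStageValue {
  std::string name;
  int producerStage;
  int consumerStage;
  bool canUseAsyncCopy;  // assumption: some predicate decides shmem eligibility
};

struct BufferingDecision {
  BufferKind kind;
  int numSlots;  // toy count: one slot per stage boundary the value crosses
};

BufferingDecision chooseBuffering(const CrossStageValue& v) {
  const int distance = v.consumerStage - v.producerStage;
  if (distance <= 0) return {BufferKind::None, 0};  // same stage: nothing to pipeline
  if (v.canUseAsyncCopy) return {BufferKind::SharedMemory, distance};
  // Fallback: rather than rejecting the schedule, carry `distance`
  // copies of the value in registers as loop-carried values.
  return {BufferKind::Registers, distance};
}
```

The design point is that the lowering never has to say "no": every scheduled cross-stage value gets some buffering, so scheduling no longer needs to know which values the shared-memory path can handle.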

@lezcano
Contributor

lezcano commented Feb 25, 2025

Does this allow us to completely kill LayoutRematerialization::hoistConvertDotOperand then? That'd be nice.

@pawelszczerbuk pawelszczerbuk merged commit 852c05f into triton-lang:main Feb 26, 2025
7 checks passed
@pawelszczerbuk
Contributor Author

Not yet: here the lowering still picks the shared layout. I'll look into moving layout selection into a separate pass after the pipeliner, which should remove the need for this guy.
