-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track register pressure in SWP and PreMISched #49
Conversation
} | ||
|
||
/// Look for INSERT_SUBREG that can be rewritten as REG_SEQUENCE | ||
bool combineINSERT_SUBREG(MachineBasicBlock &MBB) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice combine pattern!
})) | ||
continue; | ||
|
||
// Find the max latency one can "move" from predecessors to successors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused at this point. Here the comment says that we are looking for max latency, but in fact we are searching for the min latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that confusing myself 😆 I essentially want to find the maximum "amount of latency" that I can move from predecessors to successors. Given that I do not want to make latencies negative, I can only subtract the min
of all predecessor latencies. I'd be happy to find a better way to rephrase that :) I can also add examples, it's mostly useful for REG_SEQUENCE
at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
common predecessor latency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could create a new edge with the effective latency for each pair of in- and out- edges and make all incoming latencies zero.
// The default policy is to avoid tracking pressure for "small regions". For | ||
// AIE, it is critical to estimate the pressure everywhere, especially small | ||
// loops. Spills are very expensive. | ||
Policy.ShouldTrackPressure = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it will be useful to have a hidden command line option disabling this? I think it can help the comparison without a rebuild, as some regression can be expected at this moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always happy to add more options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps in the form of the value for 'small' ?
if (!U.isReg() || !U.getReg().isVirtual()) | ||
continue; | ||
LaneBitmask LiveLanes = | ||
LiveRegs.contains(U.getReg()) & (~DefinedRegs.contains(U.getReg())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have a comment here saying that we are not in SSA anymore. When I see virtual regs I start to think in SSA mode, which is not the case here. I think it is just a small clarification.
return PDiff; | ||
} | ||
|
||
PressureChange getPressureChange(const PressureDiff &PD, bool FindMin = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good candidate to the target-independent part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to see if other targets could make use of it.
|
||
SmallSet<int, 8> | ||
AIE2RegisterInfo::getCoveringSubRegs(const TargetRegisterClass &RC) const { | ||
// TODO: This could be generated from TableGen by looking at MCRegisters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shocking that this doesn't exist. I guess we could also use this in spill code expansion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spill code expansion is a bit different as it already deals with physical registers
bool AIEPreRASchedStrategy::isAvailableNode(SUnit &SU, SchedBoundary &Zone, | ||
bool /*VerifyReadyCycle*/) const { | ||
// Force verifying if SU is ready to be scheduled in terms of cycle. | ||
return MachineSchedStrategy::isAvailableNode(SU, Zone, | ||
/*VerifyReadyCycle=*/true); | ||
bool Avail = MachineSchedStrategy::isAvailableNode(SU, Zone, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: const bool
// The node can be scheduled, but check if it increases the pressure too much. | ||
// If so, try to delay it until another instruction decreases the pressure. | ||
const RegPressureTracker &BotRPT = DAG->getBotRPTracker(); | ||
PressureChange WorstPC = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: const PressureChange
// Recursively traverse INSERT_SUBREG chains in a same MBB. | ||
std::function<void(const MachineInstr &)> Impl = [&](const MachineInstr &MI) { | ||
assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG); | ||
Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps have a comment with the INSERT_SUBREG signature.
return true; | ||
} | ||
|
||
unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[WorstPC.getPSet()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: const unsigned CurrPressure
} | ||
|
||
for (const SUnit *PendingSU : Zone.Pending) { | ||
PressureDiff PDiff = estimatedPressureDiff(*PendingSU, BotRPT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: const PressureDiff PDiff
for (const auto &[SubregIdx, Reg] : Subregs) { | ||
MIB.addReg(Reg).addImm(SubregIdx); | ||
} | ||
MI.eraseFromParent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this could theoretically be part of another INSERT_SUBREG chain, which would then need to recognize INSERT_SUBREG on top of the newly created REQ_SEQUENCE. Not worth it probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought about it, so far it's good enough for most benchmarks.
unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()]; | ||
unsigned Threshold = | ||
TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet()); | ||
return Threshold <= 4 || CurrPressure >= Threshold - 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this number 4 a tuning option? If yes, we could have an option to change it...
assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG); | ||
Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg()); | ||
MachineInstr &SrcMI = *MRI.getVRegDef(MI.getOperand(1).getReg()); | ||
if (SrcMI.getParent() == MI.getParent() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the basic block restriction? We're only rewriting the top INSERT_SUBREG and leave the reset to DCE. I guess it would just work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid of loop nests. I would not want to rewrite INSERT_SUBREG instructions that have different nesting levels
auto IsNearCritical = [&](const PressureChange &PC) { | ||
if (!PC.isValid()) | ||
return false; | ||
unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT:
const unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];
const unsigned Threshold...
TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet()); | ||
return Threshold <= 4 || CurrPressure >= Threshold - 4; | ||
}; | ||
PressureChange TryCandPC = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: can be const as well.
return true; | ||
} | ||
|
||
bool AIEPreRASchedStrategy::tryCandidate(SchedCandidate &Cand, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it would be possible to add a test that just runs the machine scheduler and presents an easier way to see the effects of this change. Actually, the effects can be seen only indirectly through other tests. The changed tests can give an insight for it....
|
||
// Only look at COPY and REG_SEQUENCE if requested | ||
if (OnlyCopyLike && !MI.isCopy() && | ||
MI.getOpcode() != TargetOpcode::REG_SEQUENCE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess EXTRACT_SUBREG could have the same treatment. Or whatever it is that splits registers for e.g. multi-reg store.
|
||
// Only look at COPY and REG_SEQUENCE if requested | ||
if (OnlyCopyLike && !MI.isCopy() && | ||
MI.getOpcode() != TargetOpcode::REG_SEQUENCE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, me may want to ignore cross reg-bank copies.
unsigned NumRegionInstrs) const { | ||
// The default policy is to avoid tracking pressure for "small regions". For | ||
// AIE, it is critical to estimate the pressure everywhere, especially small | ||
// loops. Spills are very expensive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, right. I guess 'small' is defined by some absolute constant that defines it to match some architecture's wishes. I guess a better interface would pass in the region and let you dynamically decide on the interesting pressure classes.
return true; | ||
} | ||
|
||
// Bias PhysReg Defs and copies to their uses and defined respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: defines, or defs
if (Pressure.MaxSetPressure[I] > Limit) { | ||
LLVM_DEBUG(dbgs() << TRI->getRegPressureSetName(I) << " Limit " << Limit | ||
<< " Actual " << Pressure.MaxSetPressure[I] << "\n"); | ||
PressureExcess = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return true immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted a chance to debug-print all the critical pressure sets
if (isNoHazardMetaInstruction(Instr->getOpcode())) { | ||
MetaInstrs.push_back(Instr); | ||
return; | ||
} | ||
// Check if the pre-condition is ensured | ||
assert(!isStandalone() && | ||
assert((!ComputeSlots || !isStandalone()) && | ||
"Tried to add an instruction in a standalone Bundle"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an intuitive feeling that we should have a corresponding change in canAdd, similar to the handling of isStandAlone()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not query canAdd
anymore in the scheduler, that's why I didn't add it. For symmetry, I can do so. Should I? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(That's a "no, don't.")
@@ -60,6 +60,9 @@ static cl::opt<bool> | |||
static cl::opt<bool> | |||
AllocateMRegsFirst("aie-mod-ra-first", cl::Hidden, cl::init(false), | |||
cl::desc("Allocate M registers first in staged RA.")); | |||
static cl::opt<bool> EnablePreMISchedCoalescer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ambiguous name, might be construed as "Coalescer running before MISched"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I rename to --aie-coalescer-after-premisched
maybe?
|
||
// Pre-RA scheduling might have exposed simplifiable copies. | ||
if (EnablePreMISchedCoalescer) | ||
addPass(&RegisterCoalescerID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you have an explicit example where it helps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does help a lot with the GEMM_bf16 kernel after all the "pressure-reducing" scheduling is done. Then this really forces greedy into allocating the same vreg and limits the number of copies. Or did you want me to add a test for RegisterCoalescer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about the option explosion and the many ways we make exceptions for particular situations like 'unknown slots'.
I would love to encode 'stand-alone' instruction by reserving a 'stand-alone' slot with a unique format, and assert that we only get correct stuff.
bf0a949
to
d095f1b
Compare
I think I have addressed most of the comments in |
@@ -337,6 +344,119 @@ DownCountLoop::Assessment DownCountLoop::accept(MachineInstr *EndLoop) { | |||
return Assessment::Accept; | |||
} | |||
|
|||
/// Get an instruction sequence from an \p SMS schedule that is estimated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: a \p SMS
I can see that this PR presents really promising results, especially for some innermost loops. It is also very positive for stack usage reduction, which shows that the pressure information is definitely important. GEMM_int8_0 is a good case to take a look at in the future (for stack). |
Discussed offline:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This is in particular useful for the SW pipeliner, as it simplifies the DAG and makes the inputs to the REG_SEQUENCE more likely to end up in the same stage.
This is mainly useful for the SW pipeliner to keep instructions like REG_SEQUENCE in the same stage as their inputs. This then helps reducing the number of COPY instructions.
This also excludes some compound reg classes from the pressure sets computation. This is because SPARSE registers are pairs of X and Q registers. As there are only 4 Q registers, this causes the pressure threshold for QX (SPARSE) registers to be pretty low, and having live X registers would essentially always cause it to be exceeded. The commit had generally a slightly negative effect on QoR, but the performance will be regained when tracking pressure more finely in a future commit.
This uses the current live registers to compute the pressure changes of candidates. If an instruction is likely to cause spills and another pending instruction can help reduce the pressure, then the former is delayed.
This can be used to disregard schedules that have e.g. too much register pressure.
If it is estimated the RA pipeline will spill, then the II is increased. This will typically increase the number of stages and the number of registers that need to be carried between stages.
This does two things: 1. Do not block a whole cycle for instructions with an unknown VLIW slot. Typically those are COPY instructions. This can be tweaked with --aie-premisched-ignore-unknown-slots=0/1 2. Track scoreboard conflicts. This can hurt by delaying instructions that require late resources due to the MachineScheduler only inserting instructions in the current cycle. On average, this brings QoR improvements though. Tweakable with --aie-premisched-fu-depth=int
639a6d2
to
b1f26fa
Compare
The point of this PR is really to pay more attention to register pressure and try our best to avoid spills. This is in particular very useful for SW pipelined loops where the reg pressure is very high.
TODO:
"unit-test" the new PreMISched feature to delay instructions until the pressure goes back downQoR looks good and we now reached our Q2 goals. Obviously if it hard to find the best set of options that work with all benchmarks, but I believe the current state strikes a good balance.
Useful customization options are:
--aie-premisched-ignore-unknown-slots=0/1
,--aie-premisched-fu-depth=int
,aie-pipeliner-track-regpressure=0/1
,--aie-premisched-coalescer=0/1
I do not know why github displays .md tables so badly, the "preview" works and adds a scrollbar, but the actual render does not. 😞In the meantime I'll keep the tables inside a code block