Track register pressure in SWP and PreMISched #49

Merged
merged 9 commits into from
May 29, 2024

Collaborator

@gbossu gbossu commented May 23, 2024

The point of this PR is to pay more attention to register pressure and do our best to avoid spills. This is particularly useful for SW pipelined loops, where the reg pressure is very high.

TODO:

  • "unit-test" the new PreMISched feature to delay instructions until the pressure goes back down

QoR looks good and we have now reached our Q2 goals. Obviously it is hard to find the best set of options that works with all benchmarks, but I believe the current state strikes a good balance.

Useful customization options are: --aie-premisched-ignore-unknown-slots=0/1, --aie-premisched-fu-depth=int, --aie-pipeliner-track-regpressure=0/1, --aie-premisched-coalescer=0/1

I do not know why GitHub displays .md tables so badly: the "preview" works and adds a scrollbar, but the actual render does not. 😞 In the meantime I'll keep the tables inside a code block.


| Core_Insn_Count                                                                      | Add2D_Standalone_0 | AvgPool2D_0   | Conv2D_0      | Conv2D_3       | Conv2D_LReLU_0 | Conv2D_ReLU_0 | Conv2D_casc_2 | GEMM_bf16_0   | GEMM_int8_0  | GEMV_0       | GlobalAvgPool2D_0 | MaxPool2D_0  | Mul2D_0      | Pad2D_0      | Average diff | Diff stdev | Quantile #1 | Quantile #2 | Quantile #3 |
| ------------------------------------------------------------------------------------ | ------------------ | ------------- | ------------- | -------------- | -------------- | ------------- | ------------- | ------------- | ------------ | ------------ | ----------------- | ------------ | ------------ | ------------ | ------------ | ---------- | ----------- | ----------- | ----------- |
| Baseline                                                                             | 3224(+0.00%)       | 3299(+0.00%)  | 10943(+0.00%) | 12337(+0.00%)  | 3094(+0.00%)   | 1716(+0.00%)  | 11935(+0.00%) | 4670(+0.00%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 3224(+0.00%)       | 3299(+0.00%)  | 10513(-3.93%) | 13111(+6.27%)  | 3064(-0.97%)   | 1706(-0.58%)  | 11359(-4.83%) | 4414(-5.48%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | -0.68%       | 2.81       | -1.71%      | +0.00%      | +0.00%      |
| Always track reg pressure in PreMISched                                              | 3224(+0.00%)       | 3299(+0.00%)  | 10617(+0.99%) | 13104(-0.05%)  | 3075(+0.36%)   | 1719(+0.76%)  | 11351(-0.07%) | 4414(+0.00%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | +0.14%       | 0.33       | +0.00%      | +0.00%      | +0.09%      |
| More accurate PressureChange computation + delay instructions likely to cause spills | 3225(+0.03%)       | 2840(-13.91%) | 10378(-2.25%) | 10361(-20.93%) | 3093(+0.59%)   | 1704(-0.87%)  | 11221(-1.15%) | 4326(-1.99%)  | 4253(-1.25%) | 656(-4.79%)  | 477(-2.65%)       | 2483(+0.00%) | 1638(-0.30%) | 2189(-0.05%) | -3.54%       | 6.20       | -3.19%      | -1.20%      | -0.03%      |
| Estimate RegPressure in SWP and increase II if necessary                             | 3225(+0.00%)       | 2840(+0.00%)  | 10378(+0.00%) | 9610(-7.25%)   | 3093(+0.00%)   | 1704(+0.00%)  | 11221(+0.00%) | 4326(+0.00%)  | 4253(+0.00%) | 656(+0.00%)  | 477(+0.00%)       | 2483(+0.00%) | 1638(+0.00%) | 2189(+0.00%) | -0.52%       | 1.94       | +0.00%      | +0.00%      | +0.00%      |
| Do not block a whole cycle for instrs with an unknown slot                           | 3225(+0.00%)       | 2840(+0.00%)  | 10387(+0.09%) | 9206(-4.20%)   | 3098(+0.16%)   | 1695(-0.53%)  | 10600(-5.53%) | 4638(+7.21%)  | 4393(+3.29%) | 651(-0.76%)  | 478(+0.21%)       | 2483(+0.00%) | 1689(+3.11%) | 2190(+0.05%) | +0.22%       | 3.05       | -0.59%      | +0.02%      | +0.94%      |
| Model resource conflicts in PreMISched                                               | 3229(+0.12%)       | 2811(-1.02%)  | 10473(+0.83%) | 9207(+0.01%)   | 3105(+0.23%)   | 1704(+0.53%)  | 10646(+0.43%) | 4528(-2.37%)  | 4347(-1.05%) | 650(-0.15%)  | 479(+0.21%)       | 2449(-1.37%) | 1686(-0.18%) | 2184(-0.27%) | -0.29%       | 0.87       | -1.03%      | -0.07%      | +0.28%      |
| Run coalescer again after PreMISched                                                 | 3229(+0.00%)       | 2811(+0.00%)  | 10543(+0.67%) | 8839(-4.00%)   | 3095(-0.32%)   | 1704(+0.00%)  | 10622(-0.23%) | 4160(-8.13%)  | 4347(+0.00%) | 650(+0.00%)  | 479(+0.00%)       | 2449(+0.00%) | 1686(+0.00%) | 2184(+0.00%) | -0.86%       | 2.36       | -0.25%      | +0.00%      | +0.00%      |
| Total diff                                                                           | REGR(+0.16%)       | IMPR(-14.79%) | IMPR(-3.66%)  | IMPR(-28.35%)  | SAME(+0.03%)   | IMPR(-0.70%)  | IMPR(-11.00%) | IMPR(-10.92%) | REGR(+0.93%) | IMPR(-5.66%) | IMPR(-2.24%)      | IMPR(-1.37%) | REGR(+2.62%) | IMPR(-0.27%) | -5.37%       | 8.41       | -10.94%     | -1.81%      | +0.06%      |


| Innermost loop cycles                                                                | GlobalAvgPool2D_0 | Conv2D_casc_2 | GEMM_bf16_0 | GEMM_int8_0 | Add2D_Standalone_0 | GEMV_0 | Mul2D_0 | AvgPool2D_0 | Pad2D_0 | MaxPool2D_0 | Conv2D_0 | Conv2D_3 | Conv2D_ReLU_0 | Conv2D_LReLU_0 |
| ------------------------------------------------------------------------------------ | ----------------- | ------------- | ----------- | ----------- | ------------------ | ------ | ------- | ----------- | ------- | ----------- | -------- | -------- | ------------- | -------------- |
| Baseline                                                                             | 18                | 15            | 24          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 11       | 22       | 11            | 11             |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 18                | 14            | 21          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 10       | 24       | 10            | 10             |
| Always track reg pressure in PreMISched                                              | 18                | 14            | 21          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 11       | 24       | 11            | 11             |
| More accurate PressureChange computation + delay instructions likely to cause spills | 18                | 14            | 21          | 42          | 43                 | 45     | 82      | 77          | 65      | 63          | 11       | 17       | 11            | 11             |
| Estimate RegPressure in SWP and increase II if necessary                             | 18                | 14            | 21          | 42          | 43                 | 45     | 82      | 77          | 65      | 63          | 11       | 15       | 11            | 11             |
| Do not block a whole cycle for instrs with an unknown slot                           | 18                | 13            | 23          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 14       | 10            | 10             |
| Model resource conflicts in PreMISched                                               | 18                | 13            | 23          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 14       | 10            | 10             |
| Run coalescer again after PreMISched                                                 | 18                | 13            | 18          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 13       | 10            | 10             |


| Core_StackSize                                                                       | Add2D_Standalone_0 | AvgPool2D_0   | Conv2D_0      | Conv2D_3      | Conv2D_LReLU_0 | Conv2D_ReLU_0 | Conv2D_casc_2 | GEMM_bf16_0  | GEMM_int8_0   | GEMV_0        | GlobalAvgPool2D_0 | MaxPool2D_0  | Mul2D_0      | Pad2D_0      | Average diff | Diff stdev | Quantile #1 | Quantile #2 | Quantile #3 |
| ------------------------------------------------------------------------------------ | ------------------ | ------------- | ------------- | ------------- | -------------- | ------------- | ------------- | ------------ | ------------- | ------------- | ----------------- | ------------ | ------------ | ------------ | ------------ | ---------- | ----------- | ----------- | ----------- |
| Baseline                                                                             | 416(+0.00%)        | 704(+0.00%)   | 896(+0.00%)   | 1152(+0.00%)  | 864(+0.00%)    | 864(+0.00%)   | 448(+0.00%)   | 608(+0.00%)  | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 416(+0.00%)        | 704(+0.00%)   | 448(-50.00%)  | 960(-16.67%)  | 416(-51.85%)   | 416(-51.85%)  | 384(-14.29%)  | 736(+21.05%) | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -11.69%      | 23.02      | -25.00%     | +0.00%      | +0.00%      |
| Always track reg pressure in PreMISched                                              | 416(+0.00%)        | 704(+0.00%)   | 448(+0.00%)   | 960(+0.00%)   | 416(+0.00%)    | 416(+0.00%)   | 384(+0.00%)   | 736(+0.00%)  | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| More accurate PressureChange computation + delay instructions likely to cause spills | 416(+0.00%)        | 256(-63.64%)  | 448(+0.00%)   | 704(-26.67%)  | 416(+0.00%)    | 416(+0.00%)   | 256(-33.33%)  | 608(-17.39%) | 608(+26.67%)  | 224(-36.36%)  | 320(-37.50%)      | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -13.44%      | 23.34      | -34.09%     | +0.00%      | +0.00%      |
| Estimate RegPressure in SWP and increase II if necessary                             | 416(+0.00%)        | 256(+0.00%)   | 320(-28.57%)  | 416(-40.91%)  | 288(-30.77%)   | 288(-30.77%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -9.36%       | 15.58      | -29.12%     | +0.00%      | +0.00%      |
| Do not block a whole cycle for instrs with an unknown slot                           | 416(+0.00%)        | 256(+0.00%)   | 256(-20.00%)  | 352(-15.38%)  | 256(-11.11%)   | 224(-22.22%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -4.91%       | 8.40       | -12.18%     | +0.00%      | +0.00%      |
| Model resource conflicts in PreMISched                                               | 416(+0.00%)        | 256(+0.00%)   | 256(+0.00%)   | 384(+9.09%)   | 256(+0.00%)    | 224(+0.00%)   | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.65%       | 2.43       | +0.00%      | +0.00%      | +0.00%      |
| Run coalescer again after PreMISched                                                 | 416(+0.00%)        | 256(+0.00%)   | 320(+25.00%)  | 448(+16.67%)  | 320(+25.00%)   | 288(+28.57%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +6.80%       | 11.42      | +0.00%      | +0.00%      | +18.75%     |
| Total diff                                                                           | SAME(+0.00%)       | IMPR(-63.64%) | IMPR(-64.29%) | IMPR(-61.11%) | IMPR(-62.96%)  | IMPR(-66.67%) | IMPR(-42.86%) | SAME(+0.00%) | REGR(+26.67%) | IMPR(-36.36%) | IMPR(-37.50%)     | SAME(+0.00%) | SAME(+0.00%) | SAME(+0.00%) | -29.19%      | 32.43      | -63.13%     | -36.93%     | +0.00%      |

}

/// Look for INSERT_SUBREG that can be rewritten as REG_SEQUENCE
bool combineINSERT_SUBREG(MachineBasicBlock &MBB) {
Collaborator

Nice combine pattern!

}))
continue;

// Find the max latency one can "move" from predecessors to successors
Collaborator

I'm a bit confused at this point. Here the comment says that we are looking for max latency, but in fact we are searching for the min latency.

Collaborator Author

I found that confusing myself 😆 I essentially want to find the maximum "amount of latency" that I can move from predecessors to successors. Given that I do not want to make latencies negative, I can only subtract the min of all predecessor latencies. I'd be happy to find a better way to rephrase that :) I can also add examples, it's mostly useful for REG_SEQUENCE at this point.
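To make the "maximum amount that can be moved is the minimum of the predecessor latencies" point concrete, here is a minimal standalone sketch. It does not use the LLVM `SUnit`/`SDep` API; `Edge`, `movableLatency` and `moveLatency` are hypothetical names chosen for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for a DAG edge; only the latency matters here.
struct Edge {
  int Latency;
};

// The largest latency that can be taken from every incoming edge without
// making any of them negative is the minimum incoming latency.
int movableLatency(const std::vector<Edge> &Preds) {
  if (Preds.empty())
    return 0;
  int Min = Preds.front().Latency;
  for (const Edge &E : Preds)
    Min = std::min(Min, E.Latency);
  return Min;
}

// Subtract that amount from all incoming edges and add it to all outgoing
// edges. The length of every pred -> node -> succ path is unchanged.
void moveLatency(std::vector<Edge> &Preds, std::vector<Edge> &Succs) {
  const int Delta = movableLatency(Preds);
  for (Edge &E : Preds)
    E.Latency -= Delta;
  for (Edge &E : Succs)
    E.Latency += Delta;
}
```

With incoming latencies {3, 5} and one outgoing latency {1}, three cycles can be moved, which leaves {0, 2} incoming and {4} outgoing. Both paths keep their original lengths (4 and 6).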

Collaborator

common predecessor latency?

Collaborator

Alternatively, we could create a new edge with the effective latency for each pair of in- and out- edges and make all incoming latencies zero.

// The default policy is to avoid tracking pressure for "small regions". For
// AIE, it is critical to estimate the pressure everywhere, especially small
// loops. Spills are very expensive.
Policy.ShouldTrackPressure = true;
Collaborator

Do you think it will be useful to have a hidden command line option disabling this? I think it can help the comparison without a rebuild, as some regression can be expected at this moment.

Collaborator Author

I'm always happy to add more options

Collaborator

Perhaps in the form of the value for 'small' ?

if (!U.isReg() || !U.getReg().isVirtual())
continue;
LaneBitmask LiveLanes =
LiveRegs.contains(U.getReg()) & (~DefinedRegs.contains(U.getReg()));
Collaborator

I think it would be nice to have a comment here saying that we are not in SSA anymore. When I see virtual regs I start to think in SSA mode, which is not the case here. I think it is just a small clarification.
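As a small illustration of why the non-SSA remark matters, the mask arithmetic in the snippet above can be reduced to the toy below. It uses a plain integer instead of LLVM's `LaneBitmask`, and `liveLanesNotDefined` is a hypothetical name; the reading of the two masks is an assumption based on the quoted line only.

```cpp
#include <cassert>
#include <cstdint>

// Toy lane mask: one bit per sub-register lane of a virtual register.
using LaneMask = std::uint32_t;

// Lanes of a register that are live below an instruction and are not
// (re)defined by that instruction. Outside SSA, the same virtual register can
// be written lane by lane by several instructions, so both masks may be
// non-empty for a single register.
LaneMask liveLanesNotDefined(LaneMask LiveBelow, LaneMask DefinedHere) {
  return LiveBelow & ~DefinedHere;
}
```

In SSA form a virtual register has a single def, so a register could not be both partially live and partially redefined like this.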

return PDiff;
}

PressureChange getPressureChange(const PressureDiff &PD, bool FindMin = true) {
Collaborator

I think this is a good candidate for the target-independent part.

Collaborator Author

I'll try to see if other targets could make use of it.


SmallSet<int, 8>
AIE2RegisterInfo::getCoveringSubRegs(const TargetRegisterClass &RC) const {
// TODO: This could be generated from TableGen by looking at MCRegisters.
Collaborator

Shocking that this doesn't exist. I guess we could also use this in spill code expansion?

Collaborator Author

Spill code expansion is a bit different as it already deals with physical registers

bool AIEPreRASchedStrategy::isAvailableNode(SUnit &SU, SchedBoundary &Zone,
bool /*VerifyReadyCycle*/) const {
// Force verifying if SU is ready to be scheduled in terms of cycle.
return MachineSchedStrategy::isAvailableNode(SU, Zone,
/*VerifyReadyCycle=*/true);
bool Avail = MachineSchedStrategy::isAvailableNode(SU, Zone,
Collaborator

@andcarminati andcarminati May 27, 2024

NIT: const bool

// The node can be scheduled, but check if it increases the pressure too much.
// If so, try to delay it until another instruction decreases the pressure.
const RegPressureTracker &BotRPT = DAG->getBotRPTracker();
PressureChange WorstPC =
Collaborator

@andcarminati andcarminati May 27, 2024

NIT: const PressureChange

// Recursively traverse INSERT_SUBREG chains in a same MBB.
std::function<void(const MachineInstr &)> Impl = [&](const MachineInstr &MI) {
assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG);
Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg());
Collaborator

Perhaps have a comment with the INSERT_SUBREG signature.
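For reference, the generic form is `%Dst = INSERT_SUBREG %Base, %Inserted, SubIdx`, which is why the snippet reads the index from operand 3 and the inserted register from operand 2. The toy below models the chain walk with plain integers instead of `MachineInstr`; `InsertSubreg` and `collectSubregs` are hypothetical names, and the walk is iterative where the PR uses a recursive lambda.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model of `%Dst = INSERT_SUBREG %Base, %Inserted, SubIdx`.
// Registers are plain ints; Base == -1 marks the start of the chain
// (for example an IMPLICIT_DEF in real MIR).
struct InsertSubreg {
  int Dst;
  int Base;
  int Inserted;
  int SubIdx;
};

// Walk the chain from the last INSERT_SUBREG back to its start and record, for
// each sub-register index, the register that ends up in that lane. emplace
// does not overwrite, so the write closest to the end of the chain wins.
std::map<int, int> collectSubregs(const std::vector<InsertSubreg> &Defs,
                                  int LastDst) {
  std::map<int, int> Subregs;
  int Cur = LastDst;
  while (Cur != -1) {
    const InsertSubreg *Def = nullptr;
    for (const InsertSubreg &I : Defs)
      if (I.Dst == Cur)
        Def = &I;
    if (!Def)
      break; // Not defined by an INSERT_SUBREG: end of the chain.
    Subregs.emplace(Def->SubIdx, Def->Inserted);
    Cur = Def->Base;
  }
  return Subregs;
}
```

The resulting (index, register) pairs correspond to the operands that would be fed to a REG_SEQUENCE. A real combine presumably also has to check that the collected indices cover the whole register, which this toy skips.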

return true;
}

unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[WorstPC.getPSet()];
Collaborator

NIT: const unsigned CurrPressure

}

for (const SUnit *PendingSU : Zone.Pending) {
PressureDiff PDiff = estimatedPressureDiff(*PendingSU, BotRPT);
Collaborator

NIT: const PressureDiff PDiff

for (const auto &[SubregIdx, Reg] : Subregs) {
MIB.addReg(Reg).addImm(SubregIdx);
}
MI.eraseFromParent();
Collaborator

I guess this could theoretically be part of another INSERT_SUBREG chain, which would then need to recognize INSERT_SUBREG on top of the newly created REG_SEQUENCE. Not worth it probably.

Collaborator Author

I also thought about it, so far it's good enough for most benchmarks.

unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];
unsigned Threshold =
TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet());
return Threshold <= 4 || CurrPressure >= Threshold - 4;
Collaborator

Is this number 4 a tuning option? If yes, we could have an option to change it...
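A sketch of what a tunable version of the check might look like, written as a free function so it can be tried in isolation. `isNearCritical` mirrors the lambda in the snippet, but the `Margin` parameter is hypothetical (the PR hard-codes 4), and how it would be exposed as an option is left open.

```cpp
#include <cassert>

// Returns true when the current pressure is within `Margin` units of the
// limit. `Margin` is a hypothetical knob; the PR hard-codes 4.
// The `Limit <= Margin` test must come first: with unsigned arithmetic,
// `Limit - Margin` would otherwise wrap around to a huge value.
bool isNearCritical(unsigned CurrPressure, unsigned Limit,
                    unsigned Margin = 4) {
  return Limit <= Margin || CurrPressure >= Limit - Margin;
}
```

With a limit of 16, a pressure of 12 counts as near critical and 11 does not. A pressure set with a limit of 3 is always treated as near critical, which is what the `Threshold <= 4` guard in the original achieves.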

assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG);
Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg());
MachineInstr &SrcMI = *MRI.getVRegDef(MI.getOperand(1).getReg());
if (SrcMI.getParent() == MI.getParent() &&
Collaborator

Why the basic block restriction? We're only rewriting the top INSERT_SUBREG and leave the rest to DCE. I guess it would just work.

Collaborator Author

I'm afraid of loop nests. I would not want to rewrite INSERT_SUBREG instructions that have different nesting levels

auto IsNearCritical = [&](const PressureChange &PC) {
if (!PC.isValid())
return false;
unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];
Collaborator

NIT:

const unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];
const unsigned Threshold...

TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet());
return Threshold <= 4 || CurrPressure >= Threshold - 4;
};
PressureChange TryCandPC =
Collaborator

NIT: can be const as well.

return true;
}

bool AIEPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
Collaborator

I was wondering if it would be possible to add a test that just runs the machine scheduler and presents an easier way to see the effects of this change. Actually, the effects can be seen only indirectly through other tests. The changed tests can give an insight for it....
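Short of a scheduler-only test, the intended effect can at least be described with a toy model of the delay rule: an instruction that would push the pressure over the limit is held back if some pending instruction would lower the pressure. The sketch collapses the per-pressure-set `PressureChange` logic into a single counter, and `Candidate` and `shouldDelay` are hypothetical names, so it is an approximation of the idea and not the code under review.

```cpp
#include <cassert>
#include <vector>

// Toy candidate: the estimated change in pressure (in register units) if the
// instruction were scheduled now. Negative means it frees registers.
struct Candidate {
  int PressureDelta;
};

// Delay `Cand` when (a) scheduling it would push the pressure past the limit
// and (b) some pending instruction would lower the pressure first.
bool shouldDelay(const Candidate &Cand, int CurrPressure, int Limit,
                 const std::vector<Candidate> &Pending) {
  if (CurrPressure + Cand.PressureDelta <= Limit)
    return false; // No spill expected: schedule it.
  for (const Candidate &P : Pending)
    if (P.PressureDelta < 0)
      return true; // Something pending can relieve the pressure.
  return false;    // Nothing would help, delaying only wastes cycles.
}
```

At pressure 14 with a limit of 16, a candidate adding 3 units is delayed only if the pending list contains an instruction with a negative delta.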


// Only look at COPY and REG_SEQUENCE if requested
if (OnlyCopyLike && !MI.isCopy() &&
MI.getOpcode() != TargetOpcode::REG_SEQUENCE)
Collaborator

I guess EXTRACT_SUBREG could have the same treatment. Or whatever it is that splits registers for e.g. multi-reg store.


// Only look at COPY and REG_SEQUENCE if requested
if (OnlyCopyLike && !MI.isCopy() &&
MI.getOpcode() != TargetOpcode::REG_SEQUENCE)
Collaborator

Also, we may want to ignore cross reg-bank copies.

unsigned NumRegionInstrs) const {
// The default policy is to avoid tracking pressure for "small regions". For
// AIE, it is critical to estimate the pressure everywhere, especially small
// loops. Spills are very expensive.
Collaborator

@martien-de-jong martien-de-jong May 27, 2024

Yeah, right. I guess 'small' is defined by some absolute constant that defines it to match some architecture's wishes. I guess a better interface would pass in the region and let you dynamically decide on the interesting pressure classes.

return true;
}

// Bias PhysReg Defs and copies to their uses and defined respectively.
Collaborator

nit: defines, or defs

if (Pressure.MaxSetPressure[I] > Limit) {
LLVM_DEBUG(dbgs() << TRI->getRegPressureSetName(I) << " Limit " << Limit
<< " Actual " << Pressure.MaxSetPressure[I] << "\n");
PressureExcess = true;
Collaborator

return true immediately?

Collaborator Author

I just wanted a chance to debug-print all the critical pressure sets
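The trade-off can be seen in a standalone version of the loop: scanning every set costs a few extra iterations but yields the full list of offenders. This is a sketch with plain vectors instead of the LLVM pressure structures, and `setsOverLimit` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices of every pressure set whose maximum pressure exceeds its
// limit. Scanning all sets (instead of stopping at the first hit) is what lets
// a debug build report each critical set.
std::vector<std::size_t>
setsOverLimit(const std::vector<unsigned> &MaxSetPressure,
              const std::vector<unsigned> &Limits) {
  assert(MaxSetPressure.size() == Limits.size());
  std::vector<std::size_t> Over;
  for (std::size_t I = 0; I < MaxSetPressure.size(); ++I)
    if (MaxSetPressure[I] > Limits[I])
      Over.push_back(I); // A debug build would print set I here.
  return Over;
}
```

The boolean the original computes is simply whether the returned list is non-empty.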

if (isNoHazardMetaInstruction(Instr->getOpcode())) {
MetaInstrs.push_back(Instr);
return;
}
// Check if the pre-condition is ensured
assert(!isStandalone() &&
assert((!ComputeSlots || !isStandalone()) &&
"Tried to add an instruction in a standalone Bundle");
Collaborator

I have an intuitive feeling that we should have a corresponding change in canAdd, similar to the handling of isStandAlone()

Collaborator Author

We do not query canAdd anymore in the scheduler, that's why I didn't add it. For symmetry, I can do so. Should I? :)

Collaborator

(That's a "no, don't.")

@@ -60,6 +60,9 @@ static cl::opt<bool>
static cl::opt<bool>
AllocateMRegsFirst("aie-mod-ra-first", cl::Hidden, cl::init(false),
cl::desc("Allocate M registers first in staged RA."));
static cl::opt<bool> EnablePreMISchedCoalescer(
Collaborator

nit: ambiguous name, might be construed as "Coalescer running before MISched"

Collaborator Author

Should I rename to --aie-coalescer-after-premisched maybe?


// Pre-RA scheduling might have exposed simplifiable copies.
if (EnablePreMISchedCoalescer)
addPass(&RegisterCoalescerID);
Collaborator

Would you have an explicit example where it helps?

Collaborator Author

This does help a lot with the GEMM_bf16 kernel after all the "pressure-reducing" scheduling is done. Then this really forces greedy into allocating the same vreg and limits the number of copies. Or did you want me to add a test for RegisterCoalescer?

Collaborator

@martien-de-jong martien-de-jong left a comment

I'm worried about the option explosion and the many ways we make exceptions for particular situations like 'unknown slots'.
I would love to encode 'stand-alone' instruction by reserving a 'stand-alone' slot with a unique format, and assert that we only get correct stuff.

@gbossu gbossu closed this May 28, 2024
@gbossu gbossu force-pushed the gaetan.pipeline.regpressure branch from bf0a949 to d095f1b Compare May 28, 2024 10:04
@gbossu gbossu reopened this May 28, 2024
@gbossu
Collaborator Author

gbossu commented May 28, 2024

I think I have addressed most of the comments in !fixup commits, please have a look @martien-de-jong @andcarminati :)

@@ -337,6 +344,119 @@ DownCountLoop::Assessment DownCountLoop::accept(MachineInstr *EndLoop) {
return Assessment::Accept;
}

/// Get an instruction sequence from an \p SMS schedule that is estimated
Collaborator

NIT: a \p SMS

@andcarminati
Collaborator

I can see that this PR presents really promising results, especially for some innermost loops. It is also very positive for stack usage reduction, which shows that the pressure information is definitely important. GEMM_int8_0 is a good case to take a look at in the future (for stack).

@gbossu
Collaborator Author

gbossu commented May 29, 2024

> I'm worried about the option explosion and the many ways we make exceptions for particular situations like 'unknown slots'. I would love to encode 'stand-alone' instruction by reserving a 'stand-alone' slot with a unique format, and assert that we only get correct stuff.

Discussed offline:

  • Regarding the options, this is on purpose so we can quickly tweak the modelling without recompiling. I moved the computation for some of those options into the constructor of AIEHazardRecognizer so that it is now done in a single place.
  • Regarding the "unknown slots", this is indeed changing one exception (consider those instructions as standalone, i.e. they can't be bundled with other instructions) for another (allowing them to be added to any Bundle). I think this also shows that AIEBundle isn't right anymore for the purpose of the MachineScheduler (at least the premisched where there are still "generic" instructions). I think we'll converge on something better as we really start to work on the premisched.

Collaborator

@andcarminati andcarminati left a comment

LGTM.

gbossu added 9 commits May 29, 2024 13:50
This is in particular useful for the SW pipeliner, as it simplifies the
DAG and makes the inputs to the REG_SEQUENCE more likely to end up in
the same stage.
This is mainly useful for the SW pipeliner to keep instructions like
REG_SEQUENCE in the same stage as their inputs. This then helps reduce
the number of COPY instructions.
This also excludes some compound reg classes from the pressure sets
computation. This is because SPARSE registers are pairs of X and Q
registers. As there are only 4 Q registers, this causes the pressure
threshold for QX (SPARSE) registers to be pretty low, and having live X
registers would essentially always cause it to be exceeded.

This commit generally had a slightly negative effect on QoR, but the
performance will be regained when tracking pressure more finely in
a future commit.
This uses the current live registers to compute the pressure changes of
candidates. If an instruction is likely to cause spills and another
pending instruction can help reduce the pressure, then the former is
delayed.
This can be used to disregard schedules that have e.g. too much register
pressure.
If it is estimated the RA pipeline will spill, then the II is increased.
This will typically increase the number of stages and the number of
registers that need to be carried between stages.
This does two things:
1. Do not block a whole cycle for instructions with an unknown VLIW
   slot. Typically those are COPY instructions.
   This can be tweaked with --aie-premisched-ignore-unknown-slots=0/1

2. Track scoreboard conflicts. This can hurt by delaying instructions
   that require late resources due to the MachineScheduler only
   inserting instructions in the current cycle. On average, this brings
   QoR improvements though.
   Tweakable with --aie-premisched-fu-depth=int
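The "increase the II if the RA pipeline is expected to spill" step from the commit messages above can be pictured as the search loop below. This is a sketch under stated assumptions, not the pipeliner's code: `pickII` and the `EstimatePressure` callback are hypothetical, and the callback stands in for "build a schedule at this II and estimate its register pressure".

```cpp
#include <cassert>
#include <functional>

// Try initiation intervals from MinII upwards and return the first one whose
// estimated maximum pressure fits under the limit, or -1 if none does up to
// MaxII. `EstimatePressure` is a hypothetical callback standing in for
// "schedule at this II and measure the pressure of the result".
int pickII(int MinII, int MaxII, unsigned Limit,
           const std::function<unsigned(int)> &EstimatePressure) {
  for (int II = MinII; II <= MaxII; ++II)
    if (EstimatePressure(II) <= Limit)
      return II;
  return -1;
}
```

The loop accepts the smallest II that is not expected to spill, so a schedule with a lower II is only rejected when its pressure estimate exceeds the limit.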
@gbossu gbossu force-pushed the gaetan.pipeline.regpressure branch from 639a6d2 to b1f26fa Compare May 29, 2024 12:52
@gbossu gbossu merged commit dd36baf into aie-public May 29, 2024
7 of 8 checks passed
@gbossu gbossu deleted the gaetan.pipeline.regpressure branch May 29, 2024 15:41