Track register pressure in SWP and PreMISched #49

gbossu · 2024-05-23T09:10:27Z

The point of this PR is to pay more attention to register pressure and do our best to avoid spills. This is particularly useful for SW-pipelined loops, where register pressure is very high.

TODO:

~~"unit-test" the new PreMISched feature to delay instructions until the pressure goes back down~~

QoR looks good and we have now reached our Q2 goals. Obviously it is hard to find the best set of options that works for all benchmarks, but I believe the current state strikes a good balance.

Useful customization options are: --aie-premisched-ignore-unknown-slots=0/1, --aie-premisched-fu-depth=int, --aie-pipeliner-track-regpressure=0/1, --aie-premisched-coalescer=0/1

I do not know why GitHub displays .md tables so badly: the "preview" works and adds a scrollbar, but the actual render does not. 😞 In the meantime I'll keep the tables inside a code block.


| Core_Insn_Count                                                                      | Add2D_Standalone_0 | AvgPool2D_0   | Conv2D_0      | Conv2D_3       | Conv2D_LReLU_0 | Conv2D_ReLU_0 | Conv2D_casc_2 | GEMM_bf16_0   | GEMM_int8_0  | GEMV_0       | GlobalAvgPool2D_0 | MaxPool2D_0  | Mul2D_0      | Pad2D_0      | Average diff | Diff stdev | Quantile #1 | Quantile #2 | Quantile #3 |
| ------------------------------------------------------------------------------------ | ------------------ | ------------- | ------------- | -------------- | -------------- | ------------- | ------------- | ------------- | ------------ | ------------ | ----------------- | ------------ | ------------ | ------------ | ------------ | ---------- | ----------- | ----------- | ----------- |
| Baseline                                                                             | 3224(+0.00%)       | 3299(+0.00%)  | 10943(+0.00%) | 12337(+0.00%)  | 3094(+0.00%)   | 1716(+0.00%)  | 11935(+0.00%) | 4670(+0.00%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 3224(+0.00%)       | 3299(+0.00%)  | 10513(-3.93%) | 13111(+6.27%)  | 3064(-0.97%)   | 1706(-0.58%)  | 11359(-4.83%) | 4414(-5.48%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | -0.68%       | 2.81       | -1.71%      | +0.00%      | +0.00%      |
| Always track reg pressure in PreMISched                                              | 3224(+0.00%)       | 3299(+0.00%)  | 10617(+0.99%) | 13104(-0.05%)  | 3075(+0.36%)   | 1719(+0.76%)  | 11351(-0.07%) | 4414(+0.00%)  | 4307(+0.00%) | 689(+0.00%)  | 490(+0.00%)       | 2483(+0.00%) | 1643(+0.00%) | 2190(+0.00%) | +0.14%       | 0.33       | +0.00%      | +0.00%      | +0.09%      |
| More accurate PressureChange computation + delay instructions likely to cause spills | 3225(+0.03%)       | 2840(-13.91%) | 10378(-2.25%) | 10361(-20.93%) | 3093(+0.59%)   | 1704(-0.87%)  | 11221(-1.15%) | 4326(-1.99%)  | 4253(-1.25%) | 656(-4.79%)  | 477(-2.65%)       | 2483(+0.00%) | 1638(-0.30%) | 2189(-0.05%) | -3.54%       | 6.20       | -3.19%      | -1.20%      | -0.03%      |
| Estimate RegPressure in SWP and increase II if necessary                             | 3225(+0.00%)       | 2840(+0.00%)  | 10378(+0.00%) | 9610(-7.25%)   | 3093(+0.00%)   | 1704(+0.00%)  | 11221(+0.00%) | 4326(+0.00%)  | 4253(+0.00%) | 656(+0.00%)  | 477(+0.00%)       | 2483(+0.00%) | 1638(+0.00%) | 2189(+0.00%) | -0.52%       | 1.94       | +0.00%      | +0.00%      | +0.00%      |
| Do not block a whole cycle for instrs with an unknown slot                           | 3225(+0.00%)       | 2840(+0.00%)  | 10387(+0.09%) | 9206(-4.20%)   | 3098(+0.16%)   | 1695(-0.53%)  | 10600(-5.53%) | 4638(+7.21%)  | 4393(+3.29%) | 651(-0.76%)  | 478(+0.21%)       | 2483(+0.00%) | 1689(+3.11%) | 2190(+0.05%) | +0.22%       | 3.05       | -0.59%      | +0.02%      | +0.94%      |
| Model resource conflicts in PreMISched                                               | 3229(+0.12%)       | 2811(-1.02%)  | 10473(+0.83%) | 9207(+0.01%)   | 3105(+0.23%)   | 1704(+0.53%)  | 10646(+0.43%) | 4528(-2.37%)  | 4347(-1.05%) | 650(-0.15%)  | 479(+0.21%)       | 2449(-1.37%) | 1686(-0.18%) | 2184(-0.27%) | -0.29%       | 0.87       | -1.03%      | -0.07%      | +0.28%      |
| Run coalescer again after PreMISched                                                 | 3229(+0.00%)       | 2811(+0.00%)  | 10543(+0.67%) | 8839(-4.00%)   | 3095(-0.32%)   | 1704(+0.00%)  | 10622(-0.23%) | 4160(-8.13%)  | 4347(+0.00%) | 650(+0.00%)  | 479(+0.00%)       | 2449(+0.00%) | 1686(+0.00%) | 2184(+0.00%) | -0.86%       | 2.36       | -0.25%      | +0.00%      | +0.00%      |
| Total diff                                                                           | REGR(+0.16%)       | IMPR(-14.79%) | IMPR(-3.66%)  | IMPR(-28.35%)  | SAME(+0.03%)   | IMPR(-0.70%)  | IMPR(-11.00%) | IMPR(-10.92%) | REGR(+0.93%) | IMPR(-5.66%) | IMPR(-2.24%)      | IMPR(-1.37%) | REGR(+2.62%) | IMPR(-0.27%) | -5.37%       | 8.41       | -10.94%     | -1.81%      | +0.06%      |


| Innermost loop cycles                                                                | GlobalAvgPool2D_0 | Conv2D_casc_2 | GEMM_bf16_0 | GEMM_int8_0 | Add2D_Standalone_0 | GEMV_0 | Mul2D_0 | AvgPool2D_0 | Pad2D_0 | MaxPool2D_0 | Conv2D_0 | Conv2D_3 | Conv2D_ReLU_0 | Conv2D_LReLU_0 |
| ------------------------------------------------------------------------------------ | ----------------- | ------------- | ----------- | ----------- | ------------------ | ------ | ------- | ----------- | ------- | ----------- | -------- | -------- | ------------- | -------------- |
| Baseline                                                                             | 18                | 15            | 24          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 11       | 22       | 11            | 11             |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 18                | 14            | 21          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 10       | 24       | 10            | 10             |
| Always track reg pressure in PreMISched                                              | 18                | 14            | 21          | 42          | 43                 | 51     | 82      | 96          | 65      | 63          | 11       | 24       | 11            | 11             |
| More accurate PressureChange computation + delay instructions likely to cause spills | 18                | 14            | 21          | 42          | 43                 | 45     | 82      | 77          | 65      | 63          | 11       | 17       | 11            | 11             |
| Estimate RegPressure in SWP and increase II if necessary                             | 18                | 14            | 21          | 42          | 43                 | 45     | 82      | 77          | 65      | 63          | 11       | 15       | 11            | 11             |
| Do not block a whole cycle for instrs with an unknown slot                           | 18                | 13            | 23          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 14       | 10            | 10             |
| Model resource conflicts in PreMISched                                               | 18                | 13            | 23          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 14       | 10            | 10             |
| Run coalescer again after PreMISched                                                 | 18                | 13            | 18          | 42          | 43                 | 45     | 85      | 77          | 65      | 63          | 10       | 13       | 10            | 10             |


| Core_StackSize                                                                       | Add2D_Standalone_0 | AvgPool2D_0   | Conv2D_0      | Conv2D_3      | Conv2D_LReLU_0 | Conv2D_ReLU_0 | Conv2D_casc_2 | GEMM_bf16_0  | GEMM_int8_0   | GEMV_0        | GlobalAvgPool2D_0 | MaxPool2D_0  | Mul2D_0      | Pad2D_0      | Average diff | Diff stdev | Quantile #1 | Quantile #2 | Quantile #3 |
| ------------------------------------------------------------------------------------ | ------------------ | ------------- | ------------- | ------------- | -------------- | ------------- | ------------- | ------------ | ------------- | ------------- | ----------------- | ------------ | ------------ | ------------ | ------------ | ---------- | ----------- | ----------- | ----------- |
| Baseline                                                                             | 416(+0.00%)        | 704(+0.00%)   | 896(+0.00%)   | 1152(+0.00%)  | 864(+0.00%)    | 864(+0.00%)   | 448(+0.00%)   | 608(+0.00%)  | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| Combine INSERT_SUBREG + propagate latencies in SWP                                   | 416(+0.00%)        | 704(+0.00%)   | 448(-50.00%)  | 960(-16.67%)  | 416(-51.85%)   | 416(-51.85%)  | 384(-14.29%)  | 736(+21.05%) | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -11.69%      | 23.02      | -25.00%     | +0.00%      | +0.00%      |
| Always track reg pressure in PreMISched                                              | 416(+0.00%)        | 704(+0.00%)   | 448(+0.00%)   | 960(+0.00%)   | 416(+0.00%)    | 416(+0.00%)   | 384(+0.00%)   | 736(+0.00%)  | 480(+0.00%)   | 352(+0.00%)   | 512(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.00%       | 0.00       | +0.00%      | +0.00%      | +0.00%      |
| More accurate PressureChange computation + delay instructions likely to cause spills | 416(+0.00%)        | 256(-63.64%)  | 448(+0.00%)   | 704(-26.67%)  | 416(+0.00%)    | 416(+0.00%)   | 256(-33.33%)  | 608(-17.39%) | 608(+26.67%)  | 224(-36.36%)  | 320(-37.50%)      | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -13.44%      | 23.34      | -34.09%     | +0.00%      | +0.00%      |
| Estimate RegPressure in SWP and increase II if necessary                             | 416(+0.00%)        | 256(+0.00%)   | 320(-28.57%)  | 416(-40.91%)  | 288(-30.77%)   | 288(-30.77%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -9.36%       | 15.58      | -29.12%     | +0.00%      | +0.00%      |
| Do not block a whole cycle for instrs with an unknown slot                           | 416(+0.00%)        | 256(+0.00%)   | 256(-20.00%)  | 352(-15.38%)  | 256(-11.11%)   | 224(-22.22%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | -4.91%       | 8.40       | -12.18%     | +0.00%      | +0.00%      |
| Model resource conflicts in PreMISched                                               | 416(+0.00%)        | 256(+0.00%)   | 256(+0.00%)   | 384(+9.09%)   | 256(+0.00%)    | 224(+0.00%)   | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +0.65%       | 2.43       | +0.00%      | +0.00%      | +0.00%      |
| Run coalescer again after PreMISched                                                 | 416(+0.00%)        | 256(+0.00%)   | 320(+25.00%)  | 448(+16.67%)  | 320(+25.00%)   | 288(+28.57%)  | 256(+0.00%)   | 608(+0.00%)  | 608(+0.00%)   | 224(+0.00%)   | 320(+0.00%)       | 160(+0.00%)  | 192(+0.00%)  | 160(+0.00%)  | +6.80%       | 11.42      | +0.00%      | +0.00%      | +18.75%     |
| Total diff                                                                           | SAME(+0.00%)       | IMPR(-63.64%) | IMPR(-64.29%) | IMPR(-61.11%) | IMPR(-62.96%)  | IMPR(-66.67%) | IMPR(-42.86%) | SAME(+0.00%) | REGR(+26.67%) | IMPR(-36.36%) | IMPR(-37.50%)     | SAME(+0.00%) | SAME(+0.00%) | SAME(+0.00%) | -29.19%      | 32.43      | -63.13%     | -36.93%     | +0.00%      |

andcarminati · 2024-05-27T07:43:51Z

llvm/lib/Target/AIE/AIEPostSelectOptimize.cpp

+}
+
+/// Look for INSERT_SUBREG that can be rewritten as REG_SEQUENCE
+bool combineINSERT_SUBREG(MachineBasicBlock &MBB) {


Nice combine pattern!

llvm/test/CodeGen/AIE/aie2/schedule/pre_ra/transitive.mir

andcarminati · 2024-05-27T07:58:06Z

llvm/lib/Target/AIE/AIEBaseSubtarget.cpp

+          }))
+        continue;
+
+      // Find the max latency one can "move" from predecessors to successors


I'm a bit confused at this point. Here the comment says that we are looking for max latency, but in fact we are searching for the min latency.

I found that confusing myself 😆 I essentially want to find the maximum "amount of latency" that I can move from predecessors to successors. Given that I do not want to make latencies negative, I can only subtract the min of all predecessor latencies. I'd be happy to find a better way to rephrase that :) I can also add examples, it's mostly useful for REG_SEQUENCE at this point.

common predecessor latency?

Alternatively, we could create a new edge with the effective latency for each pair of in- and out- edges and make all incoming latencies zero.
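To make the "max latency one can move" wording concrete, here is a minimal standalone sketch of the idea discussed in this thread. `CopyLikeNode` and `propagateLatency` are hypothetical names for illustration only; they are not the SUnit/SDep API and not the code in this PR. The amount that can be moved is the minimum of the incoming latencies, because subtracting more would make at least one incoming edge negative.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a copy-like DAG node (e.g. a REG_SEQUENCE).
// This is NOT the LLVM SUnit/SDep API, just plain edge latencies.
struct CopyLikeNode {
  std::vector<int> PredLatencies; // latency on each incoming edge
  std::vector<int> SuccLatencies; // latency on each outgoing edge
};

// Moves as much latency as possible from the incoming edges to the outgoing
// edges without making any incoming latency negative.
// Returns the amount of latency that was moved.
int propagateLatency(CopyLikeNode &N) {
  if (N.PredLatencies.empty() || N.SuccLatencies.empty())
    return 0;
  // The largest movable amount is the smallest predecessor latency.
  const int Movable =
      *std::min_element(N.PredLatencies.begin(), N.PredLatencies.end());
  for (int &Lat : N.PredLatencies)
    Lat -= Movable;
  for (int &Lat : N.SuccLatencies)
    Lat += Movable;
  return Movable;
}
```

With incoming latencies {3, 5} and one outgoing latency of 1, the movable amount is 3: the incoming edges become {0, 2}, the outgoing edge becomes 4, and every predecessor-to-successor path keeps its total latency (4 and 6).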

andcarminati · 2024-05-27T08:50:09Z

llvm/lib/Target/AIE/AIEBaseSubtarget.cpp

+  // The default policy is to avoid tracking pressure for "small regions". For
+  // AIE, it is critical to estimate the pressure everywhere, especially small
+  // loops. Spills are very expensive.
+  Policy.ShouldTrackPressure = true;


Do you think it will be useful to have a hidden command line option disabling this? I think it can help the comparison without a rebuild, as some regression can be expected at this moment.

I'm always happy to add more options

Perhaps in the form of the value for 'small' ?

andcarminati · 2024-05-27T09:22:40Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+    if (!U.isReg() || !U.getReg().isVirtual())
+      continue;
+    LaneBitmask LiveLanes =
+        LiveRegs.contains(U.getReg()) & (~DefinedRegs.contains(U.getReg()));


I think it would be nice to have a comment here saying that we are not in SSA anymore. When I see virtual regs I start to think in SSA mode, which is not the case here. I think it is just a small clarification.

andcarminati · 2024-05-27T09:54:54Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+  return PDiff;
+}
+
+PressureChange getPressureChange(const PressureDiff &PD, bool FindMin = true) {


I think this is a good candidate for the target-independent part.

I'll try to see if other targets could make use of it.

martien-de-jong · 2024-05-27T10:01:57Z

llvm/lib/Target/AIE/AIE2RegisterInfo.cpp

+
+SmallSet<int, 8>
+AIE2RegisterInfo::getCoveringSubRegs(const TargetRegisterClass &RC) const {
+  // TODO: This could be generated from TableGen by looking at MCRegisters.


Shocking that this doesn't exist. I guess we could also use this in spill code expansion?

Spill code expansion is a bit different as it already deals with physical registers

andcarminati · 2024-05-27T10:06:11Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

 bool AIEPreRASchedStrategy::isAvailableNode(SUnit &SU, SchedBoundary &Zone,
                                            bool /*VerifyReadyCycle*/) const {
  // Force verifying if SU is ready to be scheduled in terms of cycle.
-  return MachineSchedStrategy::isAvailableNode(SU, Zone,
-                                               /*VerifyReadyCycle=*/true);
+  bool Avail = MachineSchedStrategy::isAvailableNode(SU, Zone,


NIT: const bool

andcarminati · 2024-05-27T10:07:04Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+  // The node can be scheduled, but check if it increases the pressure too much.
+  // If so, try to delay it until another instruction decreases the pressure.
+  const RegPressureTracker &BotRPT = DAG->getBotRPTracker();
+  PressureChange WorstPC =


NIT: const PressureChange

martien-de-jong · 2024-05-27T10:07:48Z

llvm/lib/Target/AIE/AIEPostSelectOptimize.cpp

+  // Recursively traverse INSERT_SUBREG chains in a same MBB.
+  std::function<void(const MachineInstr &)> Impl = [&](const MachineInstr &MI) {
+    assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG);
+    Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg());


Perhaps have a comment with the INSERT_SUBREG signature.

andcarminati · 2024-05-27T10:07:54Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+    return true;
+  }
+
+  unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[WorstPC.getPSet()];


NIT: const unsigned CurrPressure

andcarminati · 2024-05-27T10:09:10Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+  }
+
+  for (const SUnit *PendingSU : Zone.Pending) {
+    PressureDiff PDiff = estimatedPressureDiff(*PendingSU, BotRPT);


NIT: const PressureDiff PDiff

martien-de-jong · 2024-05-27T10:16:14Z

llvm/lib/Target/AIE/AIEPostSelectOptimize.cpp

+    for (const auto &[SubregIdx, Reg] : Subregs) {
+      MIB.addReg(Reg).addImm(SubregIdx);
+    }
+    MI.eraseFromParent();


I guess this could theoretically be part of another INSERT_SUBREG chain, which would then need to recognize INSERT_SUBREG on top of the newly created REG_SEQUENCE. Probably not worth it.

I also thought about it, so far it's good enough for most benchmarks.
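For readers unfamiliar with the combine, the chain traversal can be modelled roughly as below. This is a simplified sketch using plain integers for virtual registers and subregister indices; `InsertSubreg`, `DefMap`, `collectSubregs` and `coversAll` are hypothetical names, not the helpers in AIEPostSelectOptimize.cpp. It assumes the usual operand order `%dst = INSERT_SUBREG %src, %val, subidx` and SSA form (no cycles in the chain).

```cpp
#include <map>
#include <set>
#include <utility>

// Hypothetical model of `%dst = INSERT_SUBREG %src, %val, subidx`.
struct InsertSubreg {
  int Src;    // register being updated (operand 1)
  int Val;    // value inserted into the subregister (operand 2)
  int SubIdx; // subregister index (operand 3)
};

// Maps a destination vreg to the INSERT_SUBREG defining it.
using DefMap = std::map<int, InsertSubreg>;

// Walks the chain upwards from TopDst and records, for each subregister
// index, the value inserted closest to the top of the chain.
std::map<int, int> collectSubregs(const DefMap &Defs, int TopDst) {
  std::map<int, int> Subregs;
  DefMap::const_iterator It = Defs.find(TopDst);
  while (It != Defs.end()) {
    // insert() keeps the existing entry, so later (topmost) inserts win.
    Subregs.insert(std::make_pair(It->second.SubIdx, It->second.Val));
    It = Defs.find(It->second.Src);
  }
  return Subregs;
}

// A REG_SEQUENCE rewrite is only possible if every covering subregister
// index received a value somewhere in the chain.
bool coversAll(const std::map<int, int> &Subregs,
               const std::set<int> &CoveringIdx) {
  for (int Idx : CoveringIdx)
    if (Subregs.count(Idx) == 0)
      return false;
  return true;
}
```

In this model a chain that writes index 1, then index 2, then index 1 again yields two entries, with the last write to index 1 taking precedence.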

andcarminati · 2024-05-27T10:18:49Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+      unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];
+      unsigned Threshold =
+          TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet());
+      return Threshold <= 4 || CurrPressure >= Threshold - 4;


Is this number 4 a tuning option? If yes, we could have an option to change it...
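If the margin were made tunable, the check could look roughly like the sketch below. `isNearCritical` with a `Margin` parameter is a hypothetical standalone helper, not the lambda from the patch; the default of 4 simply mirrors the constant in the diff. Note that the `Threshold <= Margin` guard has to come first, since `Threshold - Margin` would wrap around for unsigned values.

```cpp
// Hypothetical standalone version of the near-critical check, with the
// hard-coded 4 turned into a parameter (e.g. fed from a command-line option).
bool isNearCritical(unsigned CurrPressure, unsigned Threshold,
                    unsigned Margin = 4) {
  // Guard against unsigned wrap-around before computing Threshold - Margin.
  return Threshold <= Margin || CurrPressure >= Threshold - Margin;
}
```

For a limit of 20, pressure 16 is considered near-critical with the default margin, while pressure 10 is not.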

martien-de-jong · 2024-05-27T10:19:40Z

llvm/lib/Target/AIE/AIEPostSelectOptimize.cpp

+    assert(MI.getOpcode() == TargetOpcode::INSERT_SUBREG);
+    Subregs.try_emplace(MI.getOperand(3).getImm(), MI.getOperand(2).getReg());
+    MachineInstr &SrcMI = *MRI.getVRegDef(MI.getOperand(1).getReg());
+    if (SrcMI.getParent() == MI.getParent() &&


Why the basic block restriction? We're only rewriting the top INSERT_SUBREG and leaving the rest to DCE. I guess it would just work.

I'm afraid of loop nests. I would not want to rewrite INSERT_SUBREG instructions that have different nesting levels

andcarminati · 2024-05-27T10:20:23Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+    auto IsNearCritical = [&](const PressureChange &PC) {
+      if (!PC.isValid())
+        return false;
+      unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()];


NIT:

const unsigned CurrPressure = BotRPT.getRegSetPressureAtPos()[PC.getPSet()]; const unsigned Threshold...

andcarminati · 2024-05-27T10:21:21Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+          TRI->getRegPressureSetLimit(*CurMBB->getParent(), PC.getPSet());
+      return Threshold <= 4 || CurrPressure >= Threshold - 4;
+    };
+    PressureChange TryCandPC =


NIT: can be const as well.

andcarminati · 2024-05-27T10:31:57Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+  return true;
+}
+
+bool AIEPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,


I was wondering if it would be possible to add a test that just runs the machine scheduler and presents an easier way to see the effects of this change. Currently, the effects can only be seen indirectly through other tests. The changed tests do give some insight into it...

martien-de-jong · 2024-05-27T10:32:45Z

llvm/lib/Target/AIE/AIEBaseSubtarget.cpp

+
+      // Only look at COPY and REG_SEQUENCE if requested
+      if (OnlyCopyLike && !MI.isCopy() &&
+          MI.getOpcode() != TargetOpcode::REG_SEQUENCE)


I guess EXTRACT_SUBREG could have the same treatment. Or whatever it is that splits registers for e.g. multi-reg store.

martien-de-jong · 2024-05-27T11:29:29Z

llvm/lib/Target/AIE/AIEBaseSubtarget.cpp

+
+      // Only look at COPY and REG_SEQUENCE if requested
+      if (OnlyCopyLike && !MI.isCopy() &&
+          MI.getOpcode() != TargetOpcode::REG_SEQUENCE)


Also, we may want to ignore cross reg-bank copies.

martien-de-jong · 2024-05-27T11:47:18Z

llvm/lib/Target/AIE/AIEBaseSubtarget.cpp

+                                               unsigned NumRegionInstrs) const {
+  // The default policy is to avoid tracking pressure for "small regions". For
+  // AIE, it is critical to estimate the pressure everywhere, especially small
+  // loops. Spills are very expensive.


Yeah, right. I guess 'small' is defined by some absolute constant that defines it to match some architecture's wishes. I guess a better interface would pass in the region and let you dynamically decide on the interesting pressure classes.

martien-de-jong · 2024-05-27T12:33:17Z

llvm/lib/Target/AIE/AIEMachineScheduler.cpp

+    return true;
+  }
+
+  // Bias PhysReg Defs and copies to their uses and defined respectively.


nit: defines, or defs

martien-de-jong · 2024-05-27T13:01:17Z

llvm/lib/Target/AIE/AIEBasePipelinerLoopInfo.cpp

+      if (Pressure.MaxSetPressure[I] > Limit) {
+        LLVM_DEBUG(dbgs() << TRI->getRegPressureSetName(I) << " Limit " << Limit
+                          << " Actual " << Pressure.MaxSetPressure[I] << "\n");
+        PressureExcess = true;


return true immediately?

I just wanted a chance to debug-print all the critical pressure sets

martien-de-jong · 2024-05-27T14:10:28Z

llvm/lib/Target/AIE/AIEBundle.h

    if (isNoHazardMetaInstruction(Instr->getOpcode())) {
      MetaInstrs.push_back(Instr);
      return;
    }
    // Check if the pre-condition is ensured
-    assert(!isStandalone() &&
+    assert((!ComputeSlots || !isStandalone()) &&
           "Tried to add an instruction in a standalone Bundle");


I have an intuitive feeling that we should have a corresponding change in canAdd, similar to the handling of isStandAlone()

We do not query canAdd anymore in the scheduler, that's why I didn't add it. For symmetry, I can do so. Should I? :)

(That's a "no, don't.")

llvm/lib/Target/AIE/AIEHazardRecognizer.cpp

martien-de-jong · 2024-05-27T14:54:36Z

llvm/lib/Target/AIE/AIE2TargetMachine.cpp

@@ -60,6 +60,9 @@ static cl::opt<bool>
 static cl::opt<bool>
    AllocateMRegsFirst("aie-mod-ra-first", cl::Hidden, cl::init(false),
                       cl::desc("Allocate M registers first in staged RA."));
+static cl::opt<bool> EnablePreMISchedCoalescer(


nit: ambiguous name, might be construed as "Coalescer running before MISched"

Should I rename to --aie-coalescer-after-premisched maybe?

martien-de-jong · 2024-05-27T14:55:21Z

llvm/lib/Target/AIE/AIE2TargetMachine.cpp

+
+  // Pre-RA scheduling might have exposed simplifiable copies.
+  if (EnablePreMISchedCoalescer)
+    addPass(&RegisterCoalescerID);


Would you have an explicit example where it helps?

This does help a lot with the GEMM_bf16 kernel after all the "pressure-reducing" scheduling is done. Then this really forces greedy into allocating the same vreg and limits the number of copies. Or did you want me to add a test for RegisterCoalescer?

martien-de-jong

I'm worried about the option explosion and the many ways we make exceptions for particular situations like 'unknown slots'.
I would love to encode 'stand-alone' instruction by reserving a 'stand-alone' slot with a unique format, and assert that we only get correct stuff.

gbossu · 2024-05-28T10:15:33Z

I think I have addressed most of the comments in fixup! commits, please have a look @martien-de-jong @andcarminati :)

andcarminati · 2024-05-28T13:12:05Z

llvm/lib/Target/AIE/AIEBasePipelinerLoopInfo.cpp

@@ -337,6 +344,119 @@ DownCountLoop::Assessment DownCountLoop::accept(MachineInstr *EndLoop) {
  return Assessment::Accept;
 }

+/// Get an instruction sequence from an \p SMS schedule that is estimated


NIT: a \p SMS

andcarminati · 2024-05-29T07:40:55Z

I can see that this PR presents really promising results, especially for some innermost loops. It is also very positive for stack usage reduction, which shows that the pressure information is definitely important. GEMM_int8_0 is a good case to take a look at in the future (for stack).

gbossu · 2024-05-29T08:36:40Z

> I'm worried about the option explosion and the many ways we make exceptions for particular situations like 'unknown slots'. I would love to encode 'stand-alone' instruction by reserving a 'stand-alone' slot with a unique format, and assert that we only get correct stuff.

Discussed offline:

- Regarding the options, this is on purpose so we can quickly tweak the modelling without recompiling. I moved the computation for some of those options into the constructor of AIEHazardRecognizer so that it is now done in a single place.
- Regarding the "unknown slots", this is indeed changing one exception (consider those instructions as standalone, i.e. they can't be bundled with other instructions) for another (allowing them to be added to any Bundle). I think this also shows that AIEBundle isn't right anymore for the purpose of the MachineScheduler (at least the premisched where there are still "generic" instructions). I think we'll converge on something better as we really start to work on the premisched.

andcarminati

LGTM.

llvm/lib/Target/AIE/AIEHazardRecognizer.cpp

This is in particular useful for the SW pipeliner, as it simplifies the DAG and makes the inputs to the REG_SEQUENCE more likely to end up in the same stage.

This is mainly useful for the SW pipeliner to keep instructions like REG_SEQUENCE in the same stage as their inputs. This then helps reduce the number of COPY instructions.

This also excludes some compound reg classes from the pressure sets computation. This is because SPARSE registers are pairs of X and Q registers. As there are only 4 Q registers, this causes the pressure threshold for QX (SPARSE) registers to be pretty low, and having live X registers would essentially always cause it to be exceeded. This commit generally had a slightly negative effect on QoR, but the performance will be regained when tracking pressure more finely in a future commit.

This uses the current live registers to compute the pressure changes of candidates. If an instruction is likely to cause spills and another pending instruction can help reduce the pressure, then the former is delayed.
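The delay rule described above can be sketched for a single pressure set as follows. This is a deliberately simplified model: `PendingInstr` and `shouldDelay` are hypothetical names, and the real scheduler works on per-set `PressureDiff` data rather than one integer.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model: net effect of a pending instruction on one pressure set.
struct PendingInstr {
  int PressureDelta; // negative values reduce the pressure
};

// Delays the candidate only if (a) scheduling it now would push the pressure
// above the limit and (b) some pending instruction could relieve the pressure.
bool shouldDelay(int CandDelta, int CurrPressure, int Limit,
                 const std::vector<PendingInstr> &Pending) {
  const bool LikelyToSpill = CandDelta > 0 && CurrPressure + CandDelta > Limit;
  if (!LikelyToSpill)
    return false;
  return std::any_of(
      Pending.begin(), Pending.end(),
      [](const PendingInstr &P) { return P.PressureDelta < 0; });
}
```

If no pending instruction reduces the pressure, delaying would not help, so the candidate is scheduled anyway.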

This can be used to disregard schedules that have e.g. too much register pressure.

If it is estimated that the RA pipeline will spill, then the II is increased. This will typically decrease the number of stages and the number of registers that need to be carried between stages.
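The II bump can be pictured as a simple search loop. The sketch below is an assumption-laden illustration: `findAcceptableII` and the `LikelyToSpill` callback are hypothetical, and in the actual pipeliner the schedule is recomputed for each candidate II before the pressure is estimated.

```cpp
#include <functional>

// Tries increasing initiation intervals until the (hypothetical) estimator
// no longer predicts spills. Returns 0 if no II in [MinII, MaxII] is
// acceptable; 0 is safe as a sentinel because a valid II is at least 1.
unsigned findAcceptableII(unsigned MinII, unsigned MaxII,
                          const std::function<bool(unsigned)> &LikelyToSpill) {
  for (unsigned II = MinII; II <= MaxII; ++II)
    if (!LikelyToSpill(II))
      return II;
  return 0;
}
```

For example, with an estimator that predicts spills for every II below 5, a search over [3, 8] settles on II = 5.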

This does two things:

1. Do not block a whole cycle for instructions with an unknown VLIW slot. Typically those are COPY instructions. This can be tweaked with --aie-premisched-ignore-unknown-slots=0/1.
2. Track scoreboard conflicts. This can hurt by delaying instructions that require late resources due to the MachineScheduler only inserting instructions in the current cycle. On average, this brings QoR improvements though. Tweakable with --aie-premisched-fu-depth=int.

gbossu requested review from andcarminati, abhinay-anubola and martien-de-jong May 23, 2024 09:10

andcarminati reviewed May 27, 2024
