The regular sub-group reduction does not take layouts into account, which can lead to subpar performance on PVC. This kind of workload appears when a reduction follows a matrix multiplication, or more generally operates on a tensor with the same layout as the output of a matrix multiplication (the DPAS layout). #2907 was the last PR trying to fix this at the Triton level; in parallel, an attempt was made to fix it in IGC. The IGC approach was still subpar, however, as it required moving data around after the reduction while also using considerably more operations for the reduction itself.
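For reference, here is a minimal sketch (hypothetical shapes and names, not a kernel from the repo) of the pattern in question: a row-wise reduction applied directly to the result of `tl.dot`, so the reduced tensor carries the DPAS layout on PVC.

```python
import triton
import triton.language as tl

@triton.jit
def dot_then_reduce(a_ptr, b_ptr, out_ptr,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                    BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
    acc = tl.dot(a, b)             # accumulator lands in the DPAS layout on PVC
    row_max = tl.max(acc, axis=1)  # sub-group reduction over a DPAS-layout tensor
    tl.store(out_ptr + offs_m, row_max)
```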
Running FlashAttention with the SIMD reduction does not currently give good performance: per my investigation, spilling is much higher in that case. This should not happen, as the algorithm should not increase register pressure, so it may be related to suboptimal instruction scheduling.
Reducing the DModel dimension to just 16, so that no spilling takes place, leads to better performance and overall better codegen with the SIMD reduction than with the baseline reduction approach. This suggests the SIMD reduction will give better performance (as well as being more general), since it acts at a higher level.
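A back-of-envelope check (my numbers, assuming BLOCK_M = 128, fp32 accumulators, and a 16-lane sub-group; the real benchmark config may differ) of why shrinking DModel removes the spilling:

```python
# Rough per-lane footprint of the output accumulator alone, ignoring Q/K/V
# tiles and temporaries (assumed shapes, not measured data).
def acc_bytes_per_lane(block_m=128, d_model=64, lanes=16, elem_bytes=4):
    return block_m * d_model * elem_bytes // lanes

print(acc_bytes_per_lane(d_model=64))  # 2048 B per lane
print(acc_bytes_per_lane(d_model=16))  # 512 B per lane
# PVC provides 8 KB of GRF per thread by default (16 KB in large-GRF mode),
# so the DModel=16 case leaves far more headroom before spilling kicks in.
```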
Now, to take full advantage of the optimization, we have two possible paths:
Improve instruction scheduling in the backend
Explore splitting tensors across warps in the DModel (reduction) dimension. This may also alleviate register pressure and avoid spilling while still exploiting the SIMD reduction (see the sketch after this list).
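For the second path, the user-visible analogue is a split along the reduction axis: each worker reduces a narrow slice and the partials are combined afterwards. The actual warp split would live in the compiler's layout assignment, so the program instances below only stand in for warps. A rough sketch under those assumptions:

```python
import triton
import triton.language as tl

@triton.jit
def rowmax_split_d(x_ptr, partial_ptr, D_MODEL: tl.constexpr,
                   BLOCK_M: tl.constexpr, BLOCK_D: tl.constexpr):
    # Each program instance reduces a BLOCK_D-wide slice of the DModel axis,
    # keeping its live accumulator small; partials are combined in a second pass.
    pid_d = tl.program_id(0)
    offs_m = tl.arange(0, BLOCK_M)
    offs_d = pid_d * BLOCK_D + tl.arange(0, BLOCK_D)
    x = tl.load(x_ptr + offs_m[:, None] * D_MODEL + offs_d[None, :])
    partial = tl.max(x, axis=1)  # narrow slice -> low register pressure
    tl.store(partial_ptr + pid_d * BLOCK_M + offs_m, partial)
```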
When this issue is resolved and the performance investigation shows good results, we can enable the pass by default in the pipeline. This can be done either by reopening #2748 or by filing a new issue.