[WIP] Arm®v9-A architecture SME2 SGEMM kernels #5011
base: develop
Thank you - I started working on this myself, but in assembly (based on the example in the developer docs), and predictably got lost :/
Add a new target, ARMV9SME, for Arm®v9-A architecture systems that support the Scalable Matrix Extension (SME) [1]. The target initially inherits the ARMV8SVE settings, with updated compiler flags, and can only be built with an SME-capable toolchain such as GCC 14 or LLVM 19. It includes some initial FEAT_SME2 feature detection on Linux targets via hwcaps, and it is disabled in DYNAMIC_ARCH builds by default. This is intended as a base target for SME2 kernels.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
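For readers unfamiliar with hwcap-based detection, the following is a minimal sketch of how an SME2 check on Linux could look. It is not the code from this PR. The helper names `hwcap2_has_sme2` and `cpu_supports_sme2` are hypothetical, and the fallback value for `HWCAP2_SME2` (bit 37 of `AT_HWCAP2`) is an assumption taken from the Linux arm64 uapi headers, used only when the toolchain headers do not define it.

```c
#include <stdio.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#endif

/* Assumption: FEAT_SME2 is reported in bit 37 of AT_HWCAP2, as in the Linux
   arm64 uapi header. Older libc headers may not define the macro at all. */
#ifndef HWCAP2_SME2
#define HWCAP2_SME2 (1ULL << 37)
#endif

/* Hypothetical helper: a pure bit test, so it can be exercised on any host. */
static int hwcap2_has_sme2(unsigned long long hwcap2)
{
    return (hwcap2 & HWCAP2_SME2) != 0;
}

/* Hypothetical helper: returns 1 if the running kernel reports SME2, else 0.
   On anything other than AArch64 Linux it conservatively reports 0. */
static int cpu_supports_sme2(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return hwcap2_has_sme2(getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

A caller would typically evaluate `cpu_supports_sme2()` once at start-up and fall back to an SVE or NEON code path when it returns 0.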
Add implementation of SGEMM based on the Arm®v9-A architecture Scalable Matrix Extension (SME) [1], using the Arm C Language Extensions (ACLE) [2]. Add SME2 compute & packing kernels for SGEMM and enable them under the ARMV9SME target.

The compute kernel performs outer products on panels of A and B, accumulating into 2x2 inner blocks of C via the SME two-dimensional architectural register, ZA.

The non-transpose packing kernel performs a copy into a contiguous buffer using SVE loads & stores in Streaming SVE mode. Streaming SVE is an execution mode introduced by SME that supports execution of SVE code with the SME-defined vector length, known as the Streaming SVE vector length (SVL).

The transpose packing kernel performs on-the-fly transposition by utilizing horizontal & vertical tile slice access to the SME ZA register.

Includes an update to the driver to account for the expanded inner block.

Note: this places the ARMV9SME target in WIP state. It is functional for SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not been updated to match the larger kernel size, so SYMM/TRMM tests are currently expected to fail in this WIP state.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
[2] https://arm-software.github.io/acle/main/acle.html
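The transposition trick in the transpose packing kernel can be illustrated with a plain scalar model. This is a sketch only, not the ACLE kernel from this PR: the 2-D array `za` stands in for one ZA tile, `TILE` is a hypothetical fixed edge length (real hardware derives it from the SVL), and the function name is invented for illustration.

```c
#include <stddef.h>

/* Hypothetical tile edge: stands in for the number of 32-bit elements in one
   Streaming SVE vector. Real hardware determines this from the SVL. */
#define TILE 4

/* Scalar model of transposing one TILE x TILE block through a ZA-like tile.
   src is row-major with leading dimension ld_src; dst receives the
   transposed block as TILE contiguous rows of TILE floats. */
static void transpose_block_via_tile(const float *src, size_t ld_src, float *dst)
{
    float za[TILE][TILE];   /* stand-in for a single ZA tile */

    /* Horizontal slice writes: slice r receives source row r. */
    for (size_t r = 0; r < TILE; ++r)
        for (size_t c = 0; c < TILE; ++c)
            za[r][c] = src[r * ld_src + c];

    /* Vertical slice reads: slice c is column c of the tile, which is
       row c of the transposed block, stored contiguously. */
    for (size_t c = 0; c < TILE; ++c)
        for (size_t r = 0; r < TILE; ++r)
            dst[c * TILE + r] = za[r][c];
}
```

In the model, no element is ever moved inside the tile: the transposition comes entirely from writing along one axis and reading along the other, which is the property the horizontal and vertical slice accesses provide.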
@martin-frbg Thanks! The CI failures on macOS should hopefully be sorted now, though there might just be a timeout on the Windows run that needs re-running. Indeed, I don't think there are any systems available in the current CI setup that can functionally test this, but it's good to see the target built. With regards to a Out of interest: once it's been through review, are we able to merge this code in as a WIP target (disabled by default) with some routines left non-functional, or would you rather it were left as an open PR for the time being? If we do merge this as-is, later PRs can fill out the missing functionality and improve performance.
Thank you for fixing the Apple jobs. The one Windows job timing out is a semi-regular annoyance caused by the heterogeneity of the Azure cloud - sometimes the CI job gets scheduled on some old hardware that cannot complete the compilation within the allocated hour. I'm all for merging your work as soon as possible, and I'm currently trying to see if it is possible to separate the SSYMM and STRMM implementations from SGEMM, redirecting them to existing implementations. (There is ample precedent in for TRMM, but I'm not sure if it can be made to work for SYMM too).
Thanks, that makes sense!
The idea is to set USE_TRMM=1 for your new target core in kernel/Makefile.L3 and specify a separate source file for the STRMMKERNEL in KERNEL.ARMV9SME, with the limitation that the TRMM kernel needs to use the same GEMM_UNROLL_M and GEMM_UNROLL_N parameters as its GEMM companion. There is a 1x8 strmm_kernel_sve, but I gather that it is not trivially easy to use it in streaming SVE mode without also changing its register use (?), so kernel/generic/trmmkernel4x8.c would be the one to try.
Hmm, on second thought, I'm not sure if we can get at the non-SME copy routines we'd need for that generic TRMM kernel to work alongside the SME GEMM.
Add implementation of SGEMM based on the Arm®v9-A architecture Scalable Matrix Extension (SME), using the Arm C Language Extensions (ACLE).
Includes addition of a new target, ARMV9SME, for generic SME2 targets. This new target inherits existing ARMV8SVE settings by default. It can only be built using an SME-capable toolchain such as GCC 14 or LLVM 19.
The SME2 kernel performs outer products on panels of A and B, accumulating into 2x2 inner blocks of C via the SME two-dimensional architectural register, ZA.
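As a rough illustration of the outer-product formulation, here is a scalar reference model, not the SME2 kernel itself. It assumes that "2x2 inner blocks" means a 2x2 arrangement of ZA tiles, each `VL` x `VL` floats. `VL`, the packed panel layouts, the row-major C layout, and the function name are all assumptions made for this sketch.

```c
#include <stddef.h>

#define VL 4            /* hypothetical: floats per streaming vector */
#define MR (2 * VL)     /* rows of C covered by the 2x2 tile block */
#define NR (2 * VL)     /* columns of C covered by the 2x2 tile block */

/* Scalar model of C_block += A_panel * B_panel using rank-1 updates.
   a_panel: k groups of MR contiguous floats (one column of A per step)
   b_panel: k groups of NR contiguous floats (one row of B per step)
   c:       MR x NR block, row-major, leading dimension ldc (assumed layout) */
static void sgemm_block_outer_product(size_t k, const float *a_panel,
                                      const float *b_panel,
                                      float *c, size_t ldc)
{
    /* Four accumulator tiles, standing in for four ZA tiles. */
    float za[2][2][VL][VL] = {{{{0.0f}}}};

    for (size_t p = 0; p < k; ++p) {
        const float *a = a_panel + p * MR;
        const float *b = b_panel + p * NR;
        /* One outer product per tile: za[ti][tj] += a_ti (x) b_tj. */
        for (size_t ti = 0; ti < 2; ++ti)
            for (size_t tj = 0; tj < 2; ++tj)
                for (size_t i = 0; i < VL; ++i)
                    for (size_t j = 0; j < VL; ++j)
                        za[ti][tj][i][j] += a[ti * VL + i] * b[tj * VL + j];
    }

    /* Add the accumulated tiles into the matching quadrants of C. */
    for (size_t ti = 0; ti < 2; ++ti)
        for (size_t tj = 0; tj < 2; ++tj)
            for (size_t i = 0; i < VL; ++i)
                for (size_t j = 0; j < VL; ++j)
                    c[(ti * VL + i) * ldc + tj * VL + j] += za[ti][tj][i][j];
}
```

Each step of the `p` loop loads two vectors' worth of A and two of B and feeds four tile updates, so every loaded element is reused across two tiles. That reuse is the usual motivation for a 2x2 block over a single tile.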
Note: this is a WIP target. It is functional for SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not been updated to match the larger kernel size, so SYMM/TRMM tests are currently expected to fail in this WIP state.