I'm timing the operation t(X)*X on row-major matrices that have many more rows than columns.
For this type of input, I'm finding that gemmt is much slower than the equivalent syrk or gemm call, by a very wide margin.
Timings in milliseconds for input size 1,000,000 x 32 on an Intel i12700H, average of 3 runs:
gemmt: 216.178
syrk: 41.0468
gemm: 39.55553
Version: OpenBLAS 0.3.28, built with OpenMP, compiled from source (gcc with the cmake build system). The same issue happens with pthreads, and the same timing difference is observed when running single-threaded.
The current GEMMT implementation is just a loop around GEMV, so its performance largely depends on that of the individual optimized kernels for the latter. It is provided for compatibility, but not yet optimized for speed. As the Reference BLAS looks to be adding its own interpretation of what used to be an unofficial extension, a total rework may be necessary at some point in any case.
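To illustrate what "a loop around GEMV" means for the t(X)*X case from the original post, here is a minimal sketch. It is not the code in interface/gemmt.c; the names xtx_upper_gemv_loop and gemv_t_ref are hypothetical, and the plain-loop GEMV stands in for an optimized kernel. Each row of the upper triangle of C is produced by one GEMV over the whole tall matrix X, so X is streamed from memory once per row of C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an optimized GEMV kernel (transposed form):
   y := alpha * M^T * x + beta * y.
   M is rows x cols, row-major with leading dimension ldm;
   x has `rows` elements with stride incx; y has `cols` contiguous elements. */
static void gemv_t_ref(size_t rows, size_t cols, double alpha,
                       const double *M, size_t ldm,
                       const double *x, size_t incx,
                       double beta, double *y)
{
    for (size_t j = 0; j < cols; j++)
        y[j] = (beta == 0.0) ? 0.0 : beta * y[j];
    for (size_t r = 0; r < rows; r++) {
        double ax = alpha * x[r * incx];
        for (size_t j = 0; j < cols; j++)
            y[j] += ax * M[r * ldm + j];
    }
}

/* Upper triangle of C := alpha * X^T * X + beta * C, one GEMV per row of C.
   X is m x n row-major (ldx >= n), C is n x n row-major (ldc >= n).
   Entries of C below the diagonal are left untouched. */
void xtx_upper_gemv_loop(size_t m, size_t n, double alpha,
                         const double *X, size_t ldx,
                         double beta, double *C, size_t ldc)
{
    assert(ldx >= n && ldc >= n);
    for (size_t i = 0; i < n; i++) {
        /* Row i, columns i..n-1: y = (X[:, i:n])^T * X[:, i].
           Column i of X is read as a strided vector (stride ldx). */
        gemv_t_ref(m, n - i, alpha, X + i, ldx, X + i, ldx,
                   beta, C + i * ldc + i);
    }
}
```

With m = 1,000,000 and n = 32, this structure makes 32 separate passes over X, each doing little arithmetic per byte loaded, which is one plausible reason a GEMV-based gemmt would trail gemm and syrk on tall, narrow inputs.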
On ARM, the performance of this somewhat naive gemmt implementation is about on par with gemm and clearly better than syrk, provided the number of threads is capped at about 30 (on 64 cores, gemmt comes out horrendously bad again, taking about ten times as long as gemm). Interestingly, the obvious optimization of allocating the memory buffer only once, instead of allocating and freeing it for every individual gemv step in interface/gemmt.c, does not result in significant improvement.
If someone wants to add a generic implementation with a reasonable performance, the LAPACK implementation prior to the introduction of GEMMT (aka GEMMTR in LAPACK) as part of the BLAS may serve as inspiration. It reduces the problem essentially to GEMM by blocking into panels; only the small triangular part is computed with GEMV.
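A rough sketch of that panel-blocking idea, again specialized to the upper triangle of t(X)*X, might look like the following. This is an illustration under assumptions, not the LAPACK or OpenBLAS code: the function names and the block size parameter nb are hypothetical, and the plain-loop gemm_tn_ref and gemv_t_ref stand in for optimized GEMM and GEMV kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an optimized GEMM: C := alpha * A^T * B + beta * C.
   A is m x p, B is m x q, C is p x q, all row-major. */
static void gemm_tn_ref(size_t m, size_t p, size_t q, double alpha,
                        const double *A, size_t lda,
                        const double *B, size_t ldb,
                        double beta, double *C, size_t ldc)
{
    for (size_t i = 0; i < p; i++)
        for (size_t j = 0; j < q; j++)
            C[i * ldc + j] = (beta == 0.0) ? 0.0 : beta * C[i * ldc + j];
    for (size_t r = 0; r < m; r++)
        for (size_t i = 0; i < p; i++) {
            double a = alpha * A[r * lda + i];
            for (size_t j = 0; j < q; j++)
                C[i * ldc + j] += a * B[r * ldb + j];
        }
}

/* Hypothetical stand-in for GEMV (transposed): y := alpha * M^T * x + beta * y. */
static void gemv_t_ref(size_t rows, size_t cols, double alpha,
                       const double *M, size_t ldm,
                       const double *x, size_t incx,
                       double beta, double *y)
{
    for (size_t j = 0; j < cols; j++)
        y[j] = (beta == 0.0) ? 0.0 : beta * y[j];
    for (size_t r = 0; r < rows; r++) {
        double ax = alpha * x[r * incx];
        for (size_t j = 0; j < cols; j++)
            y[j] += ax * M[r * ldm + j];
    }
}

/* Upper triangle of C := alpha * X^T * X + beta * C, blocked by nb columns.
   X is m x n row-major, C is n x n row-major; the strict lower triangle
   of C is left untouched. */
void xtx_upper_blocked(size_t m, size_t n, size_t nb, double alpha,
                       const double *X, size_t ldx,
                       double beta, double *C, size_t ldc)
{
    assert(nb > 0 && ldx >= n && ldc >= n);
    for (size_t i0 = 0; i0 < n; i0 += nb) {
        size_t ib = (n - i0 < nb) ? n - i0 : nb;  /* rows in this block row */
        size_t i1 = i0 + ib;
        /* Small triangular diagonal block: one GEMV per row. */
        for (size_t i = i0; i < i1; i++)
            gemv_t_ref(m, i1 - i, alpha, X + i, ldx, X + i, ldx,
                       beta, C + i * ldc + i);
        /* Rectangular panel to the right of the diagonal block: one GEMM. */
        if (i1 < n)
            gemm_tn_ref(m, ib, n - i1, alpha, X + i0, ldx, X + i1, ldx,
                        beta, C + i0 * ldc + i1, ldc);
    }
}
```

The design point is that only the ib x ib diagonal blocks go through GEMV, while everything to their right is handed to GEMM in a single call per block row, so most of the arithmetic lands in the better-optimized kernel.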
For reference, timings for other libraries:
gemmt: 25.66533
syrk: 12.57197
gemm: 15.69447
tabmat's "sandwich" op: 29.3
Code to reproduce: