
Reduce generic matrix*vector latency #56289

Merged
jishnub merged 4 commits into master from jishnub/matvecmul on Oct 23, 2024
Conversation

jishnub (Contributor) commented Oct 22, 2024

julia> using LinearAlgebra

julia> A = rand(Int,4,4); x = rand(Int,4); y = similar(x);

julia> @time mul!(y, A, x, 2, 2);
  0.330489 seconds (792.22 k allocations: 41.519 MiB, 8.75% gc time, 99.99% compilation time) # master
  0.134212 seconds (339.89 k allocations: 17.103 MiB, 15.23% gc time, 99.98% compilation time) # This PR
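
For context, mul!(y, A, x, 2, 2) is the five-argument multiply-add method, equivalent here to y .= 2 .* (A * x) .+ 2 .* y computed in place; the reported time is almost entirely compilation, i.e. the first-call latency this PR targets.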

Main changes:

  • generic_matvecmul! and _generic_matvecmul! now accept alpha and beta arguments instead of a MulAddMul(alpha, beta) object. The methods that accept a MulAddMul are retained for backward compatibility, but they now forward alpha and beta, rather than the forwarding going the other way around (see the sketch after this list).
  • Narrow the scope of the @stable_muladdmul applications: the MulAddMul(alpha, beta) object is constructed only where it is needed in a function call, and that call site is annotated with @stable_muladdmul. This keeps the generated branches small.
  • Add a new internal function with methods for the 'N', 'T' and 'C' cases, so that, firstly, there is less code duplication, and secondly, the _generic_matvecmul! method becomes simple enough for constant propagation to succeed. This eliminates the unnecessary branches: only the branch that is actually taken is compiled.

Together, these changes substantially reduce the TTFX (time to first execution).
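
For illustration, a minimal sketch of the forwarding pattern from the first bullet, with simplified signatures (the actual methods in LinearAlgebra take more arguments, and _generic_matvecmul! stands in for the inner kernel):

# Sketch only: the alpha/beta methods are now the primary ones, and the
# legacy MulAddMul-accepting methods forward to them; previously the
# forwarding went the other way.
using LinearAlgebra: MulAddMul  # internal wrapper holding alpha and beta

generic_matvecmul!(y, tA, A, x, alpha::Number, beta::Number) =
    _generic_matvecmul!(y, tA, A, x, alpha, beta)

# Backward-compatible entry point: unwrap the MulAddMul and forward.
generic_matvecmul!(y, tA, A, x, add::MulAddMul) =
    generic_matvecmul!(y, tA, A, x, add.alpha, add.beta)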

@jishnub added the "linear algebra" and "compiler:latency" labels on Oct 22, 2024
dkarrasch (Member) commented:

What's the effect on runtime, as this seems to introduce branches in "hot loops"?

jishnub (Contributor, author) commented Oct 23, 2024

There's hardly any impact: firstly, the branches are not in the innermost loop, and secondly, they are probably hoisted out of the loop anyway.

julia> using LinearAlgebra

julia> x = rand(Int,3000); A = rand(Int,size(x,1),size(x,1)); y = similar(x);

julia> @btime mul!($y, $A, $x, 2, 2);
  4.239 ms (0 allocations: 0 bytes) # nightly
  4.143 ms (0 allocations: 0 bytes) # This PR

julia> @btime mul!($y, $A', $x, 2, 2);
  4.115 ms (0 allocations: 0 bytes) # nightly
  4.187 ms (0 allocations: 0 bytes) # This PR
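
For intuition, here is a minimal sketch (not the PR's actual code) of the structure being described: the branch on beta runs once, before the hot loops, so the innermost loop remains branch-free.

# Computes y = alpha*A*x + beta*y for a plain column-major matvec.
function matvec_sketch!(y, A, x, alpha, beta)
    # Handle beta up front: this branch executes once per call,
    # not per element.
    if iszero(beta)
        fill!(y, zero(eltype(y)))
    elseif !isone(beta)
        y .*= beta
    end
    # Hot loops: no alpha/beta branching inside.
    for j in axes(A, 2)
        xj = x[j] * alpha
        @inbounds @simd for i in axes(A, 1)
            y[i] += A[i, j] * xj
        end
    end
    return y
end

Whether the compiler additionally hoists any remaining checks is an optimization detail; the benchmarks above show the net effect is negligible.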

dkarrasch (Member) commented:

Just to be safe:

@nanosoldier runbenchmarks("linalg", vs = ":master")

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

dkarrasch (Member) commented:

The performance regressions must be "fake", right? Those benchmarks don't even hit the generic methods, so this LGTM.

jishnub (Contributor, author) commented Oct 23, 2024

Yes, these seem spurious.

jishnub merged commit b9b4dfa into master on Oct 23, 2024
10 checks passed
jishnub deleted the jishnub/matvecmul branch on October 23, 2024, at 18:34