
Reduce generic matrix*vector latency #56289

Merged
jishnub merged 4 commits into master from jishnub/matvecmul on Oct 23, 2024
Conversation

jishnub (Contributor) commented Oct 22, 2024

julia> using LinearAlgebra

julia> A = rand(Int,4,4); x = rand(Int,4); y = similar(x);

julia> @time mul!(y, A, x, 2, 2);
  0.330489 seconds (792.22 k allocations: 41.519 MiB, 8.75% gc time, 99.99% compilation time) # master
  0.134212 seconds (339.89 k allocations: 17.103 MiB, 15.23% gc time, 99.98% compilation time) # This PR
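
For context, mul!(y, A, x, 2, 2) is the five-argument multiply-add method, equivalent here to y .= 2 .* (A * x) .+ 2 .* y computed in place; the reported time is almost entirely compilation, i.e. the first-call latency this PR targets.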

Main changes:

  • generic_matvecmul! and _generic_matvecmul! now accept alpha and beta arguments instead of a MulAddMul(alpha, beta) object. The methods that accept a MulAddMul are retained for backward compatibility, but they now forward alpha and beta, rather than the forwarding going the other way around (see the sketch after this list).
  • Narrow the scope of the @stable_muladdmul applications: the MulAddMul(alpha, beta) object is constructed only where it is needed in a function call, and that call site is annotated with @stable_muladdmul. This keeps the generated branches small.
  • Add a new internal function with methods for the 'N', 'T' and 'C' cases, so that, firstly, there is less code duplication, and secondly, the _generic_matvecmul! method becomes simple enough for constant propagation to succeed. This eliminates the unnecessary branches: only the branch that is actually taken is compiled.

Together, these changes substantially reduce the TTFX (time to first execution).
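
For illustration, a minimal sketch of the forwarding pattern from the first bullet, with simplified signatures (the actual methods in LinearAlgebra take more arguments, and _generic_matvecmul! stands in for the inner kernel):

# Sketch only: the alpha/beta methods are now the primary ones, and the
# legacy MulAddMul-accepting methods forward to them; previously the
# forwarding went the other way.
using LinearAlgebra: MulAddMul  # internal wrapper holding alpha and beta

generic_matvecmul!(y, tA, A, x, alpha::Number, beta::Number) =
    _generic_matvecmul!(y, tA, A, x, alpha, beta)

# Backward-compatible entry point: unwrap the MulAddMul and forward.
generic_matvecmul!(y, tA, A, x, add::MulAddMul) =
    generic_matvecmul!(y, tA, A, x, add.alpha, add.beta)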

@jishnub added the "linear algebra" and "compiler:latency" labels on Oct 22, 2024
dkarrasch (Member) commented:

What's the effect on runtime, as this seems to introduce branches in "hot loops"?

jishnub (Contributor, author) commented Oct 23, 2024

There's hardly any impact: firstly, the branches are not in the innermost loop, and secondly, they are probably hoisted out of the loop anyway.

julia> using LinearAlgebra

julia> x = rand(Int,3000); A = rand(Int,size(x,1),size(x,1)); y = similar(x);

julia> @btime mul!($y, $A, $x, 2, 2);
  4.239 ms (0 allocations: 0 bytes) # nightly
  4.143 ms (0 allocations: 0 bytes) # This PR

julia> @btime mul!($y, $A', $x, 2, 2);
  4.115 ms (0 allocations: 0 bytes) # nightly
  4.187 ms (0 allocations: 0 bytes) # This PR
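
For intuition, here is a minimal sketch (not the PR's actual code) of the structure being described: the branch on beta runs once, before the hot loops, so the innermost loop remains branch-free.

# Computes y = alpha*A*x + beta*y for a plain column-major matvec.
function matvec_sketch!(y, A, x, alpha, beta)
    # Handle beta up front: this branch executes once per call,
    # not per element.
    if iszero(beta)
        fill!(y, zero(eltype(y)))
    elseif !isone(beta)
        y .*= beta
    end
    # Hot loops: no alpha/beta branching inside.
    for j in axes(A, 2)
        xj = x[j] * alpha
        @inbounds @simd for i in axes(A, 1)
            y[i] += A[i, j] * xj
        end
    end
    return y
end

Whether the compiler additionally hoists any remaining checks is an optimization detail; the benchmarks above show the net effect is negligible.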

dkarrasch (Member) commented:

Just to be safe:

@nanosoldier runbenchmarks("linalg", vs = ":master")

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

dkarrasch (Member) commented:

The performance regressions must be "fake", right? Those benchmarks don't even hit the generic methods, so this LGTM.

jishnub (Contributor, author) commented Oct 23, 2024

Yes, these seem spurious.

jishnub merged commit b9b4dfa into master on Oct 23, 2024
10 checks passed
jishnub deleted the jishnub/matvecmul branch on October 23, 2024, at 18:34