Linear solve in Float32 #196
Comments
Interestingly, in StdLib it works almost by accident: it's *not* converting the RHS to Float32 or the LHS to Float64 and calling BLAS for the triangular solves, but rather falling back to the generic naivesub! routine.
This is a bit annoying to replicate as we do not yet have a Julia-native BandedLU. We do have a Julia-native QR though, and that works:
julia> qr(P) \ b
8-element Array{Float64,1}:
 -2.3678433850975633
  1.7256988915603115
  3.4832536067367204
  0.4916213305479896
  0.5015980815362554
 -8.826729518502406
  8.672489866500378
 13.338910315594969
Some options, from smallest change to biggest:
1. Add an overload
function ldiv!(A::BandedLU{T}, b::AbstractVector{V}) where {T,V}
    TV = promote_type(T, V)
    ldiv!(convert(BandedLU{TV}, A), convert(AbstractVector{TV}, b))
end
This will unfortunately allocate.
2. Make qr the default factorisation for banded matrices. This will slow down \, though if I remember correctly, which one is faster actually depends on the bandwidth.
3. Write a Julia-native BandedLU ldiv!. This is easier than it sounds, as one just needs to work out the pivoting: we already have Julia-native banded triangular solves, and in this case we only need to implement the solve, not the computation of the factorisation. But it requires more effort than I'm willing to put in right now. |
Thanks for following up. I think, at least for my application, it is OK to demote the right-hand side and return a Float32 result, which will do the right thing when I add it to a Float64. That way I will only allocate when I do Float32.(b). I can do this as a hack job in my own application (Newton solver).
So I'd do A\Float32.(b) and not have to mess with the Float32 banded matrix.
Your code, by the way, is faster and allocates far less than using SparseArrays and SuiteSparse. I was going to try to use the LAPACK band solvers until I noticed that you'd already done it.
— Tim
|
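Concretely, the workaround Tim describes might look like the following sketch (the matrix construction and the Newton-update line are placeholders, not his actual code):
using LinearAlgebra, BandedMatrices

A = BandedMatrix{Float32}(rand(100, 100) + 10I, (2, 2))  # Float32 banded system
b = rand(100)                                            # Float64 right-hand side

x32 = A \ Float32.(b)    # demote the RHS once; the solve runs entirely in Float32
x   = rand(100) .+ x32   # adding to a Float64 vector promotes the result to Float64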
👍 I think you are right; a more sensible definition is
function ldiv!(A::BandedLU{T}, b::AbstractVector) where T
    c = ldiv!(A, convert(AbstractVector{T}, b))
    copyto!(b, c)
end |
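For illustration, the effect of that definition, shown with a plain dense LU so it runs standalone (a BandedLU would behave analogously under the method above):
using LinearAlgebra

F32 = lu(rand(Float32, 4, 4) + 10I)                    # Float32 factorization
b   = rand(4)                                          # Float64 right-hand side
c   = ldiv!(F32, convert(AbstractVector{Float32}, b))  # solve in Float32
copyto!(b, c)                                          # write the result back into b
Note that b is mutated: the answer computed in Float32 is stored back into the Float64 vector, consistent with ldiv!'s in-place contract.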
Good to hear. If you aren't already, I recommend using MKL: many of its banded implementations are 4x faster than OpenBLAS, last I checked. |
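For anyone wanting to try that, a sketch assuming the MKL.jl package. On recent Julia versions (1.7 and later) loading it swaps the BLAS/LAPACK backend at runtime; on Julia 1.5, as used in this thread, MKL had to be compiled into Julia instead.
using MKL            # load before doing linear algebra
using LinearAlgebra
BLAS.get_config()    # on Julia ≥ 1.7, should now report MKL rather than OpenBLAS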
How/where do I put
ldiv!(A::BandedLU{T}, b::AbstractVector) = ldiv!(convert(BandedLU{T}, A), convert(AbstractVector{T}, b))
in a place where my users don't have to know about it? I tested it in the REPL and got a complaint. I can't see any submodules in your source, so am stuck.
julia> import BandedMatrices

julia> ldiv!(A::BandedLU{T}, b::AbstractVector) = ldiv!(convert(BandedLU{T}, A), convert(AbstractVector{T}, b))
ERROR: UndefVarError: BandedLU not defined
Stacktrace:
 [1] top-level scope at REPL[5]:1
I'd love to use MKL, but want to test my stuff in the environment most other people use, i.e. OpenBLAS.
— Tim
|
The best is to make a PR and add it to BandedLU.jl, but if you want to do it on the REPL this should work:
julia> using BandedMatrices

julia> import BandedMatrices: BandedLU

julia> LinearAlgebra.ldiv!(A::BandedLU{T}, b::AbstractVector) where T = copyto!(b, ldiv!(A, convert(AbstractVector{T}, b)))

julia> P = BandedMatrix{Float32}(rand(8,8), (2,2));

julia> b = rand(8);

julia> P \ b
8-element Array{Float64,1}:
  0.731564998626709
  0.7191325426101685
  0.6498026847839355
 -2.1065616607666016
  1.2697333097457886
  0.6517024040222168
 -0.6199814677238464
  0.9412044882774353 |
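To answer the earlier question about hiding this from end users: the overload can live in your own module, so it is defined whenever your package is loaded. A sketch (the module name is illustrative):
module MyNewtonTools

using LinearAlgebra, BandedMatrices
import BandedMatrices: BandedLU

# Mixed-precision banded solve: demote the RHS, solve, copy the result back.
# This extends a method on types the module doesn't own ("type piracy"),
# so it is best treated as a stopgap until BandedMatrices.jl adds it.
LinearAlgebra.ldiv!(A::BandedLU{T}, b::AbstractVector) where T =
    copyto!(b, ldiv!(A, convert(AbstractVector{T}, b)))

end # module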
Thanks. I do not have the confidence or the skill set to do a PR for this code.
I put what you gave me in the module and all is well. Please let me know if this ever makes it into BandedMatrices.jl so I can clean up the module.
— Tim
|
I've been using qr!, and with larger problems I'm getting killed by allocations in the solve phase. Here's an example. The timings and allocations are very consistent over several trials.
The allocation burden is much better if I convert b to Float32, but then I gain nothing in compute time over double precision.
|
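A sketch of the kind of comparison being described above (sizes and bandwidths are illustrative):
using LinearAlgebra, BandedMatrices, BenchmarkTools

n   = 2_000
F32 = qr(BandedMatrix{Float32}(rand(n, n), (2, 2)))
b   = rand(n)                 # Float64 right-hand side

@btime $F32 \ $b;             # mixed precision: heavy per-solve allocation
@btime $F32 \ Float32.($b);   # demoted RHS: far fewer allocations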
I think there are reasonable explanations:
- Mixing precisions uses a generic fallback.
- Narrow bandwidths interfere with maximizing throughput (i.e. getting close to peak flops) because SIMD might not be as efficient or fully invoked. To put it another way, the 32- and 64-bit methods might be moving nearly the same number of registers.
|
However, for tridiagonal matrices, there is no problem.
|
That is, the LAPACK tridiagonal solvers don't allocate more depending on the precision of the right side, and you see the performance you'd expect. There is more than narrow bandwidth in play here.
|
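For reference, a sketch of the tridiagonal case being described, using the standard library's Tridiagonal type (sizes illustrative):
using LinearAlgebra

n   = 10_000
T32 = Tridiagonal(rand(Float32, n - 1), rand(Float32, n) .+ 2f0, rand(Float32, n - 1))
b   = rand(n)     # Float64 right-hand side

F32 = lu(T32)
x   = F32 \ b     # mixed-precision tridiagonal solve, no unexpected allocation burden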
As far as I know, LAPACK doesn't mix precisions. I guess you need to follow the stack trace to see what's really going on. |
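One way to follow the dispatch from the REPL (a sketch):
using LinearAlgebra, BandedMatrices
using InteractiveUtils   # provides @which (loaded automatically in the REPL)

F = qr(BandedMatrix{Float32}(rand(8, 8), (2, 2)))
b = rand(8)
@which F \ b   # reports which \ method is selected, with its file and line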
For starters, the factorization itself behaves as I'd expect: half the memory but nearly the same time.
julia> using BenchmarkTools

julia> @btime qr(A);
  72.829 ms (6 allocations: 76.29 MiB)

julia> @btime qr(A32);
  65.758 ms (6 allocations: 38.15 MiB)
|
Yes. It's the solve phase that's trouble.
|
LAPACK does it in both MATLAB and Julia, and the timings/allocations are what one would expect. |
This has nothing to do with BandedMatrices.jl: StdLib/LinearAlgebra.jl converts the factorization to the higher precision:
In \(A, B) at /Users/sheehanolver/Projects/julia-1.5/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/qr.jl:870
870 function (\)(A::Union{QR{TA},QRCompactWY{TA},QRPivoted{TA}}, B::AbstractVecOrMat{TB}) where {TA,TB}
871     require_one_based_indexing(B)
872     S = promote_type(TA,TB)
873     m, n = size(A)
874     m == size(B,1) || throw(DimensionMismatch("Both inputs should have the same number of rows"))
875
876     AA = Factorization{S}(A) |
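The consequence, sketched for the dense case: with a Float32 factorization and a Float64 right-hand side, line 876 rebuilds the whole factorization at the promoted precision on every solve, which accounts for the per-solve allocations.
using LinearAlgebra

F32 = qr!(rand(Float32, 1000, 1000))
# What line 876 does internally before each mixed-precision solve:
F64 = Factorization{Float64}(F32)   # allocates a full Float64 copy of the factorization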
This does not seem to be happening in the dense case. What am I missing?
julia> A = rand(1000,1000);

julia> A32 = Float32.(A);

julia> AF = qr!(A);

julia> AF32 = qr!(A32);

julia> b = rand(1000);

julia> @time c = AF\b;
  0.196028 seconds (548.89 k allocations: 35.333 MiB, 8.81% gc time)

julia> @time c = AF\b;
  0.003345 seconds (6 allocations: 7.645 MiB)

julia> @time d = AF32\b;
  0.014593 seconds (10 allocations: 15.549 MiB, 34.44% gc time)

julia> @time d = AF32\b;
|
Oops. It seems that it's a problem in the dense case as well. It would seem that one should not have to cast b to Float32 to get this to work. That was certainly not the case with LINPACK, and I doubt it is with LAPACK. I will ask on Discourse about this. |
Hi,
Using LinearAlgebra, if A is dense and Float32 and b is a Float64 vector, A\b returns a Float64 result. However, if A is a BandedMatrix, A\b fails if b is Float64. All is well if b is Float32. I don't think I understand what you're doing well enough to formulate a useful PR.
Here is an example:
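Based on the description, the example would have been along these lines (a sketch):
using LinearAlgebra, BandedMatrices

A = rand(Float32, 8, 8)
B = BandedMatrix{Float32}(rand(8, 8), (2, 2))
b = rand(8)          # Float64

A \ b                # dense: works, returns a Float64 result
B \ b                # banded: errored at the time this issue was filed
B \ Float32.(b)      # banded: works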