Threaded Assembly Performance Degradation #526
Comments
Your example misses [...]; I get the following:
|
Maybe completely wrong, but create_scratchvalues(K, f, dh) takes about 6 ms on a single core on my laptop, and it isn't threaded, so that should increase the time per thread. So with 8 threads about 0.05 s |
but only the pure assembly is timed. Everything else like ah well nvm, the creation is within the function |
If I move the buffer creation out of the function and instead pass it in, I get
|
Mr Amdahl says hello. |
I would argue this is not Amdahl's law. In this benchmark we have ~4000 entries per color and ~16 colors, so with 8 threads we should still not see this level of performance loss. I mean, the work per element is approximately constant and independent. Strong scaling is also not so great. With n = 10
With n = 20
Edit: To elaborate on the 8 threads: we are in 3D and I double the number of elements in each direction, so the load increases by ~2^3. Edit 2: Interestingly, here the frontend starts to stall. But I do not think this is enough to explain the performance drop. |
I think I was finally able to track down what is going wrong. The arithmetic intensity of the kernels in the current design of Ferrite is low, because we precompute a good amount of data. This implies that we hit memory limitations faster than expected - which explains the wildly varying scalings in the benchmarks on different systems reported above. We can observe this issue surfacing when monitoring the CPU utilization (e.g. only 235% for 16 threads on my system). To show that this is not Amdahl's law, I equipartitioned each color so each thread has one task, and I pinned the tasks. Further, we can see in LIKWID that the FLOPS collapse.
using Ferrite, SparseArrays, LIKWID
using Statistics # for mean() in the metric summary below
using ThreadPinning
pinthreads(:cores)
function create_colored_cantilever_grid(celltype, n)
grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
colors = create_coloring(grid)
return grid, colors
end;
# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
dh = DofHandler(grid)
push!(dh, :u, dim) # Add a displacement field
close!(dh)
end;
# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
E = 200e9
ν = 0.3
λ = E*ν / ((1+ν) * (1 - 2ν))
μ = E / (2(1+ν))
δ(i,j) = i == j ? 1.0 : 0.0
g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
C = SymmetricTensor{4, dim}(g);
return C
end;
# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
Ke::Matrix{T}
fe::Vector{T}
cellvalues::CV
facevalues::FV
global_dofs::Vector{Int}
ɛ::Vector{TT}
coordinates::Vector{Vec{dim, T}}
assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;
# Each thread needs its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
## Interpolations and values
interpolation_space = Lagrange{dim, refshape, 1}()
quadrature_rule = QuadratureRule{dim, refshape}(order)
face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
return cellvalues, facevalues
end;
# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
nthreads = Threads.nthreads()
assemblers = [start_assemble(K, f) for i in 1:nthreads]
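## Note (added for clarity): every assembler wraps the *same* K and f. This is safe because
## the coloring used in `doassemble` below guarantees that cells assembled concurrently never
## share dofs, so no two threads write to the same entries of K or f.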
cellvalues, facevalues = create_values(RefCube, dim, 2)
n_basefuncs = getnbasefunctions(cellvalues[1])
global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]
fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]
ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]
coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]
return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;
# ## Threaded assemble
# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, f::Vector{Float64}, scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
## Each color is safe to assemble threaded
for color in colors
## We try to equipartition the array to increase load per task.
chunk_size = max(1, 1 + length(color) ÷ Threads.nthreads())
color_partition = [color[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:length(color)]
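## For example (illustrative numbers, not from the original post): with ~4000 cells per color
## and 8 threads, chunk_size == 501, giving 8 chunks, so each task below processes one
## contiguous chunk of the color.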
## Now we should have a 1:1 correspondence between threads and chunks to assemble.
Threads.@threads :static for i in 1:length(color_partition)
for cellid ∈ color_partition[i]
assemble_cell!(scratches[i], cellid, K, grid, dh, C, b)
end
end
end
return K, f
end
# The cell assembly function is written the same way as if it were a single-threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}
## Unpack our stuff from the scratch
Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler
fill!(Ke, 0)
fill!(fe, 0)
n_basefuncs = getnbasefunctions(cellvalues)
## Fill up the coordinates
nodeids = grid.cells[cell].nodes
for j in 1:length(coordinates)
coordinates[j] = grid.nodes[nodeids[j]].x
end
reinit!(cellvalues, coordinates)
for q_point in 1:getnquadpoints(cellvalues)
for i in 1:n_basefuncs
ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
end
dΩ = getdetJdV(cellvalues, q_point)
for i in 1:n_basefuncs
δu = shape_value(cellvalues, q_point, i)
fe[i] += (δu ⋅ b) * dΩ
ɛC = ɛ[i] ⊡ C
for j in 1:n_basefuncs
Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
end
end
end
celldofs!(global_dofs, dh, cell)
assemble!(assembler, global_dofs, fe, Ke)
end;
function run_assemble()
refshape = RefCube
quadrature_order = 5
dim = 3
n = 30
grid, colors = create_colored_cantilever_grid(Hexahedron, n);
dh = create_dofhandler(grid);
K = create_sparsity_pattern(dh);
C = create_stiffness(Val{3}());
f = zeros(ndofs(dh))
scratches = create_scratchvalues(K, f, dh)
b = Vec{3}((0.0, 0.0, 0.0))
## compilation
doassemble(K, colors, grid, dh, C, f, scratches, b);
return @perfmon "FLOPS_DP" doassemble(K, colors, grid, dh, C, f, scratches, b);
end
metrics, events = run_assemble();
clocks = [res["Clock [MHz]"] for res in metrics["FLOPS_DP"]]
println("Clock [MHz] (min, avg, max): ", minimum(clocks), " | ", mean(clocks), " | " , maximum(clocks))
thread_times = [res["Runtime unhalted [s]"] for res in metrics["FLOPS_DP"]]
println("Runtime unhalted [s] (min, avg, max): ", minimum(thread_times), " | ", mean(thread_times), " | " , maximum(thread_times))
println("Total runtime [s] ", first([res["Runtime (RDTSC) [s]"] for res in metrics["FLOPS_DP"]])) |
I don't understand why precomputing/allocating a bunch of values (e.g. shape functions) would be worse for multithreaded code... but do you think a sort of "lazy" CellValues would be better for this case?
The drawback is that we have to recompute N and dN/dξ for each element, but perhaps we could get better threading performance |
What works best has to be investigated further. Also, I am not 100% confident that this is the full picture yet. Indeed, the common strategy to resolve such issues is to just recompute a bunch of values, or even go fully matrix-free and reevaluate everything every time. Please also note that for more expensive assembly this problem is not so pronounced anymore. The "problem" with precomputation (and one copy per thread) is just that you increase the number of accesses to main memory while having a low number of floating point operations in your assembly kernel. This way each assembly kernel puts some amount of pressure on your memory bus via the memory accesses. Your memory bandwidth is limited, and as you put more workers on the problem, with enough parallel workers this memory bandwidth becomes the bottleneck, because you cannot move the data from memory to the processor cores fast enough to keep them busy. It is quite a simplification, but I hope the point is clear - does that help? |
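A small follow-up sketch of how one could check this directly (my own addition, not part of the original comment): rerun the same doassemble call from run_assemble under LIKWID's memory group instead of the FLOPS group. If the measured bandwidth saturates near the machine's peak while the FLOPS stay low, the loop is memory bound rather than limited by serial sections. The exact group and metric names may differ between LIKWID versions.
## Hypothetical follow-up measurement, reusing K, colors, grid, dh, C, f, scratches, b from run_assemble()
metrics_mem, _ = @perfmon "MEM_DP" doassemble(K, colors, grid, dh, C, f, scratches, b)
bws = [res["Memory bandwidth [MBytes/s]"] for res in metrics_mem["MEM_DP"]]
println("Memory bandwidth [MBytes/s] per thread: ", bws)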
I played around a bit with this on the cluster login node, and I got much better scaling than on my laptop. The code is in the
( versioninfo
|
Thanks for taking the time Knut. I agree with you that the scaling is okay, but I would expect that we should be able to hit close to 16x for 16 threads here (or to be realistic around 14x). The problem is that modern CPU architectures scale up in core count, where the memory bandwidth limit can be quite severe. We will likely see worse performance hits than on our workstations. We can artificially drive up scaling with matrix-free techniques and utilize iterative solvers, but then the problem is that we do not have good preconditioners (because most state-of-the-art preconditioners need global matrix information). |
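To make the matrix-free idea concrete, a minimal schematic sketch (my own illustration, not Ferrite API): instead of assembling K once, the operator is applied to a vector by recomputing the element matrices on the fly, trading memory traffic for extra floating point work. Here dofs_of and Ke_of are hypothetical helpers returning the dof indices and the recomputed element matrix of element e.
function matrix_free_apply!(y::Vector{Float64}, x::Vector{Float64}, n_cells::Int, dofs_of, Ke_of)
    fill!(y, 0.0)
    for e in 1:n_cells
        dofs = dofs_of(e)          ## element dof indices
        Ke = Ke_of(e)              ## element matrix, recomputed instead of stored globally
        y[dofs] .+= Ke * x[dofs]   ## scatter the local contribution
    end
    return y
end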
These numbers should be compared to state-of-the-art linear solvers for the problem, imo. If no one can find a solver where our assembly speed is a bottleneck, then we are "done". Sure, there are matrix-free methods, but then that should actually be what is benchmarked. |
Thanks for the comment here Kristoffer. I fully agree with you that we should primarily target to benchmark similar problems, i.e. assembly. I dug a bit into literature and ended up on the WorkStream paper (dx.doi.org/10.1145/2851488) again. According to the analysis and experiments therein, we should be able to increase the scalability quite a bit by increasing data locality. I will try to reproduce the results from the paper in Ferrite (based on the previous work by Fredrik) if I find some free time. |
Hello all. This thread may be relevant: https://discourse.julialang.org/t/parallel-assembly-of-a-finite-element-sparse-matrix/95947 |
@termi-official : I would be curious to have your opinion on the data shown in the above thread. |
I answered in discourse. |
I have to redo the threading model in FerriteAssembly because I got errors due to multi-layered threading. Somehow, when I'm running the "dennis.jl" benchmark again now on the cluster on Julia 1.9, I get even better results, with the following speedups:
(The current implementation in the example stalls at 8 threads, no further improvements; but the scratches are created inside the timing.) I haven't checked if this is due to the new Julia version or other changes. (This was when submitting jobs to the cluster, which gave the most stable results.)
So we are here @termi-official :) But perhaps someone else should verify on their computer/cluster environment. It might be something strange with my benchmarks. |
Whoa, thanks for the update Knut! I will definitely check after GAMM. :) Edit: Forgot to answer here, I cannot reproduce Knut's results locally. |
Another thing which I had in mind was that it might be interesting to explore the usage of atomics in |
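As a rough illustration of that idea (a sketch of my own, not Ferrite's assemble! API, and assuming Atomix.jl for element-wise atomic updates): with atomic adds into the nonzero storage of the sparse matrix, the coloring could be dropped entirely, at the cost of atomic traffic on entries shared between neighboring elements.
using SparseArrays, Atomix
## Hypothetical scatter of a local Ke into a SparseMatrixCSC without coloring.
## Assumes the sparsity pattern already contains every (i, j) dof pair of the element.
function assemble_atomic!(K::SparseMatrixCSC, dofs::Vector{Int}, Ke::Matrix{Float64})
    rows = rowvals(K)
    vals = nonzeros(K)
    for (jl, j) in pairs(dofs)
        r = nzrange(K, j)                               ## nonzero slots of column j
        for (il, i) in pairs(dofs)
            k = r[searchsortedfirst(view(rows, r), i)]  ## position of entry (i, j) in nzval
            Atomix.@atomic vals[k] += Ke[il, jl]        ## race-free update
        end
    end
    return K
end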
I just found this one https://dl.acm.org/doi/full/10.1145/3503925 "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs" which I leave here for later reference. |
I cannot reproduce the original issue on Julia 1.10 with the code in #526 (comment) and with 1.10 I can reproduce the scaling you reported (14x speedup for 16 cores on a 16 core machine). 😅 |
@termi-official @KnutAM |
I think the easiest way is to create some empty project and add the code here #526 (comment) into some file (e.g. threaded-assembly-ferrite.jl). In this project you need to run ]add Ferrite#master, SparseArrays, LIKWID, ThreadPinning, and then you should be able to run the program via julia and collect the timings. Note that the scalability is still not great for larger systems, because the coloring approach is rather limited (as pointed out by Kristoffer). If you want a more scalable solution in Ferrite, then you might need to wait a bit longer, as I currently do not have the time to bring the scalable assembly PR over the finish line soon. However, the benefit will be a bit smaller on systems with a small number of cores. |
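For completeness, one way those steps could look (the project path, thread count and the added Statistics stdlib, needed for mean in the script above, are my own example choices):
julia --project=. --threads=16
pkg> add Ferrite#master SparseArrays LIKWID ThreadPinning Statistics
julia> include("threaded-assembly-ferrite.jl")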
Thanks. I was actually interested in your best approach reported above. Will that not work? Or is that the code in #526? |
But we do have two different colorings, where one is from a Bangerth paper, implemented by @fredrikekre |
Sorry, I do not follow. If I want to reproduce your results discussed in this thread (where you say you are happy with the speedups), what do I do? Do I follow this recipe #526 (comment)? |
Thanks, I was just a bit unsure what the above meant. |
Do you remember the version it used to work with? I am getting these errors which seem to indicate that the Ferrite version I have is too far ahead:
|
I got it running with 0.3.11. Not sure if that's the right thing to do. |
Ah, sorry. Indeed I have a local modification to let the example run on master.
10.540680736238489s/0.750866197848292s = 14x |
Cool, I will check it out. Thanks, @termi-official ! |
I ran the simulations on Mac 2 Ultra (no thread pinning).
Results with 1, 2, 4, 8, 16 threads:
How did you guys determine the serial execution time? I took the serial execution time from the result with a single thread. However, that is not quite right: the serial execution does not need the scratches, for instance. It doesn't need the colors either. So if the serial execution time goes down, the parallel efficiency will go down, by necessity. |
Not sure how relevant it is, but I don't count the difference in setup time (e.g. creating the coloring and scratches), since the assembly is usually performed so many times.
|
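To make the two conventions explicit, a tiny illustration with made-up numbers (only the formulas matter): speedup is T_ref / T_p and parallel efficiency is speedup / p, and the result depends on whether T_ref is the 1-thread run of the threaded code or a dedicated serial implementation without colors and scratches.
t_serial   = 9.8    ## hypothetical dedicated serial assembly [s]
t_1thread  = 10.5   ## hypothetical threaded code on 1 thread [s]
t_16thread = 0.75   ## hypothetical threaded code on 16 threads [s]
p = 16
println("self-referenced:   speedup = ", round(t_1thread / t_16thread, digits = 1),
        "x, efficiency = ", round(100 * t_1thread / (p * t_16thread), digits = 0), " %")
println("serial-referenced: speedup = ", round(t_serial / t_16thread, digits = 1),
        "x, efficiency = ", round(100 * t_serial / (p * t_16thread), digits = 0), " %")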
Doesn't thread pinning give you some extra performance?
Since I was interested in scalability here, I took one thread as reference. But it should not matter, as only the assembly loop is measured. For the total program runtime the efficiency will indeed change quite a bit, also because everything but the assembly and solve is serial.
Not available on the mac. |
FinEtools threaded assembly. n=50.
The code:
I think it would be interesting if you were to run this on your machine. I believe you have a Linux system with 64 cores? On the mac 2 ultra I got the speedups
which isn't that great for 16 threads. |
How different is it from the code posted above here (#526 (comment))? |
The main differences are using The loop structure can be seen here, and the heuristic creation of chunks from Except for the |
Comparison of performance using Ferrite and FinEtools for the cantilever problem. Mac 2 Ultra, Julia 1.10. Mesh with n=50.
Mesh with n=90.
|
I tried adding the masters:
Unfortunately, I am getting an error: What are the versions I should use? |
pkg> add Ferrite#master
pkg> add https://github.com/KnutAM/MaterialModelsBase.jl.git
pkg> add https://github.com/KnutAM/FerriteAssembly.jl.git
Should hopefully work |
Goody, I just had those two switched. Now it worked. |
Impressive speedups, @KnutAM !
Sorry: I made an error setting up the grid. Rerunning now.
Overall slower than #526 (comment), but excellent speedup ( |
Thanks for trying it out @PetrKryslUCSD! High performance on FinEtools - nicely done! I'll try to run that later! |
This is my script:
|
The newest FinEtools script:
|
@termi-official and @KnutAM : Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Let me know your thoughts please at [email protected]. I look forward to it. P |
Sorry, I just noticed this. No, this should be equivalent to Ferrite create sparsity pattern + assembly. |
The short answer is: No, there is not even incremental research happening right now. We are merely trying to reproduce a subset of the results from the WorkStream paper and investigate bottlenecks in our implementation (since our implementation underperformed for some reason I cannot fully explain when I opened the thread). There is nothing novel happening here which is not already described in literature about multithreaded assembly. And I think with the recent work from the CEED the most important scalability problems for multithreaded assembly are solved anyway (which I currently try to reproduce). But thanks for the offer @PetrKryslUCSD , I appreciate it! |
Understood. Thanks. Could you by any chance point to the work from CEED? Thanks. |
I think a good start is https://doi.org/10.1016/j.parco.2021.102841 . A more exhaustive list should be here https://ceed.exascaleproject.org/pubs/ . |
An additional data point: FinEtools assembly only, 64-core Opteron machine with 1, 2, 4, 8, 16, 64 threads:
|
I must be missing something. The paper you linked does not talk about threading (unless you count GPU computing as that). Did you have in mind a different paper?
@termi-official Ping... |
GPU parallelism is basically thread parallelism. The paper gives an overview with quite a few references where you can dive deeper. Also see e.g. Fig. 7 & 8 for some benchmarks where throughput is measured, which can serve as a proxy for scalability.
I think their solution is really not to build a matrix at all. So, good, but not a silver bullet... |
Currently threaded assembly does not scale to more than 3 cores on any machine I tried and I cannot figure out why. For the measurement I have modified threaded_assembly.jl to also utilize LinuxPerf.jl.
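A sketch of what such a modification could look like (my own reconstruction, not the actual modified script; the event list is an example, and it assumes LinuxPerf.jl's @pstats macro plus the setup variables from the example):
using LinuxPerf
## Wrap only the assembly loop in hardware counters; a high cache-miss rate together with a
## low instructions-per-cycle value would point towards a memory-bound kernel.
stats = @pstats "cpu-cycles,instructions,cache-references,cache-misses" begin
    doassemble(K, colors, grid, dh, C, f, scratches, b)
end
display(stats)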
Here are some measurements on a machine with 16 (32) threads
Eliminating the calls to assemble!, reinit! and shape_* does not increase scalability. Also, increasing the work load by replacing the linear problem with the assembly from the hyperelastic (i.e. NeoHookean) example does not significantly increase scalability. Happy for any suggestions what possible points of failure could be.
Source Code
TODOs