Threaded Assembly Performance Degradation #526

termi-official · 2022-10-20T13:18:21Z

Currently threaded assembly does not scale to more than 3 cores on any machine I tried and I cannot figure out why. For the measurement I have modified threaded_assembly.jl to also utilize LinuxPerf.jl.

Here some measurements on a machine with 16 (32) threads

~/Tools/julia-1.8.2/bin/julia --threads 2 --project src/literate/threaded_assembly.jl                               ✔  22s  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294685)
┌ cpu-cycles               3.83e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.42e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.79e+09  100.0%  # 72.7% of cycles
┌ instructions             7.18e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.41e+08  100.0%  #  6.1% of insns
└ branch-misses            2.61e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.33e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              9.96e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294695)
┌ cpu-cycles               3.79e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.30e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.74e+09  100.0%  # 72.4% of cycles
┌ instructions             7.16e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.39e+08  100.0%  #  6.1% of insns
└ branch-misses            2.54e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.31e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.84e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.62e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  4.73e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   5.53e+09  100.0%  # 72.5% of cycles
┌ instructions             1.43e+10  100.0%  #  1.9 insns per cycle
│ branch-instructions      8.80e+08  100.0%  #  6.1% of insns
└ branch-misses            5.15e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.64e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.18e+03  100.0%
                  aggregated from 2 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
1.353969 seconds (28.03 k allocations: 3.264 MiB)

 ~/Tools/julia-1.8.2/bin/julia --threads 4 --project src/literate/threaded_assembly.jl                               ✔  24s  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294544)
┌ cpu-cycles               2.00e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.21e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.46e+09  100.0%  # 73.3% of cycles
┌ instructions             3.64e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.27e+08  100.0%  #  6.2% of insns
└ branch-misses            1.33e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.94e+08  100.0%  # 694.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.19e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294546)
┌ cpu-cycles               1.92e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.09e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.41e+09  100.0%  # 73.2% of cycles
┌ instructions             3.58e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.65e+08  100.0%  # 665.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294547)
┌ cpu-cycles               1.97e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.62e+07  100.0%  #  0.8% of cycles
└ stalled-cycles-backend   1.43e+09  100.0%  # 73.0% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.25e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.83e+08  100.0%  # 682.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294548)
┌ cpu-cycles               1.98e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.22e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.47e+09  100.0%  # 73.9% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.19e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.89e+08  100.0%  # 688.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.87e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.13e+07  100.0%  #  0.7% of cycles
└ stalled-cycles-backend   5.77e+09  100.0%  # 73.3% of cycles
┌ instructions             1.44e+10  100.0%  #  1.8 insns per cycle
│ branch-instructions      8.86e+08  100.0%  #  6.2% of insns
└ branch-misses            5.07e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.73e+09  100.0%  #  2.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.20e+03  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.789631 seconds (55.97 k allocations: 4.494 MiB)

  ~/Tools/julia-1.8.2/bin/julia --threads 8 --project src/literate/threaded_assembly.jl                                          ✔ 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294295)
┌ cpu-cycles               1.46e+09  100.0%  #  2.7 cycles per ns
│ stalled-cycles-frontend  5.46e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   1.16e+09  100.0%  # 79.5% of cycles
┌ instructions             1.91e+09  100.0%  #  1.3 insns per cycle
│ branch-instructions      1.28e+08  100.0%  #  6.7% of insns
└ branch-misses            8.05e+05  100.0%  #  0.6% of branch insns
┌ task-clock               5.31e+08  100.0%  # 530.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.30e+04  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294297)
┌ cpu-cycles               1.20e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.26e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.3% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.62e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 419.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.73e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294298)
┌ cpu-cycles               1.21e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.30e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.35e+08  100.0%  # 77.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.60e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 418.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.72e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294299)
┌ cpu-cycles               1.24e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  6.68e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.61e+08  100.0%  # 77.4% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.76e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.35e+08  100.0%  # 435.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.95e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 294300)
┌ cpu-cycles               1.19e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.29e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.7% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.59e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.17e+08  100.0%  # 417.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.59e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 294301)
┌ cpu-cycles               1.27e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  5.86e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.99e+08  100.0%  # 79.0% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.52e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.44e+08  100.0%  # 444.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.32e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 294302)
┌ cpu-cycles               1.28e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.34e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.01e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.57e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.43e+08  100.0%  # 443.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.65e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 294303)
┌ cpu-cycles               1.31e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.40e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.04e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.63e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.56e+08  100.0%  # 455.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.43e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               1.01e+10  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  4.86e+07  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   7.95e+09  100.0%  # 78.4% of cycles
┌ instructions             1.44e+10  100.0%  #  1.4 insns per cycle
│ branch-instructions      8.96e+08  100.0%  #  6.2% of insns
└ branch-misses            5.43e+06  100.0%  #  0.6% of branch insns
┌ task-clock               3.56e+09  100.0%  #  3.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.62e+04  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.609788 seconds (111.85 k allocations: 6.954 MiB)

Eliminating the calls to assemble!, reinit! and shape_* does not increase scalability. Also increasing the work load by replacing the linear problem with the assembly from the hyperelastic (i.e. NeoHookean) example does not significantly increase scalability.

Happy for any suggestions what possible points of failure could be.

Source Code

using Ferrite, SparseArrays
using LinuxPerf

function create_example_2d_grid()
    grid = generate_grid(Quadrilateral, (10, 10), Vec{2}((0.0, 0.0)), Vec{2}((10.0, 10.0)))
    colors_workstream = create_coloring(grid; alg=ColoringAlgorithm.WorkStream)
    colors_greedy = create_coloring(grid; alg=ColoringAlgorithm.Greedy)
    vtk_grid("colored", grid) do vtk
        vtk_cell_data_colors(vtk, colors_workstream, "workstream-coloring")
        vtk_cell_data_colors(vtk, colors_greedy, "greedy-coloring")
    end
end

create_example_2d_grid();

# ![](coloring.png)
#
# *Figure 1*: Element coloring using the "workstream"-algorithm (left) and the "greedy"-
# algorithm (right).

# ## Cantilever beam in 3D with threaded assembly
# We will now look at an example where we assemble the stiffness matrix using multiple
# threads. We set up a simple grid and create a coloring, then create a DofHandler,
# and define the material stiffness

# #### Grid for the beam
function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{dim, refshape, 1}()
    quadrature_rule = QuadratureRule{dim, refshape}(order)
    face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
    cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefCube, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble

# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}

    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0)) # Body force

    for color in colors
        ## Each color is safe to assemble threaded
        Threads.@threads for i in 1:length(color)
            assemble_cell!(scratches[Threads.threadid()], color[i], K, grid, dh, C, b)
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefCube
    quadrature_order = 2
    dim = 3
    n = 20
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    ## compilation
    doassemble(K, colors, grid, dh, C);
    stats = @pstats doassemble(K, colors, grid, dh, C);
    LinuxPerf.printsummary(stats, expandthreads = true)
    b = @elapsed @time K, f = doassemble(K, colors, grid, dh, C);
    return b
end

TODOs

revisit TODO improve threading performance example #232
revisit special FunctionValues for the shared memory case

The text was updated successfully, but these errors were encountered:

koehlerson · 2022-10-20T14:22:13Z

your example misses run_assemble() at the bottom and I would be very interested in this as well

I get the following:

julia --threads 2 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135838)
╶ cpu-cycles               2.56e+09  100.0%  #  4.2 cycles per ns
┌ instructions             7.19e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      4.37e+08  100.0%  #  6.1% of insns
└ branch-misses            1.20e+06  100.0%  #  0.3% of branch insns
┌ task-clock               6.13e+08  100.0%  # 612.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.03e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135840)
╶ cpu-cycles               3.99e+09    7.1%  #  4.1 cycles per ns
┌ instructions             1.14e+10    7.1%  #  2.9 insns per cycle
│ branch-instructions      6.95e+08    7.1%  #  6.1% of insns
└ branch-misses            2.06e+06    7.1%  #  0.3% of branch insns
┌ task-clock               9.68e+08  100.0%  # 967.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.58e+09   43.1%  #  4.2 cycles per ns
┌ instructions             1.85e+10   43.1%  #  2.8 insns per cycle
│ branch-instructions      1.13e+09   43.1%  #  6.1% of insns
└ branch-misses            3.12e+06   43.1%  #  0.3% of branch insns
┌ task-clock               1.58e+09  100.0%  #  1.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.03e+02  100.0%
                  aggregated from 2 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1.036346 seconds (28.03 k allocations: 3.264 MiB)

julia --threads 4 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135863)
╶ cpu-cycles               1.45e+09  100.0%  #  3.9 cycles per ns
┌ instructions             3.61e+09  100.0%  #  2.5 insns per cycle
│ branch-instructions      2.21e+08  100.0%  #  6.1% of insns
└ branch-misses            6.24e+05  100.0%  #  0.3% of branch insns
┌ task-clock               3.69e+08  100.0%  # 369.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.64e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135865)
╶ cpu-cycles               1.24e+09  100.0%  #  3.9 cycles per ns
┌ instructions             3.60e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            5.79e+05  100.0%  #  0.3% of branch insns
┌ task-clock               3.14e+08  100.0%  # 314.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 135866)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               5.13e+08  100.0%  # 513.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 135867)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               5.15e+08  100.0%  # 515.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.74e+09   39.9%  #  3.9 cycles per ns
┌ instructions             1.81e+10   39.9%  #  2.7 insns per cycle
│ branch-instructions      1.10e+09   39.9%  #  6.1% of insns
└ branch-misses            3.02e+06   39.9%  #  0.3% of branch insns
┌ task-clock               1.71e+09  100.0%  #  1.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.65e+02  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.567853 seconds (55.97 k allocations: 4.494 MiB)

julia --threads 8 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135887)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               4.01e+08  100.0%  # 401.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.39e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135889)
╶ cpu-cycles               6.37e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.27e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 135890)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.78e+08  100.0%  # 277.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 135891)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.75e+08  100.0%  # 275.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 135892)
╶ cpu-cycles               6.37e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.31e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 135893)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.78e+08  100.0%  # 277.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 135894)
╶ cpu-cycles               6.36e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.24e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 135895)
╶ cpu-cycles               6.14e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            3.22e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.77e+08  100.0%  # 177.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.80e+09   37.2%  #  3.5 cycles per ns
┌ instructions             1.94e+10   37.2%  #  2.8 insns per cycle
│ branch-instructions      1.18e+09   37.2%  #  6.1% of insns
└ branch-misses            3.51e+06   37.2%  #  0.3% of branch insns
┌ task-clock               1.96e+09  100.0%  #  2.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.41e+02  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.418080 seconds (111.85 k allocations: 6.954 MiB)

KnutAM · 2022-10-20T16:23:31Z

Maybe completely wrong, but create_scratchvalues(K, f, dh) takes about 6 ms on a single core on my laptop, and it isn't threaded so that should increase time per thread. So 8 threads about 0.05 s
Calling assemble_cell! takes me 12.3 us => * numcells=0.98s and full run is 1.35 s.
So perfect scaling for the loop would be 0.37 + 0.006*nthreads + 0.98/nthreads
"1: 1.356"
"2: 0.872"
"3: 0.7147"
"4: 0.639"
"8: 0.5405"

koehlerson · 2022-10-20T16:26:33Z

but only the pure assembly is timed. Everything else like creaate_scratchvalues shouldn't been taken into considerations, no?

ah well nvm, the creation is within the function

koehlerson · 2022-10-20T16:35:01Z

If I remove the buffer creation out of the function and instead pass it I get

max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 1 --project src/literate/threaded_assembly.jl                       10s 18:33:52
  1.191017 seconds (108 allocations: 17.719 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 2 --project src/literate/threaded_assembly.jl                     1m 8s 18:33:06
  0.624988 seconds (250 allocations: 35.938 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 4 --project src/literate/threaded_assembly.jl                       11s 18:33:21
  0.526636 seconds (471 allocations: 70.688 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 8 --project src/literate/threaded_assembly.jl                       11s 18:33:36
  0.300301 seconds (923 allocations: 140.500 KiB)

KristofferC · 2022-10-20T16:43:43Z

Mr Amdahl says hello.

koehlerson · 2022-10-20T17:22:24Z

Mr Amdahl says hello.

termi-official · 2022-10-20T17:23:02Z

I would argue this is not Ahmdahl's law. In this benchmark we have ~4000 entries per color and ~16 colors, so with 8 threads we should still not see this level of performance loss. I mean, the work per element is approximately constant and independent. Also, strong scaling is also not so great.
Interestingly I also have some benchmarks where I fully decouple the assembly emulating a partition of the problem like in distributed problems and observed also scaling issues beyond 4 threads.

With n = 10

 ~/Tools/julia-1.8.2/bin/julia --threads 1 --project src/literate/threaded_assembly.jl                               ✔  21s  
DoFs: 36663
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 328117)
┌ cpu-cycles               8.70e+08  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  1.40e+06  100.0%  #  0.2% of cycles
└ stalled-cycles-backend   6.64e+08  100.0%  # 76.3% of cycles
┌ instructions             1.79e+09  100.0%  #  2.1 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.22e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.13e+08  100.0%  # 312.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.47e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
┌ cpu-cycles               8.70e+08  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  1.40e+06  100.0%  #  0.2% of cycles
└ stalled-cycles-backend   6.64e+08  100.0%  # 76.3% of cycles
┌ instructions             1.79e+09  100.0%  #  2.1 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.22e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.13e+08  100.0%  # 312.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.47e+02  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.300560 seconds (14.03 k allocations: 919.344 KiB)

With n = 20

 ~/Tools/julia-1.8.2/bin/julia --threads 8 --project src/literate/threaded_assembly.jl                               ✔  22s  
DoFs: 265923
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 327996)
┌ cpu-cycles               1.50e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.02e+08  100.0%  #  6.8% of cycles
└ stalled-cycles-backend   1.09e+09  100.0%  # 72.3% of cycles
┌ instructions             1.92e+09  100.0%  #  1.3 insns per cycle
│ branch-instructions      1.31e+08  100.0%  #  6.8% of insns
└ branch-misses            6.81e+05  100.0%  #  0.5% of branch insns
┌ task-clock               5.23e+08  100.0%  # 523.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.32e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 327998)
┌ cpu-cycles               1.03e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.28e+07  100.0%  #  9.0% of cycles
└ stalled-cycles-backend   6.69e+08  100.0%  # 64.6% of cycles
┌ instructions             1.78e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.06e+08  100.0%  #  6.0% of insns
└ branch-misses            6.45e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.60e+08  100.0%  # 359.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 327999)
┌ cpu-cycles               1.14e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.28e+08  100.0%  # 11.2% of cycles
└ stalled-cycles-backend   7.78e+08  100.0%  # 68.0% of cycles
┌ instructions             1.79e+09  100.0%  #  1.6 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.96e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.97e+08  100.0%  # 396.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 328000)
┌ cpu-cycles               1.17e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.30e+08  100.0%  # 11.1% of cycles
└ stalled-cycles-backend   8.04e+08  100.0%  # 68.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.00e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.05e+08  100.0%  # 405.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 328001)
┌ cpu-cycles               1.08e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.13e+07  100.0%  #  8.5% of cycles
└ stalled-cycles-backend   7.60e+08  100.0%  # 70.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.98e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.74e+08  100.0%  # 374.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 328002)
┌ cpu-cycles               1.17e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.29e+08  100.0%  # 11.1% of cycles
└ stalled-cycles-backend   8.00e+08  100.0%  # 68.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.96e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.09e+08  100.0%  # 409.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 328003)
┌ cpu-cycles               1.20e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.40e+08  100.0%  # 11.7% of cycles
└ stalled-cycles-backend   8.27e+08  100.0%  # 69.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.91e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.15e+08  100.0%  # 414.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 328004)
┌ cpu-cycles               1.11e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.38e+08  100.0%  # 12.5% of cycles
└ stalled-cycles-backend   7.39e+08  100.0%  # 66.9% of cycles
┌ instructions             1.79e+09  100.0%  #  1.6 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.95e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.83e+08  100.0%  # 382.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               9.40e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.52e+08  100.0%  # 10.1% of cycles
└ stalled-cycles-backend   6.46e+09  100.0%  # 68.7% of cycles
┌ instructions             1.44e+10  100.0%  #  1.5 insns per cycle
│ branch-instructions      8.97e+08  100.0%  #  6.2% of insns
└ branch-misses            4.90e+06  100.0%  #  0.5% of branch insns
┌ task-clock               3.27e+09  100.0%  #  3.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.36e+02  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.637576 seconds (111.86 k allocations: 6.954 MiB)

Edit: To elaborate the 8 threads, we are in 3d and i double the number of elements in each direction, so the load increases by ~2^3.

Edit 2: Interestingly here the frontend starts to stall. But I do not think this is enough to explain the performance drop.

termi-official · 2023-01-20T15:52:17Z

I think I was finally able to track down what is going wrong. The arithmetic intensity of the kernels in the current design of Ferrite is low, because we precompute a good amount of data. This implies that we hit memory limitations faster than expected - which explains the wildly varying scalings in the benchmarks on different systems reported above. We can observe this issue surfacing when monitoring the CPU utilizations (e.g. only 235% for 16 threads on my system).

To show that this is not Amdahl's law I equipartitioned each color so each thread has one task, and I pinned the tasks. Further, we can see in LIKWID, that the FLOPS collapse.

using Ferrite, SparseArrays, LIKWID

using ThreadPinning
pinthreads(:cores)

function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{dim, refshape, 1}()
    quadrature_rule = QuadratureRule{dim, refshape}(order)
    face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
    cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefCube, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble

# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, f::Vector{Float64}, scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
    ## Each color is safe to assemble threaded
    for color in colors
        ## We try to equipartition the array to increase load per task.
        chunk_size = max(1, 1 + length(color) ÷ Threads.nthreads())
        color_partition = [color[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:length(color)]
        ## Now we should have a 1:1 correspondence between tasks and elements to assemble.
        Threads.@threads :static for i in 1:length(color_partition)
            for cellid ∈ color_partition[i]
                assemble_cell!(scratches[i], cellid, K, grid, dh, C, b)
            end
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefCube
    quadrature_order = 5
    dim = 3
    n = 30
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0))
    ## compilation
    doassemble(K, colors, grid, dh, C, f, scratches, b);
    return @perfmon "FLOPS_DP" doassemble(K, colors, grid, dh, C, f, scratches, b);
end

metrics, events = run_assemble();
clocks = [res["Clock [MHz]"] for res in metrics["FLOPS_DP"]]
println("Clock [MHz] (min, avg, max): ", minimum(clocks), " | ", mean(clocks), " | " , maximum(clocks)) 

thread_times = [res["Runtime unhalted [s]"] for res in metrics["FLOPS_DP"]]
println("Runtime unhalted [s] (min, avg, max): ", minimum(thread_times), " | ", mean(thread_times), " | " , maximum(thread_times))

println("Total runtime [s] ", first([res["Runtime (RDTSC) [s]"] for res in metrics["FLOPS_DP"]]))

lijas · 2023-02-15T08:03:13Z

I dont understand why precomputing/allocating a bunch of values (e.g shape functions) would be worse for multithreaded code... but do you think a sort of "lazy" cellvalues would be better for this case?

function element_routine!()
for iqp in 1:getncellvalues(cv)
     reinit!(cv, iqp) #compute shapevalues, jacobian, shape_gradient for current gauss point here

The drawback is that we have to recompute N and dNd\xi for each element, but perhaps we could get better threading performance

termi-official · 2023-02-15T08:49:44Z

What works best has to be investigated further. Also, I am not 100% confident that this is the full picture, yet. Indeed, the common strategy to resolve such issues is to just recompute a bunch of values, or even go full matrix-free and just reevaluate every every time. Please also note that more for expensive assembly this problem is not so pronounced anymore.

The "problem" with precomputation (and one copy per thread) is just that you increase the number of accesses to the main memory while having a low number of float point operations in your assembly kernel. This way each assembly kernel puts some amount of pressure on your memory bus via the memory acces. Now, your memory bandwith is limited, while you put more workers on the problem, and with enough parallel workers this memory bandwith becomes the bottleneck, because you cannot move the variables from the memory to the processor cores fast enough to keep them busy. It is quite a simplification, but I hope that the point is clear - does that help?

KnutAM · 2023-02-16T19:28:21Z

I played around a bit with this on the cluster login node, and I got much better scaling than on my laptop.
Except for the case in the example (old.jl) where the scratch is created inside the timing, it seems to scale quite ok? Or should one expect better scaling efficiency?

The code is in the kam/StaticValues branch with benchmarks referred to in the table in the dev folder. There I implemented "static" cellvalues such that isbitstype(cv)=true (in an attempt to see if things would end up on each core's L1/L2 cache, but didn't seem to work based on timings).

.	t [s]				t*threads				loss
threads	static.jl	dyn.jl	old.jl	dennis.jl	static.jl	dyn.jl	old.jl	dennis.jl	static.jl	dyn.jl	old.jl	dennis.jl
1	3.219	2.998	2.118	1.899	3.2	3.0	2.1	1.9	0%	0%	0%	0%
2	1.966	1.867	1.252	1.106	3.9	3.7	2.5	2.2	22%	25%	18%	16%
4	1.074	0.944	0.880	0.608	4.3	3.8	3.5	2.4	33%	26%	66%	28%
8	0.564	0.557	0.615	0.340	4.5	4.5	4.9	2.7	40%	49%	132%	43%
16	0.334	0.318	0.663	0.204	5.3	5.1	10.6	3.3	66%	70%	401%	72%

(loss = threads*t(threads)/t(1) - 1, where t is the time as a function of number of threads)

versioninfo

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 40 virtual cores

termi-official · 2023-02-16T20:10:41Z

Thanks for taking the time Knut. I agree with you that the scaling is okay, but I would expect that we should be able to hit close to 16x for 16 threads here (or to be realistic around 14x). The problem is that modern CPU architectures scale up in core count, where the memory bandwidth limit can be quite severe. We will likely see worse performance hits than on our workstations. We can artificially drive up scaling with matrix-free techniques and utilize iterative solvers, but then the problem is that we do not have good preconditioners (because most state-of-the-art preconditioners need global matrix information).

KristofferC · 2023-02-16T20:14:53Z

This numbers should be compared to state of the art linear solvers for the problem imo. If no one can find a solver where our assembly speed is a bottle neck then we are "done". Sure, there is matrix free methods but then that should actually be what is benchmarked.

termi-official · 2023-02-27T13:32:53Z

Thanks for the comment here Kristoffer. I fully agree with you that we should primarily target to benchmark similar problems, i.e. assembly. I dug a bit into literature and ended up on the WorkStream paper (dx.doi.org/10.1145/2851488) again. According to the analysis and experiments therein, we should be able to increase the scalability quite a bit by increasing data locality. I will try to reproduce the results from the paper in Ferrite (based on the previous work by Fredrik) if I find some free time.

PetrKryslUCSD · 2023-03-14T18:38:42Z

Hello all. This thread may be relevant: https://discourse.julialang.org/t/parallel-assembly-of-a-finite-element-sparse-matrix/95947

PetrKryslUCSD · 2023-03-17T02:57:43Z

@termi-official : I would be curious to have your opinion on the data shown in the above thread.

termi-official · 2023-03-17T13:49:50Z

I answered in discourse.

KnutAM · 2023-05-28T11:43:05Z

I have to redo the threading model in FerriteAssembly because I got errors due to multi-layered threading (:static not ok), this should probably be included in Ferrite.jl's example as well (branches fe/spawn and fe/workstream are works in progress I believe (thanks for the inspiration Fredrik for my attempts))

Somehow when I'm running the "dennis.jl" benchmark again now on the cluster on Julia 1.9, I get even better results, with the following speedups:

N	`dennis.jl`	`FerriteAssembly.jl`
1	1.000	1.000
2	1.965	1.888
4	3.895	3.727
8	7.655	7.359
16	13.455	14.350

(The current implementation in the example stalls at 8 threads, no further improvements. But scratches are created in the timing)

I haven't checked if this is due to new julia or other changes. (This was when submitting jobs to the cluster, which gave the most stable results. )

I agree with you that the scaling is okay, but I would expect that we should be able to hit close to 16x for 16 threads here (or to be realistic around 14x).

So we are here @termi-official :) But perhaps someone else should verify though on their computer/cluster environment. It might be something strange with my benchmarks.

termi-official · 2023-05-28T19:43:03Z

Whoa, thanks for the update Knut! I will definitely check after GAMM. :)

Edit: Forgot to answer here, I cannot reproduce Knut's results locally.

termi-official · 2023-10-19T17:42:01Z

Another thing which I had in mind was that it might be interesting to explore the usage of atomics in assemble! for threaded code, e.g. via assemble_atomic!.

termi-official · 2023-10-26T17:28:07Z

I just found this one https://dl.acm.org/doi/full/10.1145/3503925 "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs" which I leave here for later reference.

termi-official · 2024-02-09T15:16:56Z

I have to redo the threading model in FerriteAssembly because I got errors due to multi-layered threading (:static not ok), this should probably be included in Ferrite.jl's example as well (branches fe/spawn and fe/workstream are works in progress I believe (thanks for the inspiration Fredrik for my attempts))

Somehow when I'm running the "dennis.jl" benchmark again now on the cluster on Julia 1.9, I get even better results, with the following speedups:
N dennis.jl FerriteAssembly.jl
1 1.000 1.000
2 1.965 1.888
4 3.895 3.727
8 7.655 7.359
16 13.455 14.350

(The current implementation in the example stalls at 8 threads, no further improvements. But scratches are created in the timing)

I haven't checked if this is due to new julia or other changes. (This was when submitting jobs to the cluster, which gave the most stable results. )

I agree with you that the scaling is okay, but I would expect that we should be able to hit close to 16x for 16 threads here (or to be realistic around 14x).

So we are here @termi-official :) But perhaps someone else should verify though on their computer/cluster environment. It might be something strange with my benchmarks.

I cannot reproduce the original issue on Julia 1.10 with the code in #526 (comment) and with 1.10 I can reproduce the scaling you reported (14x speedup for 16 cores on a 16 core machine). 😅

PetrKryslUCSD · 2024-02-15T17:57:56Z

@termi-official @KnutAM
Hi, I'd like to test drive the thread based assembly in Ferrite.
What is the correct way please?
Thanks!
Petr

termi-official · 2024-02-15T18:14:05Z

I think the easiest way is to create some empty project, add the code here #526 (comment) into some file (e.g. threaded-assembly-ferrite.jl). In this project you need to add

]add Ferrite#master, SparseArrays, LIKWID, ThreadPinning

and now you should be able to run the program via julia and collect the timings. Note that the scalability is still not great for larger systems, because the coloring approach is rather limited (as pointed out by Kristoffer). If you want to have a better scalable solution in Ferrite then you might need a bit longer, as I currently do not have the time to bring the scalable assembly PR over the finish line too soon. However, this approach will be a bit smaller on systems with a small amount of cores.

PetrKryslUCSD · 2024-02-15T18:21:48Z

Thanks. I was actually interested in your best approach reported above. Will that not work? Or is that the code in #526?

koehlerson · 2024-02-15T18:45:05Z

But we do have two different colorings, where one is from a Bangert Paper implemented by @fredrikekre

PetrKryslUCSD · 2024-02-15T18:52:39Z

But we do have two different colorings, where one is from a Bangert Paper implemented by @fredrikekre

Sorry, I do not follow. If I want to reproduce your results discussed in this thread (where you say you are happy with the speedups), what do I do? Do I follow this recipe #526 (comment)?

PetrKryslUCSD · 2024-02-15T21:22:33Z

If you want to have a better scalable solution in Ferrite

Thanks, I was just a bit unsure what the above meant.

PetrKryslUCSD · 2024-02-16T00:00:10Z

I think the easiest way is to create some empty project, add the code here #526 (comment) into some file (e.g. threaded-assembly-ferrite.jl). In this project you need to add
]add Ferrite#master, SparseArrays, LIKWID, ThreadPinning
and now you should be able to run the program via julia and collect the timings. Note that the scalability is still not great for larger systems, because the coloring approach is rather limited (as pointed out by Kristoffer). If you want to have a better scalable solution in Ferrite then you might need a bit longer, as I currently do not have the time to bring the scalable assembly PR over the finish line too soon. However, this approach will be a bit smaller on systems with a small amount of cores.

Do you remember the version it used to work with? I am getting these errors which seem to indicate the ferrite version I have is too far ahead:

ERROR: LoadError: The `CellVectorValues` interface has been reworked for Ferrite.jl 0.4.0:

 - `CellScalarValues` and `CellVectorValues` have been merged into a single type: `CellValues`
 - "Vectorization" of (scalar) interpolations should now be done on the interpolation
   instead of implicitly in the `CellValues` constructor.

Upgrade as follows:
 - Scalar fields: Replace usage of
       CellScalarValues(quad_rule, interpolation)
   with
       CellValues(quad_rule, interpolation)
 - Vector fields: Replace usage of
       CellVectorValues(quad_rule, interpolation)
   with
       CellValues(quad_rule, interpolation^dim)
   where `dim` is the dimension to vectorize to.

See CHANGELOG.md (https://github.com/Ferrite-FEM/Ferrite.jl/blob/master/CHANGELOG.md) for more details.

PetrKryslUCSD · 2024-02-16T02:01:59Z

I got it running with 0.3.11. Not sure if that's the right thing to do.

termi-official · 2024-02-16T09:23:16Z

Ah, sorry. Indeed I have a local modification to let the example run on master.

using Ferrite, SparseArrays, LIKWID

using ThreadPinning
pinthreads(:cores)

function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{refshape, 1}()
    quadrature_rule = QuadratureRule{refshape}(order)
    face_quadrature_rule = FaceQuadratureRule{refshape}(order)
    cellvalues = [CellValues(quadrature_rule, interpolation_space^dim) for i in 1:Threads.nthreads()];
    facevalues = [FaceValues(face_quadrature_rule, interpolation_space^dim) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefHexahedron, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble

# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, f::Vector{Float64}, scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
    ## Each color is safe to assemble threaded
    for color in colors
        ## We try to equipartition the array to increase load per task.
        chunk_size = max(1, 1 + length(color) ÷ Threads.nthreads())
        color_partition = [color[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:length(color)]
        ## Now we should have a 1:1 correspondence between tasks and elements to assemble.
        Threads.@threads :static for i in 1:length(color_partition)
            for cellid ∈ color_partition[i]
                assemble_cell!(scratches[i], cellid, K, grid, dh, C, b)
            end
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefHexahedron
    dim = 3
    n = 30
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0))
    ## compilation
    doassemble(K, colors, grid, dh, C, f, scratches, b);
    return @perfmon ("FLOPS_DP", "BRANCH", "L2", "L3", "CACHE", "ICACHE", "TLB") doassemble(K, colors, grid, dh, C, f, scratches, b);
    # return doassemble(K, colors, grid, dh, C, f, scratches, b);
end

metrics, events = run_assemble();
clocks = [res["Clock [MHz]"] for res in metrics["FLOPS_DP"]]
println("Clock [MHz] (min, avg, max): ", minimum(clocks), " | ", mean(clocks), " | " , maximum(clocks)) 

thread_times = [res["Runtime unhalted [s]"] for res in metrics["FLOPS_DP"]]
println("Runtime unhalted [s] (min, avg, max): ", minimum(thread_times), " | ", mean(thread_times), " | " , maximum(thread_times))

println("Total runtime [s] ", first([res["Runtime (RDTSC) [s]"] for res in metrics["FLOPS_DP"]]))

10.540680736238489s/0.750866197848292s = 14x

PetrKryslUCSD · 2024-02-16T15:13:11Z

Cool, I will check it out. Thanks, @termi-official !

PetrKryslUCSD · 2024-02-16T16:41:07Z

I ran the simulations on Mac 2 Ultra (no thread pinning).

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores

Results with 1, 2, 4, 8, 16 threads:

julia> 12.68 ./ [6.68, 3.61, 1.97, 1.06]
4-element Vector{Float64}:
 1.89820e+00
 3.51247e+00
 6.43655e+00
 1.19623e+01

How did you guys figure the serial execution time?

I took the serial execution from the result with a single thread. However, that is not quite right: the serial execution does not need the scratches, for instance. It doesn't need the colors either. So if the serial execution time goes down, the parallel efficiency will go down, by necessity.

KnutAM · 2024-02-16T19:11:58Z

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with smaller sized chunks.

I don't count the difference in setup time (e.g. creating coloring and scratches) since the assembly is usually performed so many times.
Based on e.g. #631, I guess that the loop order (which gets messed up for a colored grid) could be partly to blame for the difference between a serial and threaded setup.

Threads	time [s]	Scaling
1(serial)	9.74862	1
1(threaded)	10.32795	0.943906
2	5.467666	1.782958
4	2.77263	3.516019
8	1.392804	6.999275
16	0.713087	13.67102

termi-official · 2024-02-16T20:02:27Z

I ran the simulations on Mac 2 Ultra (no thread pinning).

Doesn't thread pinning give you some extra performance?

How did you guys figure the serial execution time?

I took the serial execution from the result with a single thread. However, that is not quite right: the serial execution does not need the scratches, for instance. It doesn't need the colors either. So if the serial execution time goes down, the parallel efficiency will go down, by necessity.

Since was interested in scalability here I took one thread as reference. But it should not matter as only the assembly loop is measured. For the total program runtime efficiency will indeed change quite a bit, also because everything but the assembly and solve is serial.

PetrKryslUCSD · 2024-02-16T23:08:32Z

Doesn't thread pinning give you some extra performance?

Not available on the mac.

PetrKryslUCSD · 2024-02-17T23:19:59Z

FinEtools threaded assembly. n=50.

pkrysl@samadira exp % ../julia-1.10.0/bin/julia -t 24 batch_def.jl 50
Current folder: /Users/pkrysl/exp
  Activating project at `~/exp`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
Threads.nthreads() = 24
┌ Warning: Operating system not supported by ThreadPinning.jl. Functions like `pinthreads` will be no-ops!
│ (Hide this warning via `ThreadPinning.Prefs.set_os_warning(false)`.)
└ @ ThreadPinning ~/.julia/packages/ThreadPinning/q7w4s/src/ThreadPinning.jl:93
N = parse(Int, ARGS[1]) = 50
[ Info: Serial Assembly
  0.000011 seconds (3 allocations: 288 bytes)
  0.000000 seconds
 16.624546 seconds (15.34 k allocations: 31.782 GiB, 1.76% gc time, 0.32% compilation time)
[ Info: Parallel Assembly with 1 Tasks
  0.003069 seconds (9 allocations: 85.831 MiB, 6.50% gc time)
  0.036835 seconds (112.54 k allocations: 16.100 GiB, 23.27% gc time, 76.08% compilation time)
  0.000001 seconds
  0.006946 seconds (6.06 k allocations: 406.320 KiB, 98.95% compilation time)
 13.250371 seconds (4.76 M allocations: 345.755 MiB, 0.21% gc time, 3.66% compilation time)
  7.361293 seconds (35 allocations: 20.829 GiB, 4.68% gc time)
[ Info: Serial Assembly
  0.000001 seconds (3 allocations: 288 bytes)
  0.000000 seconds
 16.702665 seconds (315 allocations: 31.781 GiB, 1.67% gc time)
[ Info: Parallel Assembly with 1 Tasks
  0.008336 seconds (9 allocations: 85.831 MiB, 2.72% gc time)
  0.007859 seconds (21 allocations: 16.093 GiB, 98.64% gc time)
  0.000001 seconds
  0.000026 seconds (31 allocations: 1.250 KiB)
 12.822084 seconds (293 allocations: 29.847 MiB)
  7.406948 seconds (35 allocations: 20.829 GiB, 4.91% gc time)
[ Info: Parallel Assembly with 2 Tasks
  0.019266 seconds (17 allocations: 85.832 MiB, 45.63% gc time)
  0.044106 seconds (22 allocations: 16.093 GiB, 99.68% gc time)
  0.000001 seconds
  0.000022 seconds (40 allocations: 1.625 KiB)
  6.823925 seconds (577 allocations: 59.694 MiB)
  7.490230 seconds (35 allocations: 20.829 GiB, 5.18% gc time)
[ Info: Parallel Assembly with 4 Tasks
  0.019106 seconds (33 allocations: 85.832 MiB, 44.28% gc time)
  0.045414 seconds (24 allocations: 16.093 GiB, 99.70% gc time)
  0.000001 seconds
  0.000024 seconds (58 allocations: 2.328 KiB)
  3.699887 seconds (1.15 k allocations: 119.387 MiB)
  8.023389 seconds (35 allocations: 20.829 GiB, 10.19% gc time)
[ Info: Parallel Assembly with 8 Tasks
  0.017301 seconds (65 allocations: 85.834 MiB, 53.32% gc time)
  0.045189 seconds (28 allocations: 16.093 GiB, 99.71% gc time)
  0.000001 seconds
  0.000024 seconds (94 allocations: 3.734 KiB)
  1.954882 seconds (2.28 k allocations: 238.773 MiB)
  8.035546 seconds (35 allocations: 20.829 GiB, 10.06% gc time)
[ Info: Parallel Assembly with 16 Tasks
  0.017221 seconds (130 allocations: 85.838 MiB, 50.30% gc time)
  0.044944 seconds (36 allocations: 16.093 GiB, 99.70% gc time)
  0.000001 seconds
  0.000033 seconds (167 allocations: 6.875 KiB)
  1.503361 seconds (4.55 k allocations: 477.547 MiB)
  7.826032 seconds (35 allocations: 20.829 GiB, 7.49% gc time)

The code:

println("Current folder: $(pwd())")
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Pkg.add("FinEtools")
Pkg.add("FinEtoolsDeforLinear")
Pkg.add("ChunkSplitters")
Pkg.add("ThreadPinning")
Pkg.update()


module par_assembly_examples
using FinEtools
using FinEtoolsDeforLinear
using ChunkSplitters


function run_example(N = 10, ntasks = 2, do_serial = false)
    E = 1000.0
    nu = 0.4999 #Taylor data
    W = 25.0
    H = 50.0
    L = 50.0
    CTE = 0.0

    fens, fes = H8block(W, L, H, N, N, 10*N)

    MR = DeforModelRed3D
    material = MatDeforElastIso(MR, 0.0, E, nu, CTE)
    ir = GaussRule(3, 2)

    geom = NodalField(fens.xyz)
    u = NodalField(zeros(size(fens.xyz, 1), 3)) # displacement field

    numberdofs!(u)

    if do_serial
        @info "Serial Assembly"
        @time femm = FEMMDeforLinear(MR, IntegDomain(fes, ir), material)
        @time associategeometry!(femm, geom)
        @time K = stiffness(femm, geom, u)
    end

    @info "Parallel Assembly with $(ntasks) Tasks"

    function matrixcomputation!(femm, assembler)
        stiffness(femm, assembler, geom, u)
    end

    femms = FEMMDeforLinear[]
    @time for (ch, j) in chunks(1:count(fes), ntasks)
        femm = FEMMDeforLinear(MR, IntegDomain(subset(fes, ch), ir), material)
        associategeometry!(femm, geom)
        push!(femms, femm)
    end

    @time assembler = make_assembler(femms, SysmatAssemblerSparseSymm, u)
    @time start_assembler!(assembler)
    @time assemblers = make_task_assemblers(femms, assembler, SysmatAssemblerSparseSymm, u)
    @time parallel_matrix_assembly(femms, assemblers, matrixcomputation!)
    @time K = make_matrix!(assembler)

    true
end # run_example

end # module


@show Threads.nthreads()
using ThreadPinning
pinthreads(:cores)

@show N = parse(Int, ARGS[1])

using .Main.par_assembly_examples;

Main.par_assembly_examples.run_example(N, 1, true)
Main.par_assembly_examples.run_example(N, 1, true)

for ntasks in [2, 4, 8, 16]
    Main.par_assembly_examples.run_example(N, ntasks, false)
end

nothing

I think it would be interesting if you were to run this on your machine. I believe you have a Linux system with 64 cores?

On the mac 2 ultra I got the speedups

julia> 12.83 ./ [6.82, 3.69, 1.95, 1.50]
4-element Vector{Float64}:
 1.88123e+00
 3.47696e+00
 6.57949e+00
 8.55333e+00

julia>

which isn't that great for 16 threads.

PetrKryslUCSD · 2024-02-17T23:34:57Z

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with unevenly sized chunks.

I don't count the difference in setup time (e.g. creating coloring and scratches) since the assembly is usually performed so many times. Based on e.g. #631, I guess that the loop order (which gets messed up for a colored grid) could be partly to blame for the difference between a serial and threaded setup.

Threads time [s] Scaling
1(serial) 9.74862 1
1(threaded) 10.32795 0.943906
2 5.467666 1.782958
4 2.77263 3.516019
8 1.392804 6.999275
16 0.713087 13.67102

How different is it from the code posted above here (#526 (comment))?

KnutAM · 2024-02-18T12:40:11Z

How different is it from the code posted above here (#526 (comment))?

The main differences are using Threads.@spawn and much smaller chunks (compared to length(color) ÷ Threads.nthreads() which is used above), aiming to give some load balancing.

The loop structure can be seen here, and the heuristic creation of chunks from color here.

Except for the TaskChunks implementation, the strategy is similar to the fe/spawn branch Fredrik created, see here.

PetrKryslUCSD · 2024-02-19T16:23:16Z

Comparison of performance using Ferrite and FinEtools for the cantilever problem.
Timings for the assembly only.

Mac 2 Ultra, Julia 1.10.

Mesh with n=50.

Number of tasks	Ferrite	FinEtools
1	12.72	11.88
2	6.68	6.34
4	3.62	3.34
8	1.99	1.73
16	0.99	0.91

Mesh with n=90.

Number of tasks	Ferrite	FinEtools
1	74.92	72.36
2	39.13	37.60
4	21.04	20.00
8	11.99	9.93
16	6.21	5.24

PetrKryslUCSD · 2024-02-19T16:42:30Z

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with unevenly sized chunks.

I tried adding the masters:

Pkg.add(url="https://github.com/Ferrite-FEM/Ferrite.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/FerriteAssembly.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/MaterialModelsBase.jl.git#master")

Unfortunately, I am getting an error:

What are the versions I should use?
Thanks.

KnutAM · 2024-02-19T16:47:42Z

pkg> add Ferrite#master
pkg> add https://github.com/KnutAM/MaterialModelsBase.jl.git
pkg> add https://github.com/KnutAM/FerriteAssembly.jl.git

Should hopefully work

PetrKryslUCSD · 2024-02-19T16:53:25Z

Goody, I just had those two switched. Now it worked.

PetrKryslUCSD · 2024-02-19T16:56:54Z

Impressive speedups, @KnutAM !

julia> 3.65 ./ [1.83, 0.94, 0.51, 0.27]    
4-element Vector{Float64}:
 1.99454e+00
 3.88298e+00
 7.15686e+00
 1.35185e+01

Sorry: I made an error setting up the grid. Rerunning now.
New measurements:

pkrysl@samadira new_ferrite % tail bench_thread_ferrite_assembly.txt
1 17.414321708 17.413865375 17.416540792
2 8.54891225 8.545035375 8.561896166
4 4.388465875 4.385740916 4.394549167
8 2.349980041 2.357513333 2.317562125
16 1.244327208 1.243291333 1.239926916

Overall slower than #526 (comment), but excellent speedup (17.41/1.24=1.40403e+01)!

KnutAM · 2024-02-19T18:42:12Z

Thanks for trying it out @PetrKryslUCSD!
The latter timing series makes more sense, yes! I suspect the difference is due to assembling the internal force vector and the stiffness. An updated test case implements the equivalent element routine, which at least runs faster on my system.

High performance on FinEtools - nicely done! I'll try to run that later!
Do I understand the code correctly that I should consider the time for parallel_matrix_assembly + make_matrix! for comparison with the Ferrite times?

PetrKryslUCSD · 2024-02-19T18:44:50Z

This is my script:

using Pkg
Pkg.activate("."); Pkg.instantiate()

Pkg.add(url="https://github.com/Ferrite-FEM/Ferrite.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/MaterialModelsBase.jl.git")
Pkg.add(url="https://github.com/KnutAM/FerriteAssembly.jl.git")

Pkg.add("ThreadPinning")

using ThreadPinning
pinthreads(:cores)

using Ferrite
using FerriteAssembly
import FerriteAssembly.ExampleElements: LinearElastic

threading = Threads.nthreads() > 0
n = 50
grid = generate_grid(Hexahedron, (10*n,n,n))

ip = Lagrange{RefHexahedron,1}()^3
qr = QuadratureRule{RefHexahedron}(2)
dh = DofHandler(grid)
add!(dh, :u, ip)
close!(dh)
cv = CellValues(qr, ip)
material = LinearElastic(E=200e9, ν=0.3)
buffer = setup_domainbuffer(DomainSpec(dh, material, cv); threading)

K = create_sparsity_pattern(dh)
r = zeros(ndofs(dh))
assembler = start_assemble(K, r)

work!(assembler, buffer)

function dotiming(K, r, buffer)
    assembler = start_assemble(K, r)
    @elapsed work!(assembler, buffer)
end

open("bench_thread_ferrite_assembly.txt"; append=true) do fid
    write(fid, string(Threads.nthreads()))
    write(fid, " ")
    for _ in 1:3
        t = dotiming(K, r, buffer)
        write(fid, string(t))
        write(fid, " ")
    end
    write(fid, "\n")
end

PetrKryslUCSD · 2024-02-19T18:57:08Z

The newest FinEtools script:

println("Current folder: $(pwd())")
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Pkg.add(url="https://github.com/PetrKryslUCSD/FinEtools.jl.git#main")
Pkg.add(url="https://github.com/PetrKryslUCSD/FinEtoolsDeforLinear.jl.git")
Pkg.add("ChunkSplitters")
Pkg.add("SymRCM")
Pkg.add("ThreadPinning")
Pkg.update()
Pkg.status()


module par_assembly_examples
using FinEtools
using FinEtoolsDeforLinear
using ChunkSplitters
using SymRCM

function run_example(N = 10, ntasks = 2, do_serial = false)
    E = 1000.0
    nu = 0.4999 #Taylor data
    W = 25.0
    H = 50.0
    L = 50.0
    CTE = 0.0

    fens, fes = H8block(W, L, H, N, N, 10*N)
    #C = connectionmatrix(FEMMBase(IntegDomain(fes, GaussRule(3, 1))), count(fens))
    #ordering = symrcm(C)
    #fens, fes = reordermesh(fens, fes, ordering)

    MR = DeforModelRed3D
    material = MatDeforElastIso(MR, 0.0, E, nu, CTE)
    ir = GaussRule(3, 2)

    geom = NodalField(fens.xyz)
    u = NodalField(zeros(size(fens.xyz, 1), 3)) # displacement field

    numberdofs!(u)

    if do_serial
        @info "Serial Assembly"
        @time femm = FEMMDeforLinear(MR, IntegDomain(fes, ir), material)
        @time associategeometry!(femm, geom)
        @time K = stiffness(femm, geom, u)
    end

    @info "Parallel Assembly with $(ntasks) Tasks"

    function matrixcomputation!(femm, assembler)
        stiffness(femm, assembler, geom, u)
    end

    femms = FEMMDeforLinear[]
    @time for (ch, j) in chunks(1:count(fes), ntasks)
        femm = FEMMDeforLinear(MR, IntegDomain(subset(fes, ch), ir), material)
        associategeometry!(femm, geom)
        push!(femms, femm)
    end

    @time assembler = make_assembler(femms, SysmatAssemblerSparseSymm, u)
    @time start_assembler!(assembler)
    @time assemblers = make_task_assemblers(femms, assembler, SysmatAssemblerSparseSymm, u)
    @time parallel_matrix_assembly(femms, assemblers, matrixcomputation!)
    @time K = make_matrix!(assembler)

    true
end # run_example

end # module


@show Threads.nthreads()
using ThreadPinning
pinthreads(:cores)

@show N = parse(Int, ARGS[1])
@show ntasks = parse(Int, ARGS[2])

using .Main.par_assembly_examples;

Main.par_assembly_examples.run_example(N, 1, true)
Main.par_assembly_examples.run_example(N, 1, true)

Main.par_assembly_examples.run_example(N, ntasks, false)
Main.par_assembly_examples.run_example(N, ntasks, false)

nothing

PetrKryslUCSD · 2024-02-21T01:52:33Z

@termi-official and @KnutAM : Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Let me know your thoughts please at [email protected]. I look forward to it. P

PetrKryslUCSD · 2024-02-21T17:23:35Z

Do I understand the code correctly that I should consider the time for parallel_matrix_assembly + make_matrix! for comparison with the Ferrite times?

Sorry, I just noticed this. No, this should be equivalent to Ferrite create sparsity pattern + assembly.

termi-official · 2024-02-21T17:52:53Z

@termi-official and @KnutAM : Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Let me know your thoughts please at [email protected]. I look forward to it. P

The short answer is: No, there is not even incremental research happening right now.

We are merely trying to reproduce a subset of the results from the WorkStream paper and investigate bottlenecks in our implementation (since our implementation underperformed for some reason I cannot fully explain when I opened the thread). There is nothing novel happening here which is not already described in literature about multithreaded assembly. And I think with the recent work from the CEED the most important scalability problems for multithreaded assembly are solved anyway (which I currently try to reproduce). But thanks for the offer @PetrKryslUCSD , I appreciate it!

PetrKryslUCSD · 2024-02-21T17:55:01Z

Understood. Thanks.

Could you by any chance point to the work from CEED? Thanks.

termi-official · 2024-02-21T20:09:38Z

I think a good start is https://doi.org/10.1016/j.parco.2021.102841 . A more exhaustive list should be here https://ceed.exascaleproject.org/pubs/ .

PetrKryslUCSD · 2024-02-22T00:17:34Z

An additional data point: FinEtools assembly only, 64-core Opteron machine with 1, 2, 4, 8, 16, 64 threads:

julia> 64.96466  ./ [35.713732, 15.687828, 9.211306, 4.647433, 2.38525, 1.358766]
6-element Vector{Float64}:
 1.81904e+00
 4.14109e+00
 7.05271e+00
 1.39786e+01
 2.72360e+01
 4.78115e+01

PetrKryslUCSD · 2024-02-24T17:37:28Z

And I think with the recent work from the CEED the most important scalability problems for multithreaded assembly are solved anyway (which I currently try to reproduce).

I must be missing something. The paper you linked does not talk about threading (if you do not consider GPU computing that). Did you have in mind a different paper?

PetrKryslUCSD · 2024-03-05T16:52:18Z

@termi-official Ping...

termi-official · 2024-03-06T20:16:56Z

And I think with the recent work from the CEED the most important scalability problems for multithreaded assembly are solved anyway (which I currently try to reproduce).

I must be missing something. The paper you linked does not talk about threading (if you do not consider GPU computing that). Did you have in mind a different paper?

GPU parallelism is basically thread parallelism. The paper give an overview with quite a few references where you can dive deeper. Also see e.g. Fig 7&8 for some benchmarks where throughput is measured, which can serve as a proxy for scalability.

PetrKryslUCSD · 2024-03-07T03:52:24Z

I think their solution is really not to build a matrix at all. So, good, but not a silver bullet...

…

On Wed, Mar 6, 2024, 12:17 PM Dennis Ogiermann ***@***.***> wrote: And I think with the recent work from the CEED the most important scalability problems for multithreaded assembly are solved anyway (which I currently try to reproduce). I must be missing something. The paper you linked does not talk about threading (if you do not consider GPU computing that). Did you have in mind a different paper? GPU parallelism is basically thread parallelism. The paper give an overview with quite a few references where you can dive deeper. Also see e.g. Fig 7&8 for some benchmarks where throughput is measured, which can serve as a proxy for scalability. — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/Ferrite-FEM/Ferrite.jl/issues/526*issuecomment-1981708828__;Iw!!Mih3wA!CKldlCzNop0RBME4XXZyq3zS4KGAO3yzBxIBP7_aZMgJgtLopibXRRNip5SvRFuRn_jkH2miD3w9LhXL8b02IWKo$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ACLGGWAYXZ7YO3C4OSD6UEDYW52UHAVCNFSM6AAAAAARKEVDIKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBRG4YDQOBSHA__;!!Mih3wA!CKldlCzNop0RBME4XXZyq3zS4KGAO3yzBxIBP7_aZMgJgtLopibXRRNip5SvRFuRn_jkH2miD3w9LhXL8bH-RMZM$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

termi-official added help wanted performance labels Oct 20, 2022

termi-official closed this as completed Feb 9, 2024

Threaded Assembly Performance Degradation #526

Threaded Assembly Performance Degradation #526

Comments

termi-official commented Oct 20, 2022 • edited Loading

TODOs

koehlerson commented Oct 20, 2022

KnutAM commented Oct 20, 2022 • edited Loading

koehlerson commented Oct 20, 2022 • edited Loading

koehlerson commented Oct 20, 2022

KristofferC commented Oct 20, 2022 • edited Loading

koehlerson commented Oct 20, 2022

termi-official commented Oct 20, 2022 • edited Loading

termi-official commented Jan 20, 2023

lijas commented Feb 15, 2023

termi-official commented Feb 15, 2023

KnutAM commented Feb 16, 2023

termi-official commented Feb 16, 2023

KristofferC commented Feb 16, 2023

termi-official commented Feb 27, 2023

PetrKryslUCSD commented Mar 14, 2023

PetrKryslUCSD commented Mar 17, 2023

termi-official commented Mar 17, 2023

KnutAM commented May 28, 2023 • edited Loading

termi-official commented May 28, 2023 • edited Loading

termi-official commented Oct 19, 2023

termi-official commented Oct 26, 2023

termi-official commented Feb 9, 2024

PetrKryslUCSD commented Feb 15, 2024

termi-official commented Feb 15, 2024

PetrKryslUCSD commented Feb 15, 2024 • edited Loading

koehlerson commented Feb 15, 2024

PetrKryslUCSD commented Feb 15, 2024 • edited Loading

PetrKryslUCSD commented Feb 15, 2024

PetrKryslUCSD commented Feb 16, 2024

PetrKryslUCSD commented Feb 16, 2024

termi-official commented Feb 16, 2024

PetrKryslUCSD commented Feb 16, 2024

PetrKryslUCSD commented Feb 16, 2024

KnutAM commented Feb 16, 2024 • edited Loading

termi-official commented Feb 16, 2024 • edited Loading

PetrKryslUCSD commented Feb 16, 2024

PetrKryslUCSD commented Feb 17, 2024 • edited Loading

PetrKryslUCSD commented Feb 17, 2024

KnutAM commented Feb 18, 2024

PetrKryslUCSD commented Feb 19, 2024

PetrKryslUCSD commented Feb 19, 2024

KnutAM commented Feb 19, 2024

PetrKryslUCSD commented Feb 19, 2024

PetrKryslUCSD commented Feb 19, 2024 • edited Loading

KnutAM commented Feb 19, 2024

PetrKryslUCSD commented Feb 19, 2024

PetrKryslUCSD commented Feb 19, 2024

PetrKryslUCSD commented Feb 21, 2024

PetrKryslUCSD commented Feb 21, 2024

termi-official commented Feb 21, 2024

PetrKryslUCSD commented Feb 21, 2024

termi-official commented Feb 21, 2024

PetrKryslUCSD commented Feb 22, 2024 • edited Loading

PetrKryslUCSD commented Feb 24, 2024

PetrKryslUCSD commented Mar 5, 2024

termi-official commented Mar 6, 2024

PetrKryslUCSD commented Mar 7, 2024 via email

termi-official commented Oct 20, 2022 •

edited

Loading

KnutAM commented Oct 20, 2022 •

edited

Loading

koehlerson commented Oct 20, 2022 •

edited

Loading

KristofferC commented Oct 20, 2022 •

edited

Loading

termi-official commented Oct 20, 2022 •

edited

Loading

KnutAM commented May 28, 2023 •

edited

Loading

termi-official commented May 28, 2023 •

edited

Loading

PetrKryslUCSD commented Feb 15, 2024 •

edited

Loading

PetrKryslUCSD commented Feb 15, 2024 •

edited

Loading

KnutAM commented Feb 16, 2024 •

edited

Loading

termi-official commented Feb 16, 2024 •

edited

Loading

PetrKryslUCSD commented Feb 17, 2024 •

edited

Loading

PetrKryslUCSD commented Feb 19, 2024 •

edited

Loading

PetrKryslUCSD commented Feb 22, 2024 •

edited

Loading