Improve performance of AMR with P4estMesh #627

Closed
efaulhaber opened this issue Jun 1, 2021 · 12 comments · Fixed by #638

@efaulhaber
Member

efaulhaber commented Jun 1, 2021

Currently, AMR with TreeMesh is a lot faster than with P4estMesh.
A quick comparison using elixir_advection_amr.jl (with t_end=9.9999 because P4estMesh does one more time step than TreeMesh with t_end=10):

TreeMesh
 ─────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                   Allocations
                                      ──────────────────────   ───────────────────────
           Tot / % measured:               188ms / 93.6%           15.6MiB / 99.2%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 rhs!                          1.60k    113ms  64.5%  70.8μs    586KiB  3.69%     375B
   volume integral             1.60k   54.4ms  31.0%  34.0μs     0.00B  0.00%    0.00B
   interface flux              1.60k   23.9ms  13.6%  14.9μs     0.00B  0.00%    0.00B
   surface integral            1.60k   11.4ms  6.49%  7.12μs     0.00B  0.00%    0.00B
   prolong2interfaces          1.60k   7.13ms  4.06%  4.46μs     0.00B  0.00%    0.00B
   prolong2mortars             1.60k   5.17ms  2.94%  3.23μs     0.00B  0.00%    0.00B
   mortar flux                 1.60k   4.90ms  2.79%  3.06μs     0.00B  0.00%    0.00B
   Jacobian                    1.60k   3.28ms  1.87%  2.05μs     0.00B  0.00%    0.00B
   ~rhs!~                      1.60k   2.06ms  1.17%  1.28μs    586KiB  3.69%     375B
   reset ∂u/∂t                 1.60k    839μs  0.48%   524ns     0.00B  0.00%    0.00B
   prolong2boundaries          1.60k   65.0μs  0.04%  40.6ns     0.00B  0.00%    0.00B
   source terms                1.60k   55.9μs  0.03%  34.9ns     0.00B  0.00%    0.00B
   boundary flux               1.60k   49.7μs  0.03%  31.0ns     0.00B  0.00%    0.00B
 AMR                              63   27.9ms  15.9%   442μs   13.2MiB  85.2%   215KiB
   coarsen                        63   13.1ms  7.47%   208μs   7.14MiB  46.1%   116KiB
     solver                       63   8.59ms  4.89%   136μs   5.18MiB  33.4%  84.2KiB
     mesh                         63   3.29ms  1.87%  52.2μs    302KiB  1.90%  4.79KiB
     ~coarsen~                    63   1.24ms  0.71%  19.7μs   1.67MiB  10.8%  27.1KiB
   refine                         63   12.7ms  7.25%   202μs   5.57MiB  35.9%  90.6KiB
     solver                       63   9.27ms  5.28%   147μs   5.06MiB  32.6%  82.2KiB
     mesh                         63   2.73ms  1.55%  43.3μs    459KiB  2.89%  7.28KiB
       refine_unbalanced!         63   2.25ms  1.28%  35.7μs   8.59KiB  0.05%     140B
       ~mesh~                     63    414μs  0.24%  6.57μs    348KiB  2.19%  5.53KiB
       rebalance!                 63   67.1μs  0.04%  1.07μs    102KiB  0.64%  1.62KiB
     ~refine~                     63    728μs  0.41%  11.6μs   69.2KiB  0.44%  1.10KiB
   indicator                      63   1.26ms  0.72%  20.0μs     0.00B  0.00%    0.00B
   ~AMR~                          63    771μs  0.44%  12.2μs    516KiB  3.25%  8.20KiB
 analyze solution                  5   17.4ms  9.91%  3.48ms   78.6KiB  0.49%  15.7KiB
 I/O                              11   15.8ms  8.99%  1.44ms   1.17MiB  7.53%   109KiB
   ~I/O~                          11   10.4ms  5.94%   948μs    399KiB  2.51%  36.2KiB
   save solution                   5   5.19ms  2.96%  1.04ms    756KiB  4.76%   151KiB
   get element variables           5    165μs  0.09%  32.9μs   40.5KiB  0.25%  8.10KiB
   save mesh                       5    600ns  0.00%   120ns     0.00B  0.00%    0.00B
 initial condition AMR             1   1.07ms  0.61%  1.07ms    484KiB  3.05%   484KiB
   AMR                             3    803μs  0.46%   268μs    484KiB  3.04%   161KiB
     refine                        3    643μs  0.37%   214μs    354KiB  2.23%   118KiB
       solver                      2    329μs  0.19%   165μs    329KiB  2.07%   164KiB
       mesh                        2    292μs  0.17%   146μs   21.8KiB  0.14%  10.9KiB
         refine_unbalanced!        2    269μs  0.15%   135μs      672B  0.00%     336B
         ~mesh~                    2   20.4μs  0.01%  10.2μs   16.2KiB  0.10%  8.10KiB
         rebalance!                2   2.80μs  0.00%  1.40μs   4.94KiB  0.03%  2.47KiB
       ~refine~                    3   21.7μs  0.01%  7.23μs   3.11KiB  0.02%  1.04KiB
     indicator                     3    123μs  0.07%  41.1μs   8.12KiB  0.05%  2.71KiB
     ~AMR~                         3   36.1μs  0.02%  12.0μs    121KiB  0.76%  40.4KiB
     coarsen                       3    300ns  0.00%   100ns      240B  0.00%    80.0B
   ~initial condition AMR~         1    268μs  0.15%   268μs      848B  0.01%     848B
 calculate dt                    321    169μs  0.10%   527ns     0.00B  0.00%    0.00B
 ─────────────────────────────────────────────────────────────────────────────────────
P4estMesh
 ────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                      Time                   Allocations
                                     ──────────────────────   ───────────────────────
          Tot / % measured:               534ms / 97.6%            163MiB / 100%

 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 AMR                             63    241ms  46.3%  3.83ms    158MiB  96.8%  2.50MiB
   refine                        63    118ms  22.6%  1.87ms   73.6MiB  45.2%  1.17MiB
     solver                      63    105ms  20.2%  1.67ms   73.1MiB  44.9%  1.16MiB
     mesh                        63   12.6ms  2.41%   199μs    507KiB  0.30%  8.05KiB
       rebalance                126   11.1ms  2.13%  87.9μs   1.97KiB  0.00%    16.0B
       refine                    63    815μs  0.16%  12.9μs     0.00B  0.00%    0.00B
       ~mesh~                    63    665μs  0.13%  10.6μs    505KiB  0.30%  8.02KiB
     ~refine~                    63   53.8μs  0.01%   854ns   1.66KiB  0.00%    26.9B
   coarsen                       63    111ms  21.4%  1.77ms   71.9MiB  44.1%  1.14MiB
     solver                      63   99.3ms  19.1%  1.58ms   70.1MiB  43.0%  1.11MiB
     mesh                        63   12.0ms  2.30%   190μs   1.78MiB  1.09%  29.0KiB
       rebalance                126   10.0ms  1.92%  79.2μs   1.97KiB  0.00%    16.0B
       ~mesh~                    63   1.44ms  0.28%  22.8μs   1.27MiB  0.78%  20.6KiB
       coarsen!                  63    554μs  0.11%  8.79μs    526KiB  0.31%  8.34KiB
     ~coarsen~                   63   54.4μs  0.01%   863ns   1.66KiB  0.00%    26.9B
   indicator                     63   11.6ms  2.23%   185μs   12.3MiB  7.52%   199KiB
   ~AMR~                         63    394μs  0.08%  6.25μs   2.48KiB  0.00%    40.4B
 rhs!                         1.60k    232ms  44.4%   145μs    586KiB  0.35%     375B
   volume integral            1.60k   84.4ms  16.2%  52.7μs     0.00B  0.00%    0.00B
   interface flux             1.60k   72.5ms  13.9%  45.3μs     0.00B  0.00%    0.00B
   prolong2interfaces         1.60k   34.4ms  6.61%  21.5μs     0.00B  0.00%    0.00B
   mortar flux                1.60k   11.7ms  2.25%  7.32μs     0.00B  0.00%    0.00B
   surface integral           1.60k   10.8ms  2.07%  6.73μs     0.00B  0.00%    0.00B
   prolong2mortars            1.60k   8.23ms  1.58%  5.14μs     0.00B  0.00%    0.00B
   Jacobian                   1.60k   6.11ms  1.17%  3.82μs     0.00B  0.00%    0.00B
   ~rhs!~                     1.60k   2.21ms  0.42%  1.38μs    586KiB  0.35%     375B
   reset ∂u/∂t                1.60k    877μs  0.17%   548ns     0.00B  0.00%    0.00B
   prolong2boundaries         1.60k   76.4μs  0.01%  47.7ns     0.00B  0.00%    0.00B
   source terms               1.60k   50.5μs  0.01%  31.5ns     0.00B  0.00%    0.00B
   boundary flux              1.60k   50.0μs  0.01%  31.2ns     0.00B  0.00%    0.00B
 I/O                             11   19.1ms  3.67%  1.74ms   1.03MiB  0.63%  95.7KiB
   ~I/O~                         11   13.4ms  2.58%  1.22ms    255KiB  0.15%  23.2KiB
   save solution                  5   5.52ms  1.06%  1.10ms    756KiB  0.45%   151KiB
   get element variables          5    175μs  0.03%  35.0μs   40.5KiB  0.02%  8.10KiB
   save mesh                      5    900ns  0.00%   180ns     0.00B  0.00%    0.00B
 analyze solution                 5   18.1ms  3.47%  3.62ms    464KiB  0.28%  92.7KiB
 calculate dt                   321   7.23ms  1.39%  22.5μs     0.00B  0.00%    0.00B
 initial condition AMR            1   4.08ms  0.78%  4.08ms   3.15MiB  1.93%  3.15MiB
   AMR                            3   3.83ms  0.74%  1.28ms   3.15MiB  1.93%  1.05MiB
     refine                       3   3.37ms  0.65%  1.12ms   2.57MiB  1.57%   876KiB
       solver                     3   2.83ms  0.54%   945μs   2.54MiB  1.56%   867KiB
       mesh                       3    534μs  0.10%   178μs   24.1KiB  0.01%  8.05KiB
         rebalance                6    459μs  0.09%  76.6μs     96.0B  0.00%    16.0B
         refine                   3   45.3μs  0.01%  15.1μs     0.00B  0.00%    0.00B
         ~mesh~                   3   28.9μs  0.01%  9.63μs   24.0KiB  0.01%  8.02KiB
       ~refine~                   3   2.20μs  0.00%   733ns   1.66KiB  0.00%     565B
     indicator                    3    429μs  0.08%   143μs    503KiB  0.30%   168KiB
     ~AMR~                        3   31.0μs  0.01%  10.3μs   98.5KiB  0.06%  32.8KiB
     coarsen                      3    200ns  0.00%  66.7ns      240B  0.00%    80.0B
   ~initial condition AMR~        1    248μs  0.05%   248μs      848B  0.00%     848B
 ────────────────────────────────────────────────────────────────────────────────────

rhs! is expected to be slower since P4estMesh is treated as a curved mesh.
AMR is also expected to be slower because the curved data structures that need to be rebuilt are a lot more complex than the ones used by TreeMesh.
However, AMR with P4estMesh can most likely be optimized to be at least twice as fast as it is now.

I found two things in particular that slow down AMR considerably and for which I don't have a solution yet.
The first is calc_jacobian_matrix! in the CurvedMesh data structures (which P4estMesh uses as well).
This function consists of these four lines:

@views mul!(jacobian_matrix[1, 1, :, :, element], basis.derivative_matrix, node_coordinates[1, :, :, element]) # x_ξ
@views mul!(jacobian_matrix[2, 1, :, :, element], basis.derivative_matrix, node_coordinates[2, :, :, element]) # y_ξ
@views mul!(jacobian_matrix[1, 2, :, :, element], node_coordinates[1, :, :, element], basis.derivative_matrix') # x_η
@views mul!(jacobian_matrix[2, 2, :, :, element], node_coordinates[2, :, :, element], basis.derivative_matrix') # y_η

Using mul! generally doesn't allocate, but it does when used with views like this.
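
For comparison, here is a minimal sketch of the same four products written as explicit loops, which avoids calling mul! on views altogether. The function names, the array sizes, and the random test data are assumptions made for illustration and are not the code in Trixi.jl. Whether the loop version is actually faster or allocation-free would still need to be measured.

```julia
using LinearAlgebra

# Reference version: the four `mul!` calls on views, as quoted above.
# `D` plays the role of `basis.derivative_matrix`.
function calc_jacobian_matrix_mul!(jacobian_matrix, element, node_coordinates, D)
    @views mul!(jacobian_matrix[1, 1, :, :, element], D, node_coordinates[1, :, :, element]) # x_ξ
    @views mul!(jacobian_matrix[2, 1, :, :, element], D, node_coordinates[2, :, :, element]) # y_ξ
    @views mul!(jacobian_matrix[1, 2, :, :, element], node_coordinates[1, :, :, element], D') # x_η
    @views mul!(jacobian_matrix[2, 2, :, :, element], node_coordinates[2, :, :, element], D') # y_η
    return jacobian_matrix
end

# Loop version (hypothetical name): same products, written out index by index.
function calc_jacobian_matrix_loops!(jacobian_matrix, element, node_coordinates, D)
    n = size(D, 1)
    for j in 1:n, i in 1:n
        x_xi = zero(eltype(jacobian_matrix))
        y_xi = zero(eltype(jacobian_matrix))
        x_eta = zero(eltype(jacobian_matrix))
        y_eta = zero(eltype(jacobian_matrix))
        for k in 1:n
            # (D * X)[i, j] = sum_k D[i, k] * X[k, j]
            x_xi += D[i, k] * node_coordinates[1, k, j, element]
            y_xi += D[i, k] * node_coordinates[2, k, j, element]
            # (X * D')[i, j] = sum_k X[i, k] * D[j, k]
            x_eta += node_coordinates[1, i, k, element] * D[j, k]
            y_eta += node_coordinates[2, i, k, element] * D[j, k]
        end
        jacobian_matrix[1, 1, i, j, element] = x_xi
        jacobian_matrix[2, 1, i, j, element] = y_xi
        jacobian_matrix[1, 2, i, j, element] = x_eta
        jacobian_matrix[2, 2, i, j, element] = y_eta
    end
    return jacobian_matrix
end

# Made-up sizes: 4 nodes per direction (polydeg = 3) and 3 elements.
n, nelements = 4, 3
D = rand(n, n)
node_coordinates = rand(2, n, n, nelements)
J_mul = zeros(2, 2, n, n, nelements)
J_loops = zeros(2, 2, n, n, nelements)

for element in 1:nelements
    calc_jacobian_matrix_mul!(J_mul, element, node_coordinates, D)
    calc_jacobian_matrix_loops!(J_loops, element, node_coordinates, D)
end

println("maximum difference: ", maximum(abs.(J_mul .- J_loops)))
```

Both variants fill the same entries, so the loop version can be checked against the mul! version before timing either of them.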

Secondly, when initializing P4estMesh data structures, user data needs to be passed to another function through p4est.
This user data needs to provide the interfaces container, some interface ID, and the mesh. Currently, I'm packing these into a Vector{Any} and passing a pointer to it to p4est.

user_data = [interfaces, 1, mesh]

In other functions, I receive this as a Ptr{Nothing}, which I need to unpack.
I am currently doing this as follows:
ptr = Ptr{Any}(user_data)
data_array = unsafe_wrap(Array, ptr, 3)
interfaces = data_array[1]
interface_id = data_array[2]
data_array[2] += 1
mesh = data_array[3]

However, this is not type-stable. I added a function barrier in this PR, but this is still not optimal.
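
One possible direction, sketched under assumptions: pack the three values into a mutable struct with concrete field types instead of a Vector{Any}, hand its address to the C side, and recover it with a type assertion. The names InitUserData and unpack_and_advance! are hypothetical, the placeholder interfaces and mesh objects stand in for the real containers, and no actual p4est call is made here.

```julia
# Hypothetical container for the user data; the type parameters keep the fields concrete.
mutable struct InitUserData{Interfaces, Mesh}
    interfaces::Interfaces
    interface_id::Int
    mesh::Mesh
end

# Recover the struct from a raw pointer. The type assertion `::T` tells the
# compiler which concrete type to expect, so the field accesses are type-stable.
function unpack_and_advance!(ptr::Ptr{Cvoid}, ::Type{T}) where {T}
    data = unsafe_pointer_to_objref(ptr)::T
    interface_id = data.interface_id
    data.interface_id += 1
    return data.interfaces, interface_id, data.mesh
end

# Placeholder objects standing in for the real interfaces container and mesh.
interfaces = zeros(Int, 4)
mesh = "mesh placeholder"
data = InitUserData(interfaces, 1, mesh)

# `GC.@preserve` keeps `data` alive while only its raw address is in use.
id1, id2 = GC.@preserve data begin
    ptr = pointer_from_objref(data)
    _, first_id, _ = unpack_and_advance!(ptr, typeof(data))
    _, second_id, _ = unpack_and_advance!(ptr, typeof(data))
    (first_id, second_id)
end

println("interface IDs handed out: ", id1, ", ", id2)
```

The type assertion only helps if the concrete type is known at the place where the pointer is unpacked; inside a real p4est callback, that may still require a function barrier.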

How could I further optimize these two functions?

@sloede
Member

sloede commented Jun 2, 2021

Some thoughts, questions:

  1. Why does AMR take only about one third of the time per call during the initial refinement? Do you have an explanation for that?
  2. Does the runtime of AMR scale linearly with the number of cells/interfaces? Or is there a large constant fixed cost independent of the problem size?
  3. How do you know that the iter_faces part is the problem? From what I can see, the solver part makes up ~205 ms of the total of 240 ms spent in the AMR timer. Then there's ~20 ms in rebalance and ~10 ms in the indicator. When trying to optimize anything here, I would first focus on the solver part, since any relative improvement there is going to have the biggest absolute impact overall...

@ranocha
Member

ranocha commented Jun 2, 2021

  1. Why does AMR take only about one third of the time per call during the initial refinement? Do you have an explanation for that?

It's only refining initially, isn't it?

@sloede
Member

sloede commented Jun 2, 2021

Yes, it could be that it's really just the number of cells involved being smaller in the beginning.

@ranocha
Member

ranocha commented Jun 2, 2021

@efaulhaber If I were you, I would

  • check where the allocations from solver in refine/coarsen are coming from
  • fix them whenever (reasonably) possible
  • run partial benchmarks (BenchmarkTools.jl or BenchmarkHistograms.jl) or profile the code (ProfileView.jl) to compare AMR for the TreeMesh and the P4estMesh in more detail
  • attack the most expensive parts one after another
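
As a small illustration of the first step, Base's @allocated macro can compare the allocations of two variants of a function once both have been compiled. The two functions below are made-up examples and not Trixi.jl code.

```julia
# Copies every column before summing it: each `A[:, j]` allocates a new vector.
function sum_columns_copy(A)
    total = zero(eltype(A))
    for j in axes(A, 2)
        total += sum(A[:, j])
    end
    return total
end

# Same computation, but `@views` turns the slices into non-copying views.
function sum_columns_view(A)
    total = zero(eltype(A))
    @views for j in axes(A, 2)
        total += sum(A[:, j])
    end
    return total
end

A = rand(100, 50)

# Call each function once so that compilation is not included in the measurement.
sum_columns_copy(A)
sum_columns_view(A)

bytes_copy = @allocated sum_columns_copy(A)
bytes_view = @allocated sum_columns_view(A)
println("copying slices: ", bytes_copy, " bytes, views: ", bytes_view, " bytes")
```

The same pattern can be applied function by function inside the solver part to narrow down where the allocations originate.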

@efaulhaber
Member Author

Most of the time spent in the solver part is due to reinitializing the containers. init_elements! takes the most time there, mostly because of calc_jacobian_matrix!.
Most of the time spent initializing interfaces, boundaries and mortars is spent in the iter_faces functions (that's how these are initialized).

@efaulhaber
Member Author

Here's another timer output where I timed reinitialize_containers! more thoroughly.

 ──────────────────────────────────────────────────────────────────────────────────────────────────
                      Trixi.jl                             Time                   Allocations
                                                   ──────────────────────   ───────────────────────
                 Tot / % measured:                      583ms / 97.8%            163MiB / 100%

 Section                                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────────────
 AMR                                           63    284ms  49.8%  4.50ms    158MiB  96.8%  2.51MiB
   refine                                      63    148ms  26.0%  2.35ms   73.6MiB  45.2%  1.17MiB
     solver                                    63    136ms  23.9%  2.16ms   73.1MiB  44.9%  1.16MiB
       reinitialize_containers!                63    134ms  23.6%  2.13ms   69.8MiB  42.8%  1.11MiB
         init_elements!                        63   83.9ms  14.7%  1.33ms   41.7MiB  25.6%   678KiB
           calc_jacobian_matrix!            28.3k   50.8ms  8.92%  1.79μs   24.2MiB  14.8%     896B
           calc_node_coordinates!              63   15.5ms  2.72%   246μs   17.5MiB  10.8%   285KiB
           ~init_elements!~                    63   7.51ms  1.32%   119μs   3.31KiB  0.00%    53.8B
           calc_inverse_jacobian!           28.3k   5.38ms  0.94%   190ns     0.00B  0.00%    0.00B
           calc_contravariant_vectors!      28.3k   4.74ms  0.83%   167ns     0.00B  0.00%    0.00B
         init_interfaces!                      63   11.2ms  1.96%   177μs   6.67MiB  4.09%   108KiB
         count_required_interfaces             63   10.9ms  1.91%   172μs   9.03MiB  5.54%   147KiB
         init_mortars!                         63   9.51ms  1.67%   151μs   7.00MiB  4.29%   114KiB
         count_required_mortars                63   8.00ms  1.41%   127μs   5.28MiB  3.24%  85.9KiB
         init_boundaries!                      63   5.38ms  0.94%  85.4μs   10.8KiB  0.01%     176B
         count_required_boundaries             63   5.16ms  0.91%  81.9μs   5.91KiB  0.00%    96.0B
         ~reinitialize_containers!~            63    155μs  0.03%  2.46μs   10.5KiB  0.01%     170B
         resize elements                       63   74.1μs  0.01%  1.18μs   38.4KiB  0.02%     624B
         resize interfaces                     63   68.4μs  0.01%  1.09μs   21.7KiB  0.01%     352B
         resize mortars                        63   29.2μs  0.01%   463ns   21.7KiB  0.01%     352B
         resize boundaries                     63   19.7μs  0.00%   313ns   6.89KiB  0.00%     112B
       ~solver~                                63   1.81ms  0.32%  28.8μs   3.32MiB  2.04%  53.9KiB
     mesh                                      63   12.1ms  2.13%   193μs    507KiB  0.30%  8.05KiB
       rebalance                              126   10.7ms  1.88%  85.2μs   1.97KiB  0.00%    16.0B
       refine                                  63    783μs  0.14%  12.4μs     0.00B  0.00%    0.00B
       ~mesh~                                  63    626μs  0.11%  9.94μs    505KiB  0.30%  8.02KiB
     ~refine~                                  63   37.1μs  0.01%   589ns   1.66KiB  0.00%    26.9B
   coarsen                                     63    122ms  21.4%  1.93ms   71.9MiB  44.1%  1.14MiB
     solver                                    63    110ms  19.3%  1.74ms   70.1MiB  43.0%  1.11MiB
       reinitialize_containers!                63    108ms  18.9%  1.71ms   66.6MiB  40.9%  1.06MiB
         init_elements!                        63   59.6ms  10.5%   947μs   39.8MiB  24.4%   647KiB
           calc_jacobian_matrix!            27.0k   28.4ms  4.99%  1.05μs   23.1MiB  14.2%     896B
           calc_node_coordinates!              63   14.4ms  2.54%   229μs   16.7MiB  10.3%   272KiB
           ~init_elements!~                    63   7.23ms  1.27%   115μs   3.31KiB  0.00%    53.8B
           calc_inverse_jacobian!           27.0k   5.14ms  0.90%   190ns     0.00B  0.00%    0.00B
           calc_contravariant_vectors!      27.0k   4.41ms  0.77%   163ns     0.00B  0.00%    0.00B
         init_interfaces!                      63   10.6ms  1.86%   168μs   6.36MiB  3.90%   103KiB
         count_required_interfaces             63   10.4ms  1.82%   165μs   8.62MiB  5.29%   140KiB
         init_mortars!                         63   9.17ms  1.61%   146μs   6.64MiB  4.08%   108KiB
         count_required_mortars                63   7.72ms  1.36%   123μs   5.04MiB  3.09%  81.9KiB
         init_boundaries!                      63   5.20ms  0.91%  82.6μs   10.8KiB  0.01%     176B
         count_required_boundaries             63   4.95ms  0.87%  78.6μs   5.91KiB  0.00%    96.0B
         ~reinitialize_containers!~            63    134μs  0.02%  2.12μs   10.5KiB  0.01%     170B
         resize elements                       63   47.3μs  0.01%   751ns   38.4KiB  0.02%     624B
         resize interfaces                     63   32.4μs  0.01%   514ns   21.7KiB  0.01%     352B
         resize mortars                        63   28.3μs  0.00%   449ns   21.7KiB  0.01%     352B
         resize boundaries                     63   14.1μs  0.00%   224ns   6.89KiB  0.00%     112B
       ~solver~                                63   1.82ms  0.32%  28.8μs   3.53MiB  2.16%  57.3KiB
     mesh                                      63   12.0ms  2.11%   191μs   1.78MiB  1.09%  29.0KiB
       rebalance                              126   10.0ms  1.75%  79.1μs   1.97KiB  0.00%    16.0B
       ~mesh~                                  63   1.44ms  0.25%  22.9μs   1.27MiB  0.78%  20.6KiB
       coarsen!                                63    622μs  0.11%  9.88μs    526KiB  0.31%  8.34KiB
     ~coarsen~                                 63   33.1μs  0.01%   525ns   1.66KiB  0.00%    26.9B
   indicator                                   63   13.3ms  2.33%   210μs   12.3MiB  7.52%   199KiB
   ~AMR~                                       63    390μs  0.07%  6.20μs   2.48KiB  0.00%    40.4B
 rhs!                                       1.60k    232ms  40.8%   145μs    586KiB  0.35%     375B
   volume integral                          1.60k   84.5ms  14.8%  52.8μs     0.00B  0.00%    0.00B
   interface flux                           1.60k   73.5ms  12.9%  45.9μs     0.00B  0.00%    0.00B
   prolong2interfaces                       1.60k   34.5ms  6.06%  21.6μs     0.00B  0.00%    0.00B
   mortar flux                              1.60k   11.5ms  2.01%  7.17μs     0.00B  0.00%    0.00B
   surface integral                         1.60k   10.9ms  1.91%  6.78μs     0.00B  0.00%    0.00B
   prolong2mortars                          1.60k   8.11ms  1.42%  5.07μs     0.00B  0.00%    0.00B
   Jacobian                                 1.60k   6.09ms  1.07%  3.80μs     0.00B  0.00%    0.00B
   ~rhs!~                                   1.60k   2.21ms  0.39%  1.38μs    586KiB  0.35%     375B
   reset ∂u/∂t                              1.60k    863μs  0.15%   539ns     0.00B  0.00%    0.00B
   prolong2boundaries                       1.60k   73.4μs  0.01%  45.8ns     0.00B  0.00%    0.00B
   boundary flux                            1.60k   52.5μs  0.01%  32.8ns     0.00B  0.00%    0.00B
   source terms                             1.60k   51.3μs  0.01%  32.0ns     0.00B  0.00%    0.00B
 analyze solution                               5   21.9ms  3.84%  4.37ms    457KiB  0.27%  91.3KiB
 I/O                                           11   19.2ms  3.37%  1.74ms   1.03MiB  0.63%  95.7KiB
   ~I/O~                                       11   13.4ms  2.36%  1.22ms    255KiB  0.15%  23.2KiB
   save solution                                5   5.54ms  0.97%  1.11ms    756KiB  0.45%   151KiB
   get element variables                        5    195μs  0.03%  39.0μs   40.5KiB  0.02%  8.10KiB
   save mesh                                    5   1.30μs  0.00%   260ns     0.00B  0.00%    0.00B
 calculate dt                                 321   7.45ms  1.31%  23.2μs     0.00B  0.00%    0.00B
 initial condition AMR                          1   4.92ms  0.86%  4.92ms   3.17MiB  1.94%  3.17MiB
   AMR                                          3   4.67ms  0.82%  1.56ms   3.17MiB  1.94%  1.06MiB
     refine                                     3   4.07ms  0.71%  1.36ms   2.58MiB  1.58%   880KiB
       solver                                   3   3.54ms  0.62%  1.18ms   2.55MiB  1.57%   872KiB
         reinitialize_containers!               2   3.48ms  0.61%  1.74ms   2.45MiB  1.50%  1.22MiB
           init_elements!                       2   1.97ms  0.35%   987μs   1.18MiB  0.73%   605KiB
             calc_jacobian_matrix!            800    968μs  0.17%  1.21μs    700KiB  0.42%     896B
             calc_node_coordinates!             2    472μs  0.08%   236μs    507KiB  0.30%   254KiB
             ~init_elements!~                   2    213μs  0.04%   107μs   3.31KiB  0.00%  1.66KiB
             calc_contravariant_vectors!      800    162μs  0.03%   202ns     0.00B  0.00%    0.00B
             calc_inverse_jacobian!           800    159μs  0.03%   198ns     0.00B  0.00%    0.00B
           init_interfaces!                     2    340μs  0.06%   170μs    194KiB  0.12%  97.1KiB
           count_required_interfaces            2    309μs  0.05%   155μs    263KiB  0.16%   132KiB
           init_mortars!                        2    242μs  0.04%   121μs    194KiB  0.12%  97.1KiB
           count_required_mortars               2    208μs  0.04%   104μs    152KiB  0.09%  76.2KiB
           init_boundaries!                     2    155μs  0.03%  77.5μs      352B  0.00%     176B
           count_required_boundaries            2    147μs  0.03%  73.6μs      192B  0.00%    96.0B
           resize elements                      2   73.3μs  0.01%  36.6μs    385KiB  0.23%   193KiB
           resize interfaces                    2   12.1μs  0.00%  6.05μs   80.8KiB  0.05%  40.4KiB
           ~reinitialize_containers!~           2   10.3μs  0.00%  5.15μs   10.5KiB  0.01%  5.24KiB
           resize mortars                       2   4.60μs  0.00%  2.30μs   14.0KiB  0.01%  6.98KiB
           resize boundaries                    2   1.10μs  0.00%   550ns      224B  0.00%     112B
         ~solver~                               3   64.4μs  0.01%  21.5μs    109KiB  0.07%  36.4KiB
       mesh                                     3    526μs  0.09%   176μs   24.1KiB  0.01%  8.05KiB
         rebalance                              6    452μs  0.08%  75.4μs     96.0B  0.00%    16.0B
         refine                                 3   44.8μs  0.01%  14.9μs     0.00B  0.00%    0.00B
         ~mesh~                                 3   29.5μs  0.01%  9.83μs   24.0KiB  0.01%  8.02KiB
       ~refine~                                 3   1.30μs  0.00%   433ns   1.66KiB  0.00%     565B
     indicator                                  3    569μs  0.10%   190μs    503KiB  0.30%   168KiB
     ~AMR~                                      3   30.9μs  0.01%  10.3μs   98.5KiB  0.06%  32.8KiB
     coarsen                                    3    400ns  0.00%   133ns      240B  0.00%    80.0B
   ~initial condition AMR~                      1    253μs  0.04%   253μs      848B  0.00%     848B
 ──────────────────────────────────────────────────────────────────────────────────────────────────

A considerable amount of time comes from init_elements!, especially calc_jacobian_matrix!. All the count_required_* and init_interfaces!/init_boundaries!/init_mortars! calls (which are all iter_faces calls internally) add up to a similar amount of time.

I also confirmed this with ProfileView, but I don't see a way to post this here properly.

@sloede
Member

sloede commented Jun 2, 2021

OK, thanks for doing these first analysis steps. If it is not too difficult, it would be good to post the commands you used for these benchmarks here so that we can pick this up at a later point.

I think the good news is that most of the time is spent in the solver, where we have more room for improvement, rather than directly in p4est. The rebalance cost will be cut in half as well, once cburstedde/p4est#112 is resolved.

Thus for now I think this can be left as-is, and you can further concentrate on new features.

@efaulhaber
Member Author

I just ran elixir_advection_amr.jl with mesh = TreeMesh(... replaced by

mesh = P4estMesh((1, 1), polydeg=3,
                 coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                 initial_refinement_level=4)

and added some more @timed.
To benchmark reinitialize_containers!, which is where most of the time is spent during AMR, one can use

@benchmark Trixi.reinitialize_containers!($mesh, $equations, $solver, $semi.cache)
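
The $ in the call above interpolates the arguments into the benchmark expression so that they are not treated as untyped globals. Without BenchmarkTools.jl, a rough approximation is to warm up the function and take the minimum over several runs. The sketch below uses a hypothetical stand-in function refill! rather than Trixi.reinitialize_containers!, and rough_benchmark is an illustrative helper, not a library function.

```julia
# Hypothetical stand-in for an expensive in-place routine.
function refill!(buffer, values)
    for i in eachindex(buffer, values)
        buffer[i] = 2 * values[i]
    end
    return buffer
end

# Run `f(args...)` once to compile it, then report the minimum time (in seconds)
# and the minimum number of allocated bytes over `repetitions` runs.
function rough_benchmark(f, args...; repetitions = 100)
    f(args...) # warm-up call so that compilation is not measured
    best_time = Inf
    best_bytes = typemax(Int)
    for _ in 1:repetitions
        stats = @timed f(args...)
        best_time = min(best_time, stats.time)
        best_bytes = min(best_bytes, stats.bytes)
    end
    return (time = best_time, bytes = best_bytes)
end

values = rand(1000)
buffer = similar(values)
result = rough_benchmark(refill!, buffer, values)
println("minimum time: ", result.time, " s, minimum allocations: ", result.bytes, " bytes")
```

Because the arguments are passed through a function call, the measured code sees concretely typed values, which is the same effect the $ interpolation has in @benchmark.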

@ranocha
Member

ranocha commented Jun 4, 2021

Request my review on your related PR when you feel it is ready, and I can have a look at speeding up the critical parts.

@efaulhaber
Member Author

Alright, thanks!

@ranocha
Member

ranocha commented Jun 6, 2021

I ran your benchmark setup above and have some prototype code which reduces the time from

 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                         1.61k    232ms  33.8%   145μs   53.5MiB  29.9%  34.1KiB
[...]
 AMR                             64    221ms  32.2%  3.46ms    121MiB  67.7%  1.89MiB
   refine                        64    125ms  18.2%  1.95ms   54.7MiB  30.6%   875KiB
     solver                      64    113ms  16.4%  1.76ms   54.2MiB  30.3%   867KiB
     mesh                        64   12.2ms  1.78%   191μs    515KiB  0.28%  8.05KiB
       rebalance                128   11.2ms  1.63%  87.4μs   2.00KiB  0.00%    16.0B
       refine                    64    577μs  0.08%  9.02μs     0.00B  0.00%    0.00B
       ~mesh~                    64    467μs  0.07%  7.29μs    513KiB  0.28%  8.02KiB
     ~refine~                    64   34.8μs  0.01%   544ns   1.66KiB  0.00%    26.5B
   coarsen                       64   85.8ms  12.5%  1.34ms   53.9MiB  30.1%   862KiB
     solver                      64   73.1ms  10.6%  1.14ms   52.1MiB  29.1%   833KiB
     mesh                        64   12.6ms  1.84%   197μs   1.81MiB  1.01%  29.0KiB
       rebalance                128   10.9ms  1.59%  85.3μs   2.00KiB  0.00%    16.0B
       ~mesh~                    64   1.11ms  0.16%  17.3μs   1.29MiB  0.72%  20.6KiB
       coarsen!                  64    607μs  0.09%  9.49μs    534KiB  0.29%  8.34KiB
     ~coarsen~                   64   36.6μs  0.01%   572ns   1.66KiB  0.00%    26.5B
   indicator                     64   10.4ms  1.51%   162μs   12.4MiB  6.96%   199KiB
   ~AMR~                         64    243μs  0.04%  3.79μs   2.48KiB  0.00%    39.8B
 initial condition AMR            1   16.6ms  2.42%  16.6ms   2.59MiB  1.45%  2.59MiB
   AMR                            3   15.6ms  2.27%  5.19ms   2.59MiB  1.45%   885KiB
     refine                       3   13.4ms  1.95%  4.48ms   2.01MiB  1.12%   684KiB
       solver                     3   11.0ms  1.61%  3.68ms   1.98MiB  1.11%   676KiB
       mesh                       3   2.37ms  0.35%   791μs   24.1KiB  0.01%  8.05KiB
         rebalance                6   2.15ms  0.31%   359μs     96.0B  0.00%    16.0B
         refine                   3    134μs  0.02%  44.6μs     0.00B  0.00%    0.00B
         ~mesh~                   3   86.6μs  0.01%  28.9μs   24.0KiB  0.01%  8.02KiB
       ~refine~                   3   8.06μs  0.00%  2.69μs   1.66KiB  0.00%     565B
     indicator                    3   2.07ms  0.30%   691μs    503KiB  0.27%   168KiB
     ~AMR~                        3   74.9μs  0.01%  25.0μs   98.5KiB  0.05%  32.8KiB
     coarsen                      3   1.28μs  0.00%   428ns      240B  0.00%    80.0B
   ~initial condition AMR~        1   1.07ms  0.16%  1.07ms      848B  0.00%     848B

in your PR to

 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                         1.61k    232ms  42.5%   144μs   53.5MiB  62.1%  34.1KiB
[...]
 AMR                             64    124ms  22.7%  1.93ms   29.8MiB  34.6%   477KiB
   refine                        64   59.0ms  10.8%   922μs   8.02MiB  9.30%   128KiB
     solver                      64   45.8ms  8.40%   715μs   7.51MiB  8.71%   120KiB
     mesh                        64   13.2ms  2.42%   206μs    515KiB  0.58%  8.05KiB
       rebalance                128   12.1ms  2.22%  94.5μs   2.00KiB  0.00%    16.0B
       refine                    64    596μs  0.11%  9.31μs     0.00B  0.00%    0.00B
       ~mesh~                    64    502μs  0.09%  7.85μs    513KiB  0.58%  8.02KiB
     ~refine~                    64   29.7μs  0.01%   465ns   1.66KiB  0.00%    26.5B
   coarsen                       64   55.0ms  10.1%   860μs   9.32MiB  10.8%   149KiB
     solver                      64   42.0ms  7.71%   657μs   7.51MiB  8.71%   120KiB
     mesh                        64   13.0ms  2.38%   202μs   1.81MiB  2.10%  29.0KiB
       rebalance                128   10.9ms  2.00%  85.3μs   2.00KiB  0.00%    16.0B
       ~mesh~                    64   1.19ms  0.22%  18.7μs   1.29MiB  1.49%  20.6KiB
       coarsen!                  64    849μs  0.16%  13.3μs    534KiB  0.60%  8.34KiB
     ~coarsen~                   64   32.7μs  0.01%   511ns   1.66KiB  0.00%    26.5B
   indicator                     64   9.50ms  1.74%   149μs   12.4MiB  14.4%   199KiB
   ~AMR~                         64    234μs  0.04%  3.66μs   2.48KiB  0.00%    39.8B
 initial condition AMR            1   3.17ms  0.58%  3.17ms   1.30MiB  1.50%  1.30MiB
   AMR                            3   2.88ms  0.53%   959μs   1.30MiB  1.50%   442KiB
     refine                       3   2.42ms  0.44%   806μs    725KiB  0.82%   242KiB
       solver                     3   1.74ms  0.32%   579μs    700KiB  0.79%   233KiB
       mesh                       3    680μs  0.12%   227μs   24.1KiB  0.03%  8.05KiB
         rebalance                6    612μs  0.11%   102μs     96.0B  0.00%    16.0B
         refine                   3   42.4μs  0.01%  14.1μs     0.00B  0.00%    0.00B
         ~mesh~                   3   26.1μs  0.00%  8.69μs   24.0KiB  0.03%  8.02KiB
       ~refine~                   3   2.33μs  0.00%   776ns   1.66KiB  0.00%     565B
     indicator                    3    437μs  0.08%   146μs    503KiB  0.57%   168KiB
     ~AMR~                        3   22.1μs  0.00%  7.37μs   98.5KiB  0.11%  32.8KiB
     coarsen                      3    352ns  0.00%   117ns      240B  0.00%    80.0B
   ~initial condition AMR~        1    291μs  0.05%   291μs      848B  0.00%     848B

using julia --num-threads=1 --check-bounds=no. I will create a PR once #618 is merged into dev and this part is finished (maybe in a few days). I would like to turn this into a blog post (like a tutorial) to explain the steps and my reasoning behind some of the changes, if that's okay with you.

@efaulhaber
Member Author

Perfect, thank you! A blog post sounds awesome!
