Implement transeq in omp backend #27
still WIP, compiles but not tested
src/omp/backend.f90
subroutine transeq_omp_dist(self, du, dv, dw, u, v, w, dirps)
  implicit none

  class(omp_backend_t) :: self
  class(field_t), intent(inout) :: du, dv, dw
  class(field_t), intent(in) :: u, v, w
  type(dirps_t), intent(in) :: dirps
  class(field_t), pointer :: duu, d2u, uu, du_temp

  ! du
  du_temp => self%allocator%get_block()
  call tds_solve_omp(self, du_temp, u, dirps, dirps%der1st)

  duu => self%allocator%get_block()
  uu => self%allocator%get_block()
  call vecmul_omp(uu, u, u, dirps)
  call tds_solve_omp(self, duu, uu, dirps, dirps%der1st_sym)

  d2u => self%allocator%get_block()
  call tds_solve_omp(self, d2u, u, dirps, dirps%der2nd)

end subroutine transeq_omp_dist
This is not the way we want to implement transeq on the OpenMP backend. Implemented this way, we can't take advantage of the cache to reduce data movement.
We should instead implement a subroutine similar to exec_dist_tds_compact.
https://github.com/xcompact3d/x3d2/blob/main/src/omp/exec_dist.f90#L14C15-L14C36
The idea is to call der_univ_dist to get du/dx, duu/dx, and d2u/dx2 one after another as we loop through the batches of lines in the domain (that is, a loop over the 3rd dimension of our arrays, aka the group number).
It should look like the loop below, but with 3 distinct calls to der_univ_dist inside the loop, each with the correct set of tdsops instances.
https://github.com/xcompact3d/x3d2/blob/main/src/omp/exec_dist.f90#L40-L47
This will also make an operation like vecmul_omp unnecessary, saving some data movement.
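A minimal sketch of the loop structure suggested above, under stated assumptions: `toy_der` is a hypothetical periodic 3-point stencil standing in for der_univ_dist (whose real signature is not reproduced here), the coefficient arrays `c_der1` and `c_der2` stand in for tdsops instances, and forming the `u*u` product in a small per-thread temporary is only one possible way to avoid a separate vecmul pass.

```fortran
program fused_block_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: SZ = 4, n = 16, n_block = 8
  ! Hypothetical stencil coefficients, standing in for tdsops instances
  real(dp), parameter :: c_der1(3) = [-0.5_dp, 0.0_dp, 0.5_dp]
  real(dp), parameter :: c_der2(3) = [1.0_dp, -2.0_dp, 1.0_dp]
  real(dp), dimension(SZ, n, n_block) :: u, du, duu, d2u
  real(dp) :: uu(SZ, n)   ! block-sized scratch for u*u
  integer :: k

  call random_number(u)

  ! Single pass over the blocks: the slice u(:, :, k) is loaded once and
  ! reused by all three derivative calls while it is still in cache.
  !$omp parallel do private(uu)
  do k = 1, n_block
    uu = u(:, :, k)*u(:, :, k)
    call toy_der(du(:, :, k), u(:, :, k), c_der1)
    call toy_der(duu(:, :, k), uu, c_der1)
    call toy_der(d2u(:, :, k), u(:, :, k), c_der2)
  end do
  !$omp end parallel do

  print *, 'sample du, duu, d2u:', du(1, 2, 1), duu(1, 2, 1), d2u(1, 2, 1)

contains

  ! Hypothetical stand-in kernel: periodic 3-point stencil along dimension 2
  subroutine toy_der(out, f, c)
    real(dp), intent(out) :: out(:, :)
    real(dp), intent(in) :: f(:, :)
    real(dp), intent(in) :: c(3)
    integer :: j, jm, jp, m
    m = size(f, 2)
    do j = 1, m
      jm = modulo(j - 2, m) + 1   ! periodic left neighbour
      jp = modulo(j, m) + 1       ! periodic right neighbour
      out(:, j) = c(1)*f(:, jm) + c(2)*f(:, j) + c(3)*f(:, jp)
    end do
  end subroutine toy_der

end program fused_block_loop
```

The contrast with the first version is that no field-sized `uu` array is ever allocated or written; only the `(SZ, n)` scratch is touched per block.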
ah yes, makes sense indeed. I will do that
still WIP, not tested and not cleaned up
Just reimplemented following your comment. Note that, for now, I have kept the [...]
Still not ready to be merged: I need to clean up the naming convention I used, test it, and add a test.
also moves halo exchange to its own function
Looks great! I think now we get all the benefit from CPU cache.
src/omp/backend.f90
  call sendrecv_fields(self%w_recv_s, self%w_recv_e, self%w_send_s, self%w_send_e, &
                       SZ*n_halo*dirps%n_blocks, dirps%nproc, dirps%pprev, dirps%pnext)

end subroutine
Is it okay to omit the subroutine name here? On GitHub it broke the syntax highlighting for me in the subroutine below.
Yes, it is never a requirement (maybe it used to be in older standards?). It is weird indeed that GitHub's syntax highlighting doesn't handle it. I can add it if needed.
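For illustration, a small self-contained sketch showing both forms of the end statement side by side; the module and procedure names are made up for this example. Both should compile with a standard-conforming compiler, since the name after `end subroutine` is optional.

```fortran
module end_name_demo
  implicit none
contains

  subroutine with_name()
    print *, 'end statement repeats the name'
  end subroutine with_name

  subroutine without_name()
    print *, 'end statement omits the name'
  end subroutine

end module end_name_demo

program main
  use end_name_demo
  implicit none
  call with_name()
  call without_name()
end program main
```

Repeating the name is purely a readability and tooling choice; when present, the compiler checks that it matches the subroutine's name.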
test fails for now, either bug in implementation or in the test itself
I would only test
Also, it's a good idea to test the performance of the
Added a test for it too. Now we are testing both.
Now that we've decided to separate performance and verification/unit tests, I think I will leave that for now until we have a framework in place (see #35).
Looks great! Just a few small suggestions.
for consistency with the rest of the codebase Co-authored-by: Jamie J Quinn <[email protected]>
The distributed solver requires up to 128/256 points per rank, depending on the particular compact scheme we solve. If you want to do parallel tests, the test can fail if you have too few points in a rank. We don't have a parallel testing environment yet, but can you confirm that the test passes when you run the executable by hand with multiple ranks? Because you use the default schemes I think 64 points per rank should be more than enough.
Yes, exactly. With 64 cells, it was failing with 4 cores (error ~ 7e-8) and working on 2. That's why I have increased it to 96 now so it works up to 4 cores with the tolerance that was set (1e-8).
I lowered the cell count just to run them faster. Went from 2s to 0.4 I think. For now it isn't a big deal, but when we have 10 or 20 tests, that quickly adds up. Being able to run many tests quickly makes you run them more often.
Making the temporaries private inside the parallel loops is very important, and once it is fixed I'm happy to merge. We can work on the performance related part in a new PR.
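As a minimal illustration of the point about private temporaries (array names and the arithmetic are made up for this sketch, not taken from the PR): a scratch buffer declared outside the loop is shared between threads by default, so it needs a `private` clause.

```fortran
program private_temporaries
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: SZ = 4, n = 8, n_block = 16
  real(dp) :: u(SZ, n, n_block), rhs(SZ, n, n_block)
  real(dp) :: tmp(SZ, n)   ! scratch buffer reused for every block
  integer :: k

  call random_number(u)

  ! Without private(tmp) all threads would share a single tmp and could
  ! overwrite each other's data; with it, each thread gets its own copy.
  !$omp parallel do private(tmp)
  do k = 1, n_block
    tmp = 2.0_dp*u(:, :, k)
    rhs(:, :, k) = tmp + 1.0_dp
  end do
  !$omp end parallel do

  print *, 'max deviation from serial result:', &
    maxval(abs(rhs - (2.0_dp*u + 1.0_dp)))
end program private_temporaries
```

The loop index `k` is private automatically in a `parallel do`; only the explicitly declared scratch arrays need listing.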
!$omp parallel do
do k = 1, n_block
  call der_univ_subs(du(:, :, k), &
I realised that here we're writing 3 field-sized arrays into main memory unnecessarily. It is potentially increasing the runtime by 20%.
In the second phase of the algorithm here we pass a part of `du`, `dud`, and `d2u` into der_univ_subs, and they're all rewritten in place. Then later we combine them in `rhs` for the final result. Ideally, we want `du`, `dud`, and `d2u` to be read once and `rhs` to be written only once. However, because of the way der_univ_subs works, the updated data in the `du` arrays after the der_univ_subs call gets written to main memory, even though we don't need this at all.
There are three ways we can fix this:
- In the parallel do loop in the second phase we can copy the relevant parts of the `du`, `dud`, and `d2u` arrays into `(SZ, n)` sized temporary arrays. Then we pass the temporary arrays into der_univ_subs, and at the end we use these temporaries to obtain the final `rhs`. This is the easiest solution but it may not be the best in terms of performance.
- We can write an alternative der_univ_subs and separate input and output arrays. This way we can pass a part of the `du` arrays as we do now, and pass a small temporary array as the output one. Because the `du` arrays will be input arrays, no data will be written to main memory. Then we can combine the temporaries to get `rhs`.
- If we're writing an alternative der_univ_subs to be used in transeq, we can go one step further and have a fused version of it. This would probably be the most performant solution. der_univ_subs is relatively lightweight so it isn't really hard to do so. The new subroutine can take all of `du`, `dud`, and `d2u` as input, and write the final result `rhs`.
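The first strategy can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `toy_subs_inplace` is a hypothetical in-place sweep standing in for der_univ_subs, `nu` is a made-up coefficient, and the final combination into `rhs` is a placeholder rather than the actual transeq formula.

```fortran
program copy_to_temporaries
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: SZ = 4, n = 8, n_block = 16
  real(dp), parameter :: nu = 0.1_dp   ! hypothetical coefficient
  real(dp), dimension(SZ, n, n_block) :: du, dud, d2u, rhs
  real(dp), dimension(SZ, n) :: du_t, dud_t, d2u_t   ! per-thread scratch
  integer :: k

  call random_number(du)
  call random_number(dud)
  call random_number(d2u)

  !$omp parallel do private(du_t, dud_t, d2u_t)
  do k = 1, n_block
    ! Read each field-sized array once, into small temporaries
    du_t = du(:, :, k)
    dud_t = dud(:, :, k)
    d2u_t = d2u(:, :, k)

    ! The in-place updates now modify only the temporaries, so the
    ! field-sized arrays are never written back
    call toy_subs_inplace(du_t)
    call toy_subs_inplace(dud_t)
    call toy_subs_inplace(d2u_t)

    ! Write the result once (placeholder combination)
    rhs(:, :, k) = -0.5_dp*(du_t + dud_t) + nu*d2u_t
  end do
  !$omp end parallel do

  print *, 'rhs(1, 1, 1) =', rhs(1, 1, 1)

contains

  ! Hypothetical stand-in for der_univ_subs: a simple backward sweep
  ! that overwrites its argument in place
  subroutine toy_subs_inplace(a)
    real(dp), intent(inout) :: a(:, :)
    integer :: j
    do j = size(a, 2) - 1, 1, -1
      a(:, j) = a(:, j) - 0.25_dp*a(:, j + 1)
    end do
  end subroutine toy_subs_inplace

end program copy_to_temporaries
```

The extra copies cost some work per block, which is why this is described as the easiest rather than the fastest option; the temporaries are small enough that they are likely to stay in cache.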
#40 is relevant here
Good points, I will have a think about it, but indeed having it in a new PR focusing on performance makes sense.
Alright, I think the best way to move forward is implementing the first strategy and checking how much of the peak BW we get. If it's good utilisation then maybe we don't need to bother with a new der_univ_subs at all.
closes #21