Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run the entire advection loop on device #34

Open
wants to merge 17 commits into
base: develop-1.3
Choose a base branch
from

Conversation

awnawab
Copy link
Contributor

@awnawab awnawab commented Oct 8, 2024

This PR hoists the data transfers out of the advection loop, making the entire timestep device resident and resulting in big reductions in the data offload cost. The attached plot shows the effects of the optimisations included herein.

The most invasive change is to OUTBLOCK. Whilst not a very computationally heavy routine, it needs to run on device in order to prevent copying all the state fields back to host. The indirection + accumulation pattern in OUTBLOCK wasn't compatible with how scalars are privatised to GPU threads. @jkousal32 your recent ifs-source PR conflicts with some of these changes, I'll resolve that conflict once I merge this back into develop.

One more optimisation PR is planned after this, which will focus on improving the GPU performance of the physics kernel.

ecwam_gpu

@awnawab awnawab changed the title Naan port wamodel Run the entire advection loop on device Oct 8, 2024
@awnawab awnawab marked this pull request as ready for review October 8, 2024 13:05
@awnawab
Copy link
Contributor Author

awnawab commented Oct 8, 2024

@jksoual32 and @jrbidlot, could you please also review this PR? Many thanks!

@jrbidlot
Copy link
Contributor

jrbidlot commented Oct 8, 2024

Ahmad,
silly question. Changing outblock affects the output, many of those are pure diagnostic parameters.
Have you checked that the output file id the same in both cases?

@awnawab
Copy link
Contributor Author

awnawab commented Oct 8, 2024

Ahmad, silly question. Changing outblock affects the output, many of those are pure diagnostic parameters. Have you checked that the output file id the same in both cases?

The validation passes both on CPU and GPU after these changes, so I am reasonably confident I haven't changed the results. But of course I will double check and run checksums on the output file before and after this change.

Just to be sure, the relevant file to check would be the path stored in CNORMWAMOUT_FILE right?

@awnawab
Copy link
Contributor Author

awnawab commented Oct 9, 2024

Ahmad, silly question. Changing outblock affects the output, many of those are pure diagnostic parameters. Have you checked that the output file id the same in both cases?

Just confirmed (using sha256sum) that the contents of CNORMWAMOUT_FILE are identical on CPU before and after this PR. On GPU they are not, but that is expected because now that we run more of ecWAM on device, we will get small round-off differences.

Copy link

@jkousal32 jkousal32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks okay to me!

Copy link
Contributor

@jrbidlot jrbidlot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ahmad,
I think there is a bug in outblock around old line 555

The tricky bit with outblock is that it produces many purely diagnostic output and so it would not be picked-up by checking the standard output norms.

Did you test with all possible output turned on?

IR=IR+1
IF (IPFGTBL(IR) /= 0) THEN
CALL SEBTMEAN (KIJS, KIJL, FL2ND, TEWH(IH-1), TEWH(IH), BOUT(KIJS,ITOBOUT(IR)))
IF (IPFGTBL(59) /= 0) THEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code incremented IR from 59 to 59+NTEWH, but the new code only has 59 as an index and so all output will go to index 59
The afterward 60 and so one will pick up the wrong parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent catch! There was another similar loop over NTRAIN earlier that I also missed. I've now fixed it and confirmed that the norms are identical even if all 87 output parameters are enabled. Many thanks for spotting this 🙏

Copy link
Contributor

@jrbidlot jrbidlot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We got it working! Very good.

It is always tricky when dealing with the diagnostic output.
We need to remember to trigger the full output from the model in that case.

Copy link
Contributor

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants