Error in Water Balance: "The model is losing water (ERRWAT is negative)" #135
Comments
lrbison added a commit to lrbison/noahmp that referenced this issue (Jul 17, 2024)
We will merge the PR soon, after some internal testing, and will close this issue once it is merged. Thank you!
This bug fix has been included in the recent commit.
This issue appears when running WRF in dm+sm mode. It was reported on aarch64 (Graviton3: neoverse-v1). The symptom is that WRF calls MPI_Abort but doesn't print any message. However, re-running the same input often succeeds, and failures happen only occasionally (typically on the first timestep).
Upon further investigation, it seems that a non-master thread is calling wrf_error_fatal from here: https://github.com/NCAR/noahmp/blob/release-v4.5-WRF/src/module_sf_noahmplsm.F#L1727. However, none of the messages are printed, because in wrf_message all output is guarded by an `!$OMP MASTER` block, and the error is being triggered from non-master threads.

With the print enabled, we found that a few grid points would occasionally lose water on the order of >0.1 but <1 kg/m^2/dt. Investigation into the cause of the error showed that the scalar terms contributing to the water balance were identical between failing and successful runs. The primary difference was in the soil moisture. Diffing the output datasets showed no corrupt-looking data, only small differences induced by the stochastic energy flux methods.
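The silent abort described above can be illustrated with a small Python analogue. This is only a sketch of the pattern, not WRF's code: `guarded_message` and `report_from_worker` are hypothetical names, and Python's main thread stands in for the OpenMP master thread.

```python
import threading


def guarded_message(text, sink):
    """Record text only when called from the main thread.

    Hypothetical stand-in for an output routine whose body is wrapped
    in an OMP MASTER block: other threads fall through without output.
    """
    if threading.current_thread() is threading.main_thread():
        sink.append(text)


def report_from_worker(text):
    """Call guarded_message from a worker thread; return what was recorded."""
    sink = []
    worker = threading.Thread(target=guarded_message, args=(text, sink))
    worker.start()
    worker.join()
    return sink
```

Calling `report_from_worker("ERRWAT is negative")` returns an empty list: the worker reaches the error path, but the guard discards its message, which matches the symptom of an abort with no diagnostic.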
Eventually I discovered what I believe to be the root cause:
`calculate_soil` is assigned twice within `noahmplsm`. First it is set to `.false.`; then, if a modulo is 0, it is set to `.true.`. However, the variable is scoped to the whole module, so all threads share the storage of `calculate_soil`. This leaves the potential for thread B to have passed this initialization block and be using the value while thread A is between the `.false.` and `.true.` assignments, so that thread B observes an inconsistent value of `calculate_soil` during the subroutine execution.
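The interleaving described above can be reproduced deterministically in a small Python sketch. It is an illustration under stated assumptions, not the noahmp code: the names `itime`, `soil_update_steps`, `set_flag_shared`, `flag_local`, and `demo_race` are hypothetical, the `after_reset` hook exists only to hold the race window open, and `flag_local` shows one possible fix rather than the fix that was merged.

```python
import threading

# Module-level flag shared by every thread; stands in for calculate_soil.
calculate_soil = False


def set_flag_shared(itime, soil_update_steps, after_reset=None):
    """Two-step assignment to the shared flag, as described in the issue."""
    global calculate_soil
    calculate_soil = False                  # first assignment
    if after_reset is not None:
        after_reset()                       # test hook: hold the window open
    if itime % soil_update_steps == 0:
        calculate_soil = True               # second assignment
    return calculate_soil


def flag_local(itime, soil_update_steps):
    """Possible fix (assumption): one assignment into caller-private storage."""
    return itime % soil_update_steps == 0


def demo_race():
    """Return what thread B reads while thread A is between assignments."""
    global calculate_soil
    calculate_soil = True                   # B already ran its own setup
    a_has_reset = threading.Event()
    b_has_read = threading.Event()
    seen_by_b = []

    def hold_window():
        a_has_reset.set()                   # A has just written False
        b_has_read.wait(timeout=5)          # let B read before A writes True

    def thread_a():
        set_flag_shared(0, 1, after_reset=hold_window)

    def thread_b():
        a_has_reset.wait(timeout=5)
        seen_by_b.append(calculate_soil)    # reads the half-updated shared flag
        b_has_read.set()

    threads = [threading.Thread(target=thread_a),
               threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_by_b[0]
```

With `itime = 0` and `soil_update_steps = 1` the flag should be true for both threads, yet `demo_race()` returns `False` because thread B reads between thread A's two writes. Computing the value in a single assignment to a thread-private variable, as in `flag_local`, removes that window.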