Error in Water Balance: "The model is losing water (ERRWAT is negative)" #135

Open
lrbison opened this issue Jul 17, 2024 · 2 comments · Fixed by #136

Comments

lrbison commented Jul 17, 2024

This issue appears when running WRF in dm+sm mode (distributed-memory MPI plus shared-memory OpenMP). It was reported on aarch64 (Graviton3, neoverse-v1). The symptom is that WRF calls MPI_Abort without printing any message. Re-running the same input often succeeds; failures are only occasional and typically occur on the first timestep.

Upon further investigation, it seems that a non-master thread is calling wrf_error_fatal from here: https://github.com/NCAR/noahmp/blob/release-v4.5-WRF/src/module_sf_noahmplsm.F#L1727. None of the messages are printed because all output in wrf_message is guarded by an !$OMP MASTER block, and the error appears to be triggered on non-master threads.
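To illustrate why the abort is silent, here is a minimal sketch in Python rather than Fortran. It is a model of the pattern, not the WRF code: `run_case`, `message`, and `error_fatal` are hypothetical names, the thread-identity check stands in for the !$OMP MASTER guard, and the assumption that the fatal-error routine logs through the guarded message routine is taken from the description above.

```python
import threading


def run_case(fail_on_worker):
    """Return (messages_printed, aborted) for an error raised on the master or on a worker."""
    master_id = threading.get_ident()  # the calling thread plays the OpenMP master
    printed = []
    aborted = []

    def message(text):
        # Stand-in for a logger whose output is wrapped in a master-only guard.
        if threading.get_ident() == master_id:
            printed.append(text)

    def error_fatal(text):
        message(text)         # silently dropped on non-master threads
        aborted.append(True)  # stand-in for the abort, which happens on any thread

    text = "The model is losing water (ERRWAT is negative)"
    if fail_on_worker:
        worker = threading.Thread(target=error_fatal, args=(text,))
        worker.start()
        worker.join()
    else:
        error_fatal(text)
    return printed, bool(aborted)


print(run_case(fail_on_worker=False))
print(run_case(fail_on_worker=True))
```

In this model both cases abort, but only the master-thread case records the message, which matches the symptom of an MPI_Abort with no output.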

With the print enabled, we found that a few grid points would occasionally lose water, by more than 0.1 but less than 1 kg/m^2 per timestep (dt). Comparing failing and successful runs showed that the scalar terms contributing to the water balance were identical; the primary difference was in the soil moisture. Diffing the output datasets showed no corrupt-looking data, only small differences induced by the stochastic energy flux methods.

Eventually I discovered what I believe to be the root cause: calculate_soil is assigned twice within noahmplsm. It is first set to .false., and then set to .true. if a modulo is 0. However, the variable is scoped to the whole module, so all threads share its storage. Thread B may therefore have passed this initialization block and be using the value while thread A is between its .false. and .true. assignments, so thread B can observe an inconsistent value of calculate_soil during the subroutine's execution.
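The interleaving described above can be reproduced deterministically in a small model. This is a Python sketch under stated assumptions, not the Noah-MP code: `module_state`, `step`, `interval`, and both function names are hypothetical, and the events exist only to force thread B to read inside thread A's window. The second function shows one possible remedy (a flag private to each call); it is not necessarily what the linked PR does.

```python
import threading

# Stand-in for a module-scoped variable: one storage location shared by every thread.
module_state = {"calculate_soil": False}


def observe_with_shared_flag(step, interval):
    """Force the bad interleaving and return the flag value that thread B reads."""
    a_cleared = threading.Event()
    b_has_read = threading.Event()
    seen = {}

    def thread_a():
        module_state["calculate_soil"] = False      # first assignment
        a_cleared.set()
        b_has_read.wait()                           # B reads inside this window
        if step % interval == 0:
            module_state["calculate_soil"] = True   # second assignment

    def thread_b():
        a_cleared.wait()
        seen["b"] = module_state["calculate_soil"]  # B is already past its own init
        b_has_read.set()

    threads = [threading.Thread(target=thread_a), threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen["b"]


def observe_with_local_flag(step, interval):
    """Same logic, but each thread keeps the flag in a variable private to its call."""
    seen = {}

    def worker(name):
        calculate_soil = False
        if step % interval == 0:
            calculate_soil = True
        seen[name] = calculate_soil

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(observe_with_shared_flag(step=0, interval=3))  # B sees False although the step qualifies
print(observe_with_local_flag(step=0, interval=3))
```

With the shared flag, thread B reads the transient .false. state even though the modulo condition holds; with the private flag, every thread sees the value it computed itself.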

lrbison commented Jul 22, 2024

Edit: I misread the status earlier; the PR has not been merged yet.

@lrbison lrbison closed this as completed Jul 22, 2024
@lrbison lrbison reopened this Jul 22, 2024
@cenlinhe
Collaborator

We will merge the PR very soon, after some internal testing, and will close this issue once it is merged. Thank you!
