bug fix: disable concurrency in GFS_phys_time_vary_init NetCDF calls AND #2040 and #2043 #2041
Conversation
This PR is innocuous and self-contained. It can be combined with just about anything. Ideally, it should be merged with PRs that don't change the results. |
@SamuelTrahanNOAA I've been talking with the NetCDF folks. They have been working on a thread-safe NetCDF, and we're just finding out whether that work has been merged or will be soon. Moving to whatever version of NetCDF that ends up being will of course take a while. That was FYI. My question: is this change easily "reversible" once the thread-safe NetCDF is available? |
The thread-safe NetCDF may not be available on all machines. We would need a way to turn on and off threading during that region. Turning it back on 100% of the time is a simple matter of putting the OpenMP directives back in. I don't think we'll want it on unconditionally. How to turn it on conditionally... we need to research and design. The NetCDF library would have to be able to report that it is thread safe. (Even if the code is capable of being built thread-safe, it may not necessarily build that way everywhere. It might be a compile-time flag when building the library.) Certainly, we need a flag to force the code to run single-threaded regardless of NetCDF library, in case something goes wrong. And then there's the matter of deciding how to turn off threading in that region. |
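To make the on/off design question above concrete, here is a minimal sketch in Python rather than the model's Fortran/OpenMP. Everything in it is hypothetical: `run_readers`, `library_is_thread_safe`, and `force_serial` are illustrative names, and how the NetCDF library would actually report thread safety is still an open question. It only shows the shape of the decision: run concurrently when the library is known to be safe and nobody has forced serial mode, otherwise read one at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def run_readers(readers, library_is_thread_safe, force_serial=False):
    """Call each reader and return the results in the order given.

    readers: list of zero-argument callables (hypothetical stand-ins
        for the individual file reads).
    library_is_thread_safe: assumed to come from some capability check
        that does not exist yet in this discussion.
    force_serial: override that always disables concurrency.
    """
    if force_serial or not library_is_thread_safe:
        # Safe default: one read at a time, in order.
        return [reader() for reader in readers]

    # Concurrent path, used only when explicitly allowed.
    with ThreadPoolExecutor(max_workers=max(1, len(readers))) as pool:
        futures = [pool.submit(reader) for reader in readers]
        return [future.result() for future in futures]
```

With this shape, the serial path is the fallback whenever the capability is unknown, which matches the concern that the library may not be built thread-safe everywhere.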
There's one other consideration. We're better off using netcdf-parallel to read across all ranks. It'll stress the filesystem less than we're doing now. Ultimately, it is the fastest option. This would be done at the model level, where we read most fix files already. |
I've merged the top of #2040 and #2043 into this PR's branch. Unfortunately, two hashes mismatch for #2043:
The one in #2043 doesn't exist in the repository. I'm pointing to NOAA-PSL/stochastic_physics#70 instead. I need the correct hash from @yuanxue2870 before I continue testing. |
It turns out this hash is correct: See that page for the discussion. |
@DusanJovic-NOAA - Could you please confirm I merged your changes into this branch correctly? |
I think you need to update .gitmodules to point to your fv3atm fork/branch. |
I'm still getting a failure for regional_atmaq on Gaea C5, here: And the regional_netcdf_parallel_intel failure on Hercules from issue 2015 continues, and I haven't been able to get it to pass yet. It seems to be more persistent than in the past. |
Nothing in this PR should affect that job. I have a ticket open for a similar UPP failure. With certain inputs, UPP will freeze forever in MPI_Finalize. This only happens on GAEA-C5. The admins have passed the ticket on to the vendor. I haven't heard back in several weeks.
The NetCDF fixes in here are for reading NetCDF files, not writing. I don't expect them to fix #2015. |
I added a commit message to the PR description:
|
Could @DusanJovic-NOAA and @yuanxue2870 please review the commit message?
When this PR is merged, that text will be the description of the changes recorded forever in the repository. |
@SamuelTrahanNOAA You make a good point about "combined" PRs. In the future, the PR selected to carry the individual PRs will be able to just grab and add. |
Looks good. |
Looks good to me.
Thanks,
Yuan
On Tue, Dec 19, 2023 at 10:44 AM Dusan Jovic wrote:
> Could @DusanJovic-NOAA and @yuanxue2870 please review the commit message?
>
> General clean-up, and CCPP concurrency bug fixes
> 1. Remove nfhout, nfhmax_hf, nfhout_hf and nsout from fv3atm and the regression tests.
> 2. Add comments to smc pert and fix bug in stc pert
> 3. Disable concurrency in NetCDF calls within CCPP GFS_phys_time_vary_init subroutine to avoid crashes
>
> When this PR is merged, that text will be the description of the changes recorded forever in the repository.
>
> Looks good.
|
General question here about commit messages, since this PR is next. Should all commit messages list which subcomponents were changed, say with a line like "Components Updated: FV3, stochastic physics"? |
…s-weather-model into init-concurrency-bug
@SamuelTrahanNOAA I've made changes to your commit messages as I want to see how the links for the issues travel inside the commit message. I hope that this works. |
This PR is ready for final testing and merge. I have reverted .gitmodules. All subcomponent hashes match the head of their correct branch. |
PR Author Checklist:
Description
NOTE: Two PRs are included in this one
Occasionally, the model will fail in FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90 during the first physics timestep. This is caused by heap corruption earlier, during gfs_phys_time_vary_init. After much debugging, I tracked down the problem.
The NetCDF library is not thread-safe:
We're calling a non-thread-safe library in a threaded region. There are multiple NetCDF calls going concurrently. The fix is to read the NetCDF files one at a time.
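As an illustration of the failure mode and of what "one at a time" means, here is a small Python sketch. It is not the change in this PR, which addresses the problem in the Fortran code by not running the NetCDF reads concurrently. `FakeReader`, `read_all`, and the file names are hypothetical stand-ins. The sketch keeps the worker threads but uses a single lock so that only one call into the library-like object is ever in flight.

```python
import threading

# Hypothetical: one lock guarding every call into the non-thread-safe library.
_read_lock = threading.Lock()


class FakeReader:
    """Stand-in for a non-thread-safe library; records overlapping calls."""

    def __init__(self):
        self.active = 0        # calls currently inside read()
        self.max_active = 0    # highest overlap ever observed

    def read(self, name):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        data = f"contents of {name}"
        self.active -= 1
        return data


def read_all(reader, names):
    """Read every file from its own thread, but never two at once."""
    results = {}

    def worker(name):
        with _read_lock:  # serializes the library call
            results[name] = reader.read(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every call to `read` happens while the lock is held, `max_active` can never exceed 1, which is the property the serial reads in this PR are meant to guarantee.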
Commit Message
General clean-up, and CCPP concurrency bug fixes
Linked Issues and Pull Requests
Associated UFSWM Issue to close
These two PRs are combined with this one. They should be closed when this is merged.
Subcomponent Pull Requests
Blocking Dependencies
None
Subcomponents involved:
Anticipated Changes
Input data
Regression Tests:
Libraries
Code Managers Log
Testing Log: