evp1d failures #623
Comments
Some ideas carried from #568. If we believe the 2d implementation is robust, then this suggests there may still be a bug in the 1d implementation that's possibly triggered only on timescales of months to years. All our bit-for-bit testing with debug flags has been shorter than a year.
We could run QC with debug flags on with 1d and 2d and see what happens. I guess you can't restart the QC from shortly before the crash with debug flags on and expect it to crash again. Would that be worth a try?
The first thing I might try is to turn debug on with 1d and 2d and run 5 years, 1 year at a time with restarts, just to see if the models are bit-for-bit throughout and to see if the 1d with debug also fails. Then we might try running evp1d with optimization on but threading off. This might provide insight into any OpenMP issues. Depending on what we learn there, we might create a case with a restart just before the failure and then start debugging the actual abort.
Is this the type of error you get for CFL violations in CESM? I'm wondering if the 1D evp QC test just happens to be hitting one of these, and the 2D case barely misses it. (The "incremental" part of incremental remap assumes that ice moves no farther than 1 grid cell in 1 time step.) If restarting the 1D case with a reduced timestep runs through this point, CFL could be the culprit: there might not be anything wrong with the 1D evp implementation at all, just bad luck.
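To make the CFL idea concrete, here is a minimal sketch of the check implied by the incremental remap assumption (ice travels at most 1 grid cell per time step). The function names and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not taken from CICE or from the failing run.

```python
def cfl_number(speed_m_s: float, dt_s: float, dx_m: float) -> float:
    """Fraction of a grid cell the ice crosses in one time step."""
    return abs(speed_m_s) * dt_s / dx_m


def violates_cfl(speed_m_s: float, dt_s: float, dx_m: float) -> bool:
    """True if the ice would move farther than one grid cell in one step."""
    return cfl_number(speed_m_s, dt_s, dx_m) > 1.0


# Hypothetical values: 3 m/s ice speed, 10 km cell, 1 hour time step.
print(violates_cfl(3.0, 3600.0, 10_000.0))  # full time step
print(violates_cfl(3.0, 1800.0, 10_000.0))  # time step halved
```

Since the CFL number scales linearly with the time step, halving the step halves the distance travelled per step, which is why rerunning from a restart with a reduced timestep would be a cheap way to test this hypothesis.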
One other thing to add from #568. The QC error with 1d evp was a robust failure with both 9x4 (threaded) and 36x1 (no threading), which suggests OpenMP is not the problem.
While we're thinking about evp1d, we should also add some additional tests, like evp1d + revised evp.
I can run that tonight if there is room on the HPC.
I have run all tests with revised evp for 1d and 2d. All tests pass for both intel and gnu with full debug flags on.
The same failure is happening in testing for #621 with evp1d off. This looks like a bigger problem than evp1d. All tests fail at the end of 2008 with 2005 cycling data. I think this may be a calendar issue. The QC tests have leap years on, but we're cycling 2005 data. We probably need to set leap years off for QC tests from now on. This problem may have come in when we cleaned up the time manager. I will do additional testing.
I have this figured out and will be submitting a PR for it. It turns out the problem is with the forcing and model calendar being out of sync with leap years. The errors were not limited to the evp1d runs, but that didn't become clear until today.
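A rough illustration of the mismatch, assuming a run that starts in 2005 (the start year and all names below are assumptions for the sketch, not CICE code or namelist settings): with leap years on, the model calendar has 366 days in 2008, while the cycled 2005 forcing only has 365.

```python
import calendar

FORCING_YEAR = 2005  # non-leap year whose data is cycled as forcing


def days_in_year(year: int, use_leap_years: bool) -> int:
    """Length of a calendar year, optionally ignoring leap days."""
    if use_leap_years and calendar.isleap(year):
        return 366
    return 365


def first_out_of_sync_year(start_year: int, n_years: int, use_leap_years: bool):
    """First model year whose length differs from the cycled forcing year."""
    forcing_days = days_in_year(FORCING_YEAR, True)
    for year in range(start_year, start_year + n_years):
        if days_in_year(year, use_leap_years) != forcing_days:
            return year
    return None  # model and forcing calendars stay in sync


# Assumed 5-year run starting in 2005.
print(first_out_of_sync_year(2005, 5, use_leap_years=True))
print(first_out_of_sync_year(2005, 5, use_leap_years=False))
```

With leap years on, the first mismatched year in this sketch is 2008, consistent with the failures at the end of 2008; with leap years off, the two calendars never diverge.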
Perfect.
I reran the QC tests with evp1d with the calendar fix and everything is working fine and passes. The base uses the standard_2d evp solver and the test uses the shared_mem_1d solver. I tested with both OpenMP on and off for the evp1d.
Once the fix is merged to master, we can close this issue.
This is a follow up to #568 and #279.
The QC test with the evp1d failed after 4 years with
This needs to be further investigated.