HEMCO restart tests fail #856
Following up on the HEMCO restart issue: I figured out that the problem is vertical regridding of 3D emissions using instantaneous surface pressure. Specifically, the minimum read frequency for HEMCO is 1 hour, on the hour (0:00, 1:00, etc.), but restarted runs that use a sub-hour timestep, e.g. 30 min, begin off the hour: the first run starts at 0:00 (okay), but subsequent runs start at 3:30, 6:30, etc. I am able to eliminate the differences by changing the timestep to 1 hour. Vertical regridding then uses identical pressures across all runs, since no regridding occurs off the hour in any of them. Solutions within HEMCO could include storing raw file data within HEMCO to be regridded each timestep, or sub-hour reads. Neither is great from a performance point of view. Another option, which is simpler but would impact you, is having restarted runs begin on the hour rather than at the first timestep after the hour. This is what we do in offline GEOS-Chem and within GEOS, and hence we have never had this problem before. Note that all of my testing used CAM-chem, not GEOS-Chem. Is there a reason you start the subsequent runs at the timestep after the time corresponding to the restart file ( To be more clear about how runs are currently split up in time within CAM, here is a breakdown of my own 30 min timestep runs. Compset: FCHIST_HCO. Run A: seven full timesteps (in GEOS-Chem CTM, HEMCO standalone, and GEOS this would only be six full timesteps). Run B (restarted): six full timesteps. All subsequent runs are also six full timesteps. Why does the initial run have seven? Using six would solve the HEMCO problem. It is also more intuitive for a 3 hour run with 30 minute time intervals to execute 6 full timesteps. |
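To illustrate the mismatch described above, here is a minimal Python sketch. It assumes (as a simplification of the description, not as HEMCO's actual code) that data are read and regridded once at run start and again whenever the clock lands exactly on the hour. The function name `regrid_pressure_time` and the 2019-01-01 date are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta


def regrid_pressure_time(run_start, step_time):
    """Return the time whose surface pressure was used for the data in effect at step_time.

    Assumption: emissions are read and regridded once at run start and then
    again whenever the clock lands exactly on the hour.
    """
    last_hour = step_time.replace(minute=0, second=0, microsecond=0)
    # A run cannot regrid before it starts, so an off-hour start regrids at the start time.
    return max(last_hour, run_start)


t0 = datetime(2019, 1, 1, 0, 0)                      # hypothetical start of the first run
step = t0 + timedelta(hours=3, minutes=30)           # the 3:30 timestep

continuous = regrid_pressure_time(t0, step)          # continuous run started at 0:00
restarted = regrid_pressure_time(step, step)         # restarted run beginning at 3:30

print("continuous run uses pressure from", continuous.time())
print("restarted run uses pressure from ", restarted.time())
```

Under these assumptions the continuous run is still using data regridded with the 3:00 pressure at the 3:30 step, while the restarted run regrids with the 3:30 pressure, so the two cannot match bit-for-bit.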
Another solution could be to store the last hour's edge pressures in the restart file and use those for the initial regrid upon restart. A cleaner solution, however, is to start runs that are restarted at integer number of hours on the hour rather than offset. |
That's great news that you tracked down this issue, @lizziel. @brian-eaton I don't remember why the model integrates for a half time-step past STOP_N. Can you refresh my memory? Thanks! |
In case it helps, here is what the execution would look like to fix the problem: Run A – six full timesteps Run B (restarted) – six full timesteps etc. |
The ERP_Ln9_* test does an initial run of 5 steps and then does a 4 step restart. For a model configuration with a 1/2 hour timestep, that means the model is restarting at 9000 seconds (you can see this in the timestamp of the restart file) and so HEMCO is initializing to a different state for timestep 6 than it would have if it were doing a continuous 6 timestep run. This clearly won't pass the restart test. The fact that CAM's timeloop does some calculations (and writes a confusing message about it to the log file) after the restart file is written and before control returns to the coupler is not relevant to this restart failure. The calculations that are done after the restart file is written are redone during the initialization phase of a restart run (don't ask why:). CAM has historically required being able to restart from an arbitrary timestep. The benefit of that is mainly for debugging purposes. It sounds like HEMCO could be made to restore its state at a non-hour boundary if it In the meantime I think we should adjust the restart tests with HEMCO to only restart on an hour boundary. I don't know the cime testing infrastructure well enough to know the best way to do this, but I assume it's not hard. One thing to keep on the radar, though, is that the following tests seem to randomly fail baseline comparisons, and this is not related to restarts: |
From a practical and user-friendly perspective, this seems like a requirement we should keep! We don't need the added hassle of making sure STOP_N lands on the hour for any arbitrary ATM_NCPL, especially if this becomes the default emissions driver in CAM7. Thanks for the clarification Brian. |
@brian-eaton - I would suggest we NOT change the HEMCO test so that it uses a setup which passes. Since HEMCO is targeted to be a default setup in CESM, I want to make sure we remember it is broken until it is fixed. I know it is a pain to have a continuously failing regression test, but in this case, I think it is important to keep it failing until a fix is provided for HEMCO. |
I would argue that having a test that fails for a known reason can mask an unanticipated failure. I think tests that pass are more informative. A way to keep track of a feature we want that doesn't exist is through an open issue. |
I support saving the last hour's edge pressure to use with HEMCO so that runs are allowed to start/stop off the hour. Separately, I am still confused about why my first run completed 7 full timesteps plus 1 partial. In my tests I did not use the ERP in the test suite but instead manually created two cases. One of the tests used STOP_N set to 3 while the other had it set to 6. I ran the one with STOP_N=3 twice, first using initial values and then using CONTINUE_RUN=TRUE. Unlike the ERP test, I would think this would do an even number of 30min timesteps. Is there a namelist parameter I am not setting correctly? |
@lizziel, if your tests are on derecho and you point me to them I'll take a look. |
@brian-eaton, thank you for taking a look. My single (6hr) and double (two 3hr) runs, both 30min and 1hr timesteps, are archived here:
I wrote a script to compare the cam.r.* files. Results are below. The first few lines show the files being compared. Results of 30min timestep test:
Results of 1hr timestep test:
Two consecutive run logs using 30min timestep, with grep results The log for the first 30min timestep run:
The log for the second 30min timestep run:
A 3rd run (not saved) starts at 6:30, consistent with the second run above. But the first run includes an extra timestep, which throws subsequent runs off to start on the off-hour. Note that I am getting the time based on HEMCO clock messages. Could the HEMCO clock be incorrect? The HEMCO clock is displayed before the read/regrid. Perhaps it is a half-hour late upon initialization? I am particularly interested to know if nstep 0 in the very first run is a sort of dummy timestep in which case nstep 1 would be 00:00 and indicate the HEMCO clock is indeed off. |
I also want to note, in case anyone tries to replicate this, that I am using a modified HEMCO configuration input file. The default is to read these emissions monthly. I updated it to read hourly. |
Hi @lizziel. I've looked at your runs and see that you only had a successful restart when you changed the cam timestep to 1hr. I misunderstood your previous comments to imply that if cam restarted on an hour boundary then the restart would be successful. But that does not appear to be the case. Your run in ~elundgren/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep using a 1/2 hr timestep does a 6 step initial run followed by a 6 step restart run. Your runs aren't producing any history output since the default of monthly average output is used. But if, for example, every timestep were output ( The lines of output in the cam.log file beginning with 'nstep, te' are total energy diagnostics. This output is not issued at the end of a timestep. It is issued after the dycore runs (from the call to CAM's internal clock is initialized to the initial time for the run and I think the bottom line here is that since HEMCO's state depends on CAM's state, the way to restart exactly will be to write either HEMCO's state, or the CAM state needed by HEMCO to recompute its state, to the restart file. This is a common thing for CAM's physics parameterizations to do. For example mozart chemistry uses the routines I hope this helps. I'm happy to clarify further as needed. |
Correct. However, I believe that if cam is restarted on an hour boundary it should be successful, since the emissions would be read and regridded in the restarted run in the same way as a continuous run without restart. I am trying to determine why this is not the case to ensure there is not a bug in HEMCO. HEMCO treats clock time as start of timestep until the timestep is complete. From your response it sounds like CAM time is set to end of timestep when the first timestep (nstep 1) begins, which would conflict with HEMCO's interpretation of time as start of timestep. This would explain why the start time of HEMCO clock in the restart run is 3:30 rather than 3:00. Could you confirm? |
Yes, CAM sets the "current" time to be at the end of the timestep interval. What's confusing is that it doesn't advance the clock at the beginning of a new timestep. Rather, it advances the clock after a completed timestep has been (or could have been) written to the history file. I'm not sure whether or not that distinction is important for this discussion. During the final call to During the initialization phase of the restart run, CAM's clock is set to match its state at the end of the previous run. In this example the clock is set to step 7 and the current time is set to 3:30. This is shown in the |
Yes, this makes sense. I will work on the HEMCO clock to fix this since it is indeed a bug. |
Setting the HEMCO clock to the previous time indeed fixes the issue in my 3hr tests when using 30min timesteps. I do not expect the 9s ERP test to pass but would expect a 3hr one to pass in the test suite. Until we have a comprehensive fix would you be okay if we add such a test to verify one passes but the other fails? I will do it locally and can add it to my PR. I will prepare a PR with my updates. In the meantime, could someone copy the following to inputdata? I do not have permissions to put files there. Copy: To: |
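The clock adjustment discussed above can be sketched as follows. This is only an illustration of the two conventions (CAM reporting the end of the timestep interval, HEMCO interpreting clock time as the start of the timestep), not the actual change made in HEMCO. The function name `hemco_clock_from_cam` and the 2019-01-01 date are hypothetical.

```python
from datetime import datetime, timedelta


def hemco_clock_from_cam(cam_current_time, dt_seconds):
    """Convert CAM's end-of-step "current" time to a start-of-step time.

    Assumption: the two clocks differ by exactly one model timestep.
    """
    return cam_current_time - timedelta(seconds=dt_seconds)


# At restart in the 30 min timestep example, CAM's current time is 3:30.
cam_time_at_restart = datetime(2019, 1, 1, 3, 30)    # hypothetical date
hemco_time = hemco_clock_from_cam(cam_time_at_restart, 1800)

print("CAM current time:", cam_time_at_restart.time())
print("HEMCO step start:", hemco_time.time())
```

With the one-timestep offset removed, the restarted run's first HEMCO step begins at 3:00, on the hour, which matches what a continuous run would see.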
I agree that we should add a restart test that works. I copied the HEMCO_Config file to the inputdata location. |
@brian-eaton, could you point me to where the first run of the ERP/ERS tests is set to 5 timesteps? I tried changing the total from 9 timesteps to 8 in the testdefs files but the result is still 5 timesteps for the first leg. Is it somewhere in cime? |
I don't know. I've looked at the cime testing documentation and don't see a way to control this using a standard test modifier. @jedwards4b, do you know, or can you point us in the right direction? |
I think you might have to define your own test type (not unheard of for CAM) because I think the issue is in
|
Does a 10 step test result in a 6 step run + a 4 step restart? |
Thanks @gold2718! I think we are using different cime versions since mine does not have what you show. However, I found the equivalent code which sets
Each of the different test types has the definition of |
@brian-eaton, I initially tested the equivalent with an 8 step test, in the hope that it would be 4+4. It was still 5, I believe because REST_N is set as STOP_N / 2 + 1, as @gold2718 found and as I also found in a separate location. |
Aha, @brian-eaton, I see now what you are saying. 10 / 2 + 1 is 6. I'll try that. Thanks! |
Doing a 10 step test indeed worked.
You can expect a PR from me with the fix soon. |
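The step split discussed above can be checked with a short Python sketch. It assumes integer division in REST_N = STOP_N / 2 + 1 (consistent with the 9 step test giving a 5 step first leg) and a 30 min (1800 s) timestep. The function name `split_test_steps` is hypothetical and is not part of cime.

```python
def split_test_steps(stop_n, dt_seconds=1800):
    """Return (first-leg steps, restart-leg steps, restart lands on the hour?).

    Assumption: REST_N = STOP_N // 2 + 1, using integer division.
    """
    rest_n = stop_n // 2 + 1                 # steps run before the restart file is written
    restart_steps = stop_n - rest_n          # steps left for the restarted leg
    restart_seconds = rest_n * dt_seconds    # model time at which the restart happens
    on_hour = restart_seconds % 3600 == 0
    return rest_n, restart_steps, on_hour


for stop_n in (8, 9, 10):
    rest_n, restart_steps, on_hour = split_test_steps(stop_n)
    print(f"STOP_N={stop_n}: {rest_n} + {restart_steps} steps, "
          f"restart at {rest_n * 1800} s, on the hour: {on_hour}")
```

Under these assumptions, the 8 and 9 step tests both restart at 9000 s (2:30), while only the 10 step test restarts at 10800 s (3:00), on an hour boundary.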
@lizziel Just returning from vacation and trying to catch up on the status of this issue - if I understand correctly the test will pass only if you specify a specific number of timesteps? (10 in this case) - that doesn't seem like an acceptable outcome to me - it should be able to pass regardless of the number of steps. |
Hi Jim. The issue is that HEMCO currently only restarts exactly from an hour boundary. Changing the number of steps to 10 allowed that to happen. We still have the 9 step test that fails to keep it on our radar. |
Got it. Thanks for the clarification. |
Please note that the fix I came up with solves the issue with ERS tests. ERP still fails, which needs further investigation. That test is the same as ERS but changes the tasks/threading counts on restart. |
And this also only fixes the issue for CAM-chem using HEMCO. GEOS-Chem will still fail ERS tests because it needs a few additional chemistry state fields beyond concentrations added to the restart file. |
Update on ERP tests: I can get identical results using different NTASKS if I turn off the 3D aviation emissions and the 2D CEDS ship emissions. All other emissions remain on. I am now looking into what could be different about these inputs that would cause task dependency. |
Also note there is continued discussion on a long-term fix for the ERS tests at my PR: #1221 |
What happened?
When HEMCO emissions are used, restarts are not bit-for-bit the same as continuous runs.
Regression testing gives this failure:
FAIL ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest
Until this is fixed, HEMCO configurations will not be supported.
What are the steps to reproduce the bug?
ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest
What CAM tag were you using?
cam6_3_118
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
No
Extra info
@jimmielin will need to provide hemco external updates to address this reproducibility / restart issue.