Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HEMCO restart tests fail #856

Open
fvitt opened this issue Jul 17, 2023 · 33 comments
Open

HEMCO restart tests fail #856

fvitt opened this issue Jul 17, 2023 · 33 comments
Labels
bug Something isn't working correctly CoupledEval4 -wish list

Comments

@fvitt
Copy link

fvitt commented Jul 17, 2023

What happened?

When HEMCO emissions are used restarts are not bit-for-bit the same as continuous runs.
Regression testing give this failure:
FAIL ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest

Until this is fixed, HEMCO configurations will not be supported.

What are the steps to reproduce the bug?

ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest

What CAM tag were you using?

cam6_3_118

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

No

Extra info

@jimmielin will need to provide hemco external updates to address this reproducibility / restart issue.

@lizziel
Copy link
Collaborator

lizziel commented Nov 13, 2024

Following up about the HEMCO restart issue, I figured out the problem is vertical regridding of 3D emissions using instantaneous surface pressure. The specific problem is that the minimum read frequency for HEMCO is 1 hour on the hour, i.e. 0:00, 1:00, etc., but restarted runs which use a sub-hour timestep, e.g. 30 min, begin off the hour, i.e. 0:00 first run (okay), but then 3:30, 6:30, etc. I am able to eliminate differences by changing the timestep to 1 hour. This results in vertical regridding using identical pressures across all runs since no regridding occurs off the hour in any of the runs.

Solutions within HEMCO could include storage of raw file data within HEMCO to be regridded each timestep, or sub-hour reads. Neither are great from a performance point-of-view. Another option, which is simpler but would impact you, is having restarted runs begin on the hour rather than the first timestep after the hour. This is what we do in offline GEOS-Chem and within GEOS and hence we have never had this problem before. Note that all of my testing using CAM-chem not GEOS-Chem. Is there a reason you start the subsequent runs at the timestep after the time corresponding to the restart file (cam.r.*)?

To be more clear about how runs are currently split up in time within CAM, here is a breakdown of my own 30 min timestep runs.

Compset: FCHIST_HCO
Config params: REST_N=3, STOP_N=3, STOP_OPTION=nhours, ATM_NCPL=48

Run A – seven full timesteps (in GEOS-Chem CTM, HEMCO standalone, and GEOS this would only be six full timesteps)
nstep 0 - 00:00. <- HEMCO 3D read/regrid
nstep 1 – 00:30
nstep 2 – 01:00. <- HEMCO 3D read/regrid
nstep 3 – 01:30
nstep 4 – 02:00. <- HEMCO 3D read/regrid
nstep 5 – 02:30
nstep 6 – 03:00 <- HEMCO 3D read/regrid
nstep 7 – 03:30 (partial)

Run B (restarted) – six full timesteps
nstep 7 – 03:30. <- HEMCO 3D read/regrid
nstep 8 – 04:00. <- HEMCO 3D read/regrid
nstep 9 – 04:30
nstep10 – 05:00 <- HEMCO 3D read/regrid
nstep 11 – 05:30
nstep 12 – 06:00 <- HEMCO 3D read/regrid
nstep 13 – 06:30 (partial)

All additional subsequent runs are also six full timesteps. Why does the initial run have seven? Using six would solve the HEMCO problem. It is also more intuitive for a 3 hour run with 30 minute time intervals to execute 6 full timesteps.

@lizziel
Copy link
Collaborator

lizziel commented Nov 14, 2024

Another solution could be to store the last hour's edge pressures in the restart file and use those for the initial regrid upon restart. A cleaner solution, however, is to start runs that are restarted at integer number of hours on the hour rather than offset.

@adamrher
Copy link

That's great news that you tracked down this issue, @lizziel.

@brian-eaton I don't remember why the model integrates for a half time-step past STOP_N. Can you refresh my memory? Thanks!

@lizziel
Copy link
Collaborator

lizziel commented Nov 15, 2024

In case it helps, here is what the execution would look like to fix the problem:

Run A – six full timesteps
nstep 0 - 00:00 <- HEMCO 3D read/regrid
nstep 1 – 00:30
nstep 2 – 01:00 <- HEMCO 3D read/regrid
nstep 3 – 01:30
nstep 4 – 02:00 <- HEMCO 3D read/regrid
nstep 5 – 02:30
nstep 6 – 03:00 (partial)

Run B (restarted) – six full timesteps
nstep 6 – 03:00 <- HEMCO 3D read/regrid
nstep 7 – 03:30
nstep 8 – 04:00 <- HEMCO 3D read/regrid
nstep 9 – 04:30
nstep10 – 05:00 <- HEMCO 3D read/regrid
nstep 11 – 05:30
nstep 12 – 06:00 (partial)

etc.

@brian-eaton
Copy link
Collaborator

The ERP_Ln9_* test does an initial run of 5 steps and then does a 4 step restart. For a model configuration with a 1/2 hour timestep, that means the model is restarting at 9000 seconds (you can see this in the timestamp of the restart file) and so HEMCO is initializing to a different state for timestep 6 than it would have if it were doing a continuous 6 timestep run. This clearly won't pass the restart test.

The fact that CAM's timeloop does some calculations (and writes a confusing message about it to the log file) after the restart file is written and before control returns to the coupler is not relevant to this restart failure. The calculations that are done after the restart file is written are redone during the initialization phase of a restart run (don't ask why:).

CAM has historically required being able to restart from an arbitrary timestep. The benefit of that is mainly for debugging purposes. It sounds like HEMCO could be made to restore its state at a non-hour boundary if it
saved the edge pressures from the correct hour boundary to the restart file as suggested by Lizzie. I think this is the best solution if we want to keep the requirement that CAM be able to restart from an arbitrary timestep.

In the meantime I think we should adjust the restart tests with HEMCO to only restart on an hour boundary. I don't know the cime testing infrastructure well enough to know the best way to do this, but I assume it's not hard.

One thing keep on the radar though is that the following tests seem to randomly fail baseline comparisions, and this is not related to restarts:
SMS_Ld1.f09_f09_mg17.FCHIST_GC.derecho_intel.cam-outfrq1d (Overall: DIFF)
SMS_Lh12.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq3h (Overall: DIFF)

@adamrher
Copy link

It sounds like HEMCO could be made to restore its state at a non-hour boundary if it
saved the edge pressures from the correct hour boundary to the restart file as suggested by Lizzie. I think this is the best solution if we want to keep the requirement that CAM be able to restart from an arbitrary timestep.

From a practical and user-friendly perspective, this seems like a requirement we should keep! We don't need the added hassle of making sure STOP_N lands on the hour for any arbitrary ATM_NCPL, especially if this becomes the default emissions driver in CAM7. Thanks for the clarification Brian.

@cacraigucar
Copy link
Collaborator

@brian-eaton - I would suggest we NOT change the HEMCO test so that it uses a setup which passes. Since HEMCO is targeted to be a default setup in CESM, I want to make sure we remember it is broken until it is fixed. I know it is a pain to have a continuously failing regression test, but in this case, I think it is important to keep it failing until a fix is provided for HEMCO.

@brian-eaton
Copy link
Collaborator

I would argue that having a test that fails for a known reason can mask an unanticipated failure. I think tests that pass are more informative. A way to keep track of a feature we want that doesn't exist is through an open issue.

@lizziel
Copy link
Collaborator

lizziel commented Nov 18, 2024

I support saving the last hour edge pressure to use with HEMCO so that runs are allowed to start/stop off the hour.

Separate from this I am still confused about why my first run completed 7 full timesteps plus 1 partial. In my tests I did not use the ERP in the test suite but instead manually created two cases. One of the tests used STOP_N set to 3 while the other had it set to 6. I ran the one with STOP_N=3 twice, first using initial values second using CONTINUE_RUN=TRUE. Unlike the ERP test I would think this would do an even number of 30min timesteps. Is there a namelist parameter I am not setting correctly?

@brian-eaton
Copy link
Collaborator

@lizziel, if your tests are on derecho and you point me to them I'll take a look.

@lizziel
Copy link
Collaborator

lizziel commented Nov 18, 2024

@brian-eaton, thank you for taking a look. My single (6hr) and double (two 3hr) runs, both 30min and 1hr timesteps, are archived here:

/glade/derecho/scratch/elundgren/case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_double/run_aeic2019_NO_1hr_timestep

I wrote a script to compare the cam.r.* files. Results are below. The first few lines show the files being compared.

Results of 30min timestep test:

 ~/scratch/case.FCHIST_HCO_double $ ./compare_files.sh aeic2019_NO_30min_timestep
../case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_single.cam.r.2010-01-01-10800.nc
../case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_single.cam.r.2010-01-01-21600.nc
run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_double.cam.r.2010-01-01-10800.nc
run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_double.cam.r.2010-01-01-21600.nc
 A total number of    625 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of     17 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file 2.
 A total number of      0 fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL 
  
 A total number of    625 fields were compared
          of which    465 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of     17 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file 2.
 A total number of      0 fields on file 2 were not found on file 1.
  diff_test: the two files seem to be DIFFERENT 

Results of 1hr timestep test:

 ~/scratch/case.FCHIST_HCO_double $ ./compare_files.sh aeic2019_NO_1hr_timestep/
../case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_single.cam.r.2010-01-01-10800.nc
../case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_single.cam.r.2010-01-01-21600.nc
run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_double.cam.r.2010-01-01-10800.nc
run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_double.cam.r.2010-01-01-21600.nc
 A total number of    625 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of     17 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file 2.
 A total number of      0 fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL 
  
 A total number of    625 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of     17 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file 2.
 A total number of      0 fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL 

Two consecutive run logs using 30min timestep, with grep results

The log for the first 30min timestep run:

~/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep $ grep "nstep\|Set HEMCO clock" atm.log.6611587.desched1.241113-103513
 nstep, te        0   0.25981444534833994E+10   0.25981444534833994E+10   0.00000000000000000E+00   0.98517093638948078E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 00:00:00
 nstep, te        1   0.25978798502492533E+10   0.25979003402337537E+10   0.11356759501907610E-02   0.98516505232345444E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 00:30:00
 nstep, te        2   0.25978850270017157E+10   0.25978866429912839E+10   0.89567667871924043E-04   0.98516524618494484E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 01:00:00
 nstep, te        3   0.25978674995376434E+10   0.25978698021196389E+10   0.12762267041154914E-03   0.98516522409252182E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 01:30:00
 nstep, te        4   0.25978178776109223E+10   0.25978193671641116E+10   0.82559805615334378E-04   0.98516539872440655E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 02:00:00
 nstep, te        5   0.25977797488725424E+10   0.25977796423065767E+10  -0.59065092744659974E-05   0.98516603353383412E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 02:30:00
 nstep, te        6   0.25977395788309808E+10   0.25977398829856963E+10   0.16858025724902301E-04   0.98516650287634140E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 03:00:00
 nstep, te        7   0.25977030025441728E+10   0.25977034453752980E+10   0.24544269848086294E-04   0.98516690941469933E+05   0.22552395239472389E+03

The log for the second 30min timestep run:

 ~/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep $ grep "nstep\|Set HEMCO clock" atm.log.6611639.desched1.241113-104502
 nstep, te        7   0.25977030025441728E+10   0.25977034453752980E+10   0.24544269848086294E-04   0.98516690941469933E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 03:30:00
 nstep, te        8   0.25976659331167064E+10   0.25976665821389523E+10   0.35972566028214541E-04   0.98516723320908553E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 04:00:00
 nstep, te        9   0.25976323912652130E+10   0.25976328323850460E+10   0.24449402782272903E-04   0.98516760575487977E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 04:30:00
 nstep, te       10   0.25975979038978596E+10   0.25975985615967813E+10   0.36453453032763774E-04   0.98516789498765429E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 05:00:00
 nstep, te       11   0.25975666961468062E+10   0.25975673227889829E+10   0.34732099360906552E-04   0.98516819002134420E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 05:30:00
 nstep, te       12   0.25975351245179186E+10   0.25975358664776978E+10   0.41123650966508243E-04   0.98516844262055791E+05   0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 06:00:00
 nstep, te       13   0.25975072106202283E+10   0.25975077808493266E+10   0.31605345496482084E-04   0.98516874617956593E+05   0.22552395239472389E+03

A 3rd run (not saved) starts at 6:30, consistent with the second run above. But the first run includes an extra timestep, which throws subsequent runs off to start on the off-hour.

Note that I am getting the time based on HEMCO clock messages. Could the HEMCO clock be incorrect? The HEMCO clock is displayed before the read/regrid. Perhaps it is a half-hour late upon initialization? I am particularly interested to know if nstep 0 in the very first run is a sort of dummy timestep in which case nstep 1 would be 00:00 and indicate the HEMCO clock is indeed off.

@lizziel
Copy link
Collaborator

lizziel commented Nov 18, 2024

I also want to note, in case anyone tries to replicate this, that I am using a modified HEMCO configuration input file. The default is to read these emissions monthly. I updated it to read hourly.

@brian-eaton
Copy link
Collaborator

Hi @lizziel. I've looked at your runs and see that you only had a successful restart when you changed the cam timestep to 1hr. I misunderstood your previous comments to imply that if cam restarted on an hour boundary then the restart would be successful. But that does not appear to be the case. Your run in ~elundgren/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep using a 1/2 hr timestep does a 6 step initial run followed by a 6 step restart run. Your runs aren't producing any history output since the default of monthly average output is used. But if, for example, every timestep were output (nhtfrq=1), then you'd see that the initial 6 step run outputs 6 time samples with the last one ending at hour 3.0. The restart file contains the timestamp 2010-01-01-10800 indicating that the model will restart from hour 3.0. But as you've noticed the restart file does not contain the exact model state corresponding to hour 3.0, but rather contains state that is
part way through the timestep that starts at hour 3.0. In the restart run, the restart file is read and the step from hour 3.0 to hour 3.5 is finished during the initialization phase of that run.

The lines of output in the cam.log file beginning with 'nstep, te' are total energy diagnostics. This output is not issued at the end of a timestep. It is issued after the dycore runs (from the call to check_energy_gmean).

CAM's internal clock is initialized to the initial time for the run and nstep is set to 0. nstep remains 0 during the initialization phase which includes some convective adjustments to the initial state. The first time ModelAdvance is called, nstep is still 0, and the call to wshist will write the state that is stored in the history buffers from any outfld calls made while nstep=0. This output was on by default until fairly recently (cam6_4_141), but is now only done if requested since it is not the result of a model timestep. Rather it is just what convective adjustments have done to the initial conditions. After this CAM's internal clock is updated (by the call to advance_timestep) to the time at the end of the first timestep, and nstep is set to 1. It looks from the log output posted that the HEMCO clock is in synch with CAM's, although I can't tell when it is updated relative to CAM's update.

I think the bottom line here is that since HEMCO's state depends on CAM's state, the way to restart exactly will be to write either HEMCO's state, or the CAM state needed by HEMCO to recompute its state, to the restart file. This is a common thing for CAM's physics parameterizations to do. For example mozart chemistry uses the routines chem_init_restart, chem_write_restart, and chem_read_restart for this purpose.

I hope this helps. I'm happy to clarify further as needed.

@lizziel
Copy link
Collaborator

lizziel commented Nov 21, 2024

I've looked at your runs and see that you only had a successful restart when you changed the cam timestep to 1hr. I misunderstood your previous comments to imply that if cam restarted on an hour boundary then the restart would be successful. But that does not appear to be the case.

Correct. However, I believe that if cam is restarted on an hour boundary it should be successful, since the emissions would be read and regridded in the restarted run in the same way as a continuous run without restart. I am trying to determine why this is not the case to ensure there is not a bug in HEMCO.

HEMCO treats clock time as start of timestep until the timestep is complete. From your response it sounds like CAM time is set to end of timestep when the first timestep (nstep 1) begins, which would conflict with HEMCO's interpretation of time as start of timestep. This would explain why the start time of HEMCO clock in the restart run is 3:30 rather than 3:00. Could you confirm?

@brian-eaton
Copy link
Collaborator

Yes, CAM sets the "current" time to be at the end of the timestep interval. What's confusing is that it doesn't advance the clock at the beginning of a new timestep. Rather, it advances the clock after a completed timestep has been (or could have been) written to the history file. I'm not sure whether or not that distinction is important for this discussion.

During the final call to ModelAdvance in the initial 6 step run, the restart file is written with state representing a partially done step 7, and advance_timestep is called which sets nstep=7 and the current time to
the end of step 7, i.e., 3:30.

During the initialization phase of the restart run, CAM's clock is set to match its state at the end of the previous run. In this example the clock is set to step 7 and the current time is set to 3:30. This is shown in the * Time Manager Configuration * output in the log file. If the HEMCO clock is set to the time at the beginning of the timestep, then it should be set to 3:00 at the beginning of the restart run. Does that make sense?

@lizziel
Copy link
Collaborator

lizziel commented Nov 22, 2024

Yes, this makes sense. I will work on the HEMCO clock to fix this since it is indeed a bug.

@lizziel
Copy link
Collaborator

lizziel commented Nov 22, 2024

Setting the HEMCO clock to the previous time indeed fixes the issue in my 3hr tests when using 30min timesteps. I do not expect the 9s ERP test to pass but would expect a 3hr one to pass in the test suite. Until we have a comprehensive fix would you be okay if we add such a test to verify one passes but the other fails? I will do it locally and can add it to my PR.

I will prepare a PR with my updates. In the meantime, could someone copy the following to inputdata? I do not have permissions to put files there.

Copy:
/glade/u/home/elundgren/config/HEMCO_Config.CC.CEDS_AEIC19.NEx.c241122.rc

To:
/glade/campaign/cesm/cesmdata/cseg/inputdata/atm/cam/geoschem/emis

@brian-eaton
Copy link
Collaborator

I agree that we should add a restart test that works.

I copied the HEMCO_Config file to the inputdata location.

@lizziel
Copy link
Collaborator

lizziel commented Dec 3, 2024

@brian-eaton, could you point me to where the first run of the ERP/ERS tests is set to 5 timesteps? I tried changing the total from 9 timesteps to 8 in the testdefs files but the result is still 5 timesteps for the first leg. Is it somewhere in cime?

@brian-eaton
Copy link
Collaborator

I don't know. I've looked at the cime testing documentation and don't see a way to control this using a standard test modifier. @jedwards4b, do you know, or can you point us in the right direction?

@gold2718
Copy link
Collaborator

gold2718 commented Dec 3, 2024

I think you might have to define your own test type (not unheard of for CAM) because I think the issue is in
<cime>/CIME/SystemTests/system_tests_common.py in the function, _set_restart_interval in these calculations:

        stop_n = int(stop_n * factor // coupling_secs)
        rest_n = math.ceil((stop_n // 2 + 1) * coupling_secs / factor)

@brian-eaton
Copy link
Collaborator

Does a 10 step test result in a 6 step run + a 4 step restart?

@lizziel
Copy link
Collaborator

lizziel commented Dec 3, 2024

Thanks @gold2718! I think we are using different cime versions since mine does not have what you show. However, I found the equivalent code which sets REST_N in CIME/data/config/config_tests.xml.

CIME/data/config/config_tests.xml:543: <REST_N>$STOP_N / 2 + 1</REST_N>

Each of the different test types has the definition of REST_N there. I am doing a test by changing the existing ERS test REST_N. For a solution to submit I will create new test, ERS3 (ERS2 is taken). Do I also need to make an ers3.py file within CIME/SystemTests? I assume I then reference it in testdefs with something like 'ERS3_Ln9'.

@lizziel
Copy link
Collaborator

lizziel commented Dec 3, 2024

@brian-eaton, I initially tested the equivalent with an 8 step test, with hopes it would be 4+4. It was still 5 I believe because REST_N is set as STOP_N / 2 + 1, as @gold2718 found and I did also in a separate location.

@lizziel
Copy link
Collaborator

lizziel commented Dec 3, 2024

Aha, @brian-eaton, I see now what you are saying. 10 / 2 + 1 is 6. I'll try that. Thanks!

@lizziel
Copy link
Collaborator

lizziel commented Dec 4, 2024

Doing a 10 step test indeed worked.

 ~/scratch $ cd ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s.20241204_083007_0fm9b0/
 ~/scratch/ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s.20241204_083007_0fm9b0 $ cat TestStatus
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s CREATE_NEWCASE
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s XML
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SETUP
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SHAREDLIB_BUILD time=248
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s MODEL_BUILD time=409
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SUBMIT
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s RUN time=614
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s COMPARE_base_rest
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s MEMLEAK insufficient data for memleak test
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SHORT_TERM_ARCHIVER

You can expect a PR from me with the fix soon.

@jedwards4b
Copy link

@lizziel Just returning from vacation and trying to catch up on the status of this issue - if I understand correctly the test will pass only if you specify a specific number of timesteps? (10 in this case) - that doesn't seem like an acceptable outcome to me - it should be able to pass regardless of the number of steps.

@brian-eaton
Copy link
Collaborator

Hi Jim. The issue is that HEMCO currently only restarts exactly from an hour boundary. Changing the number of steps to 10 allowed that to happen. We still have the 9 step test that fails to keep it on our radar.

@jedwards4b
Copy link

Got it. Thanks for the clarification.

@lizziel
Copy link
Collaborator

lizziel commented Dec 4, 2024

Please note the fix I came up with solves the issue with ERS tests. ERP still fails which needs further investigation. That test is the same as ERS but changes the tasks/threading counts on restart.

@lizziel
Copy link
Collaborator

lizziel commented Dec 4, 2024

And this also only fixes the issue for CAM-chem using HEMCO. GEOS-Chem will still fail ERS tests because it needs a few additional chemistry state fields beyond concentrations added to the restart file.

@lizziel
Copy link
Collaborator

lizziel commented Jan 16, 2025

Update on ERP tests:

I can get identical results using different NTASKS if I turn off the 3D aviation emissions and the 2D CEDS ship emissions. All other emissions remain on. I am now looking into what could be different about these inputs that would cause task dependency.

@lizziel
Copy link
Collaborator

lizziel commented Jan 16, 2025

Also note there is continued discussion on a long-term fix for the ERS tests at my PR: #1221

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly CoupledEval4 -wish list
Projects
Status: To Do
Development

No branches or pull requests

8 participants