NEON tests sometimes fail because of network issues #2310

Open
samsrabin opened this issue Jan 10, 2024 · 6 comments
Labels
bug (something is working incorrectly), testing (additions or changes to tests)

Comments

@samsrabin
Collaborator

samsrabin commented Jan 10, 2024

Brief summary of bug

The NEON tests in aux_clm sometimes fail because the network is unreachable.

General bug information

CTSM version you are using: ctsm5.1.dev162

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: NEON tests.

Details of bug

The tests:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-default--clm-NEON-NIWO
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-NEON-MOAB--clm-PRISM
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_gnu.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51SpRs.derecho_gnu.clm-default--clm-NEON-TOOL

@adrifoster and @ekluzek suggest that this may be a result of problems with the NEON server and/or Derecho compute nodes' (in)ability to connect to the outside world.

Important output or errors that show the problem

2024-01-08 11:17:01: Test 'SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-default--clm-NEON-NIWO' failed in phase 'SETUP' with exception 'ERROR: Fatal error in case.cmpgen_namelists: 2024-01-08 11:16:56 atm
Create namelist for component datm
   Calling /glade/u/home/samrabin/ctsm_tillage-and-residues4/components/cdeps/datm/cime_config/buildnml
WARNING: No .input_data_list files found in dir 'Buildconf'
Using protocol wget with user None and passwd None
wget failed with output:  and errput --2024-01-08 11:16:58--  https://storage.neonscience.org/neon-ncar/listing.csv
Resolving storage.neonscience.org (storage.neonscience.org)... 34.110.164.243
Connecting to storage.neonscience.org (storage.neonscience.org)|34.110.164.243|:443... failed: Network is unreachable.

ERROR: Could not download NEON data listing file from server'
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/test_scheduler.py", line 1125, in _run_catch_exceptions
    return run(test)
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/test_scheduler.py", line 1016, in _setup_phase
    "Fatal error in case.cmpgen_namelists: {}".format(output),
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)

The failure seems to happen during SHAREDLIB_BUILD or RUN, although sometimes the former stays marked as PEND in TestStatus even after the job has ended (maybe a timeout?).
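
The wget failure in the log above could be anticipated with a quick connectivity probe run on the same node before the tests are launched. What follows is a minimal sketch and not part of CIME or CTSM: the function name `can_reach`, the 10-second timeout, and the injectable `connect` argument are assumptions made for illustration. Only the host name and port 443 come from the log.

```python
import socket


def can_reach(host, port=443, timeout=10.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    `connect` is injectable (hypothetical convenience) so the check can be
    exercised without a network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers "Network is unreachable", DNS failures, and timeouts.
        return False
    conn.close()
    return True
```

Calling `can_reach("storage.neonscience.org")` from a compute node and from a login node would help separate a NEON server outage from a node that simply cannot reach the outside world.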

@samsrabin samsrabin added the testing additions or changes to tests label Jan 10, 2024
samsrabin added a commit to samsrabin/CTSM that referenced this issue Jan 10, 2024
@ekluzek
Collaborator

ekluzek commented Mar 19, 2024

I've seen a different problem with these tests than the one above. With certain testmods and test names, the full path of the datm forcing files for NEON can exceed 256 characters, which is the current limit in datm. For example, for:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO

one of the filenames is:

/derecho/scratch/erik/tests_alpha-ctsm52mksrf25_ctsm51d174acl/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO.GC.alpha-ctsm52mksrf25_ctsm51d174acl_int/run/inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc

which is 269 characters. I got the case to run by increasing the allowed filename length in CDEPS from shr_kind_cl to shr_kind_cx, which is 512. Another way to handle it for NEON would be to use a relative path for the files, such as:

inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc

which is shorter and more readable.
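
The length problem and the relative-path workaround can be checked with a few lines of Python. This is only a sketch: the helper `exceeds_limit` is hypothetical, and it assumes that datm would resolve a relative path against the case's run directory, which the comment above suggests but does not confirm. The path strings and the 256-character limit are taken from the comment.

```python
import posixpath

DATM_FILENAME_LIMIT = 256  # limit quoted in the comment above

run_dir = (
    "/derecho/scratch/erik/tests_alpha-ctsm52mksrf25_ctsm51d174acl/"
    "SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel."
    "clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO."
    "GC.alpha-ctsm52mksrf25_ctsm51d174acl_int/run"
)
absolute = posixpath.join(run_dir, "inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc")


def exceeds_limit(path, limit=DATM_FILENAME_LIMIT):
    """Return True if `path` has more characters than `limit`."""
    return len(path) > limit


# Path of the forcing file relative to the run directory (assumed base).
relative = posixpath.relpath(absolute, start=run_dir)

for label, path in (("absolute", absolute), ("relative", relative)):
    print(f"{label}: {len(path)} characters, too long: {exceeds_limit(path)}")
```

A check like this could run when the stream file list is generated, so that an over-long path is reported before datm fails on it.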

This doesn't address the server issue described at the top of this issue. My guess is that for that one you just need to run

./check_input_data --download

in your test directory (possibly a few times) at a time when the server is up and the data can be transferred.
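
Running the command "possibly a few times" can be automated with a small retry loop. The sketch below is an illustration only: the helper names, the five attempts, and the 300-second wait are arbitrary assumptions, and it also assumes that `./check_input_data --download` exits with a non-zero status when the download fails.

```python
import subprocess
import time


def run_until_success(run_once, attempts=5, wait_seconds=300, sleep=time.sleep):
    """Call `run_once()` until it returns True or `attempts` is used up.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if run_once():
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # give the server time to come back
    return None


def download_inputs(case_dir):
    """Run ./check_input_data --download in `case_dir`; True on exit status 0.

    Assumption: a failed download produces a non-zero exit status.
    """
    result = subprocess.run(["./check_input_data", "--download"], cwd=case_dir)
    return result.returncode == 0
```

From a test directory this could be used as `run_until_success(lambda: download_inputs("."))`, which returns `None` if the server never came back within the allotted attempts.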

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 19, 2024
@samsrabin
Collaborator Author

@ekluzek That seems more related to #2322 than this issue. I'll change the title of this issue to be more specific.

Also, for posterity: In some SE meeting, we decided that the fix for this issue would be to stop relying on the NEON servers in these tests. Instead, we'll download the necessary data somewhere and just point to that.
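
If the tests point at a local copy of the data, it would be worth confirming that the copy is complete before any test is submitted, so a missing file is reported up front instead of triggering a download. The following is a rough sketch under that assumption; the function name `missing_inputs` and the idea of passing in a list of required relative paths are hypothetical and not taken from CTSM or CIME.

```python
from pathlib import Path


def missing_inputs(relative_paths, local_root):
    """Return the entries of `relative_paths` that are not files under `local_root`."""
    root = Path(local_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]
```

An empty list means every required file is present locally and the tests never need to contact the NEON server.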

@samsrabin samsrabin changed the title NEON tests sometimes fail NEON tests sometimes fail because of network issues Mar 19, 2024
@samsrabin samsrabin removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 21, 2024
@ekluzek
Collaborator

ekluzek commented Mar 29, 2024

I also just ran into trouble with this in the Python system tests. I hadn't seen this before, so I'm documenting it here.

I also noticed that when this comes up, the tests hang for a long time (relative to their normal speed) before failing, so a test taking unusually long is an indicator of this problem. The server issue tends to persist for several minutes, can fix itself 5-10 minutes later, and can then come back after a similar interval. The output was:

(ctsm_pylib) ctsm5.1.dev175/python> ./run_ctsm_py_tests --sys
................
Inactive Modules:
  1) hdf5/1.12.2     2) intel/2023.0.0     3) ncarcompilers/1.0.0     4) netcdf/4.9.2

Due to MODULEPATH changes, the following have been reloaded:
  1) conda/latest     2) craype/2.7.20

The following have been reloaded with a version change:
  1) cdo/2.1.1 => cdo/2.3.0     2) ncarenv/23.06 => ncarenv/23.09     3) nco/5.1.4 => nco/5.1.9     4) ncview/2.1.8 => ncview/2.1.9

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) cesmdev/1.0   2) ncarenv/23.09
Done converting /glade/derecho/scratch/erik/tmp/tmpug76mhug/scrip.nc
...E
Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
Time required to building the base case: 397.0775320529938 s.
using this version: latest
---- cloning the base case in /glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient
Model datm missing file file1 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-01.nc'
Model datm missing file file2 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-02.nc'
Model datm missing file file3 = '
.
.
.
Model datm missing file file68 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2023-08.nc'
Model datm missing file file69 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2023-09.nc'
Model ctsm missing file finidat = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/lnd/ctsm/initdata/BART.2022-11-11.clm2.r.0418-01-01-00000.nc'

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/test/test_sys_run_neon.py", line 57, in test_one_site
    main("")
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/run_neon.py", line 241, in main
    experiment,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/neon_site.py", line 103, in run_case
    base_case_root, run_type, prism, run_length, user_version, tower_type, user_mods_dirs
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/tower_site.py", line 416, in run_case
    case.submit(no_batch=no_batch)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 277, in submit
    is_batch=is_batch,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 2480, in run_and_log_case_status
    rv = func()
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 270, in <lambda>
    dryrun=dryrun,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 163, in _submit
    case.check_case(skip_pnl=skip_pnl, chksum=chksum)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 358, in check_case
    "Build complete is not True please rebuild the model by calling case.build",
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 176, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: Build complete is not True please rebuild the model by calling case.build

Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
Time required to building the base case: 397.0775320529938 s.
using this version: latest
---- cloning the base case in /glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient
Model datm missing file file1 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-01.nc'
Model datm missing file file2 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-02.nc'
Model datm missing file file3 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-03.nc'
.
.
.
Model ctsm missing file finidat = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/lnd/ctsm/initdata/BART.2022-11-11.clm2.r.0418-01-01-00000.nc'

----------------------------------------------------------------------
Ran 20 tests in 457.158s

FAILED (errors=1)

@slevis-lmwg
Contributor

slevis-lmwg commented Sep 24, 2024

I didn't want to open a whole new issue for this, but...
In #2500 this test changed from FAIL (expected) to PEND in the SHAREDLIB_BUILD phase:
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-NEON-MOAB--clm-PRISM
with this error: CLMBuildNamelist::add_default() : No default value found for fsurdat.

Same on izumi:
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-NEON-MOAB--clm-PRISM

@samsrabin
Collaborator Author

This should change to SETUP FAIL once we bring in cime6.1.27 or later; see ESMCI/cime#4681.

@ekluzek
Collaborator

ekluzek commented Jan 23, 2025

We talked about this a bit this morning, and I added #2942 as the change that would resolve this. For now we are leaving these tests as expected fails in case they do fail because of this problem. If that happens, the tester can usually wait until the servers are up, then run check_input_data --download in the affected cases to get the data, and resubmit.

I will say I haven't been running into this problem lately, so perhaps things are more stable than in the past.

3 participants