NEON tests sometimes fail because of network issues #2310
Comments
See ESCOMP#2310: NEON tests sometimes fail (ESCOMP#2310)
I've seen a different problem from the one above with these tests. With certain testmods and test names, the filenames for the NEON datm forcing can exceed 256 characters, which is the current limit in datm. For example, for SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO one of the filenames is: /derecho/scratch/erik/tests_alpha-ctsm52mksrf25_ctsm51d174acl/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO.GC.alpha-ctsm52mksrf25_ctsm51d174acl_int/run/inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc which is 269 characters.
I got the case to run by increasing the allowed filename length in CDEPS from shr_kind_cl to shr_kind_cx, which is 512. Another option for NEON would be to use a relative path for the files, i.e. inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc, which is shorter and more readable.
This doesn't address the server issue described in the main text of this issue. However, my guess is that for that you just need to run
in your test directory (possibly a few times) at a time when the server is up and the data can be transferred. |
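To make the filename-length problem in the comment above concrete, here is a minimal Python sketch, not CDEPS code. The constants mirror the shr_kind_cl (256) and shr_kind_cx (512) lengths mentioned in the comment; the helper names `exceeds_limit` and `relative_to_run_dir` are hypothetical, and treating the case's `run` directory as the base for the relative path is an assumption.

```python
from pathlib import PurePosixPath

# Character limits mentioned in the comment above.
SHR_KIND_CL = 256
SHR_KIND_CX = 512


def exceeds_limit(path, limit=SHR_KIND_CL):
    """Return True if the path would not fit in a buffer of `limit` characters."""
    return len(path) > limit


def relative_to_run_dir(path, run_dir):
    """Express `path` relative to the run directory (hypothetical helper)."""
    return str(PurePosixPath(path).relative_to(run_dir))


# Path pieces copied from the comment above.
test_root = "/derecho/scratch/erik/tests_alpha-ctsm52mksrf25_ctsm51d174acl"
test_name = (
    "SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel."
    "clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO"
)
case_name = test_name + ".GC.alpha-ctsm52mksrf25_ctsm51d174acl_int"
run_dir = f"{test_root}/{case_name}/run"
full_path = f"{run_dir}/inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc"

print(exceeds_limit(full_path))               # too long for the 256 limit
print(exceeds_limit(full_path, SHR_KIND_CX))  # fits within the 512 limit
print(relative_to_run_dir(full_path, run_dir))
```

The relative form only helps if datm resolves paths from the run directory, which is assumed here rather than confirmed.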
@ekluzek That seems more related to #2322 than this issue. I'll change the title of this issue to be more specific. Also, for posterity: In some SE meeting, we decided that the fix for this issue would be to stop relying on the NEON servers in these tests. Instead, we'll download the necessary data somewhere and just point to that. |
I also just ran into trouble with this in the Python system tests. I hadn't seen this before, so I'm documenting it here: I also noticed that when this comes up, the tests hang for a long time (relative to normal speed) before failing, so an unusually slow test is an indicator that this problem is occurring. The server issue also tends to persist for several minutes; it can resolve itself 5-10 minutes later, but then recur after a similar interval.
|
I didn't want to open a whole new issue for this BUT... Same on izumi: |
This should change to |
We talked about this a bit this morning, and I added #2942 as the change that would resolve this. For now we are leaving these tests as expected fails in case they do fail because of this problem. If it does happen, the tester can usually wait until the servers are up, then run check_input_data --download in the cases to get the data and resubmit. I will say I haven't been running into problems lately, so perhaps things are more stable than in the past. |
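The "wait until the servers are up, then rerun" workaround described above can be sketched as a small retry loop. This is an illustration under assumptions, not part of CTSM or CIME: `retry` and `download_input_data` are hypothetical helpers, the 5-minute default wait is only a guess based on the outage durations reported in this thread, and it is assumed that `./check_input_data --download` exits non-zero when the download fails.

```python
import subprocess
import time


def retry(action, attempts=4, wait_seconds=300, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            # Give the server time to recover before trying again.
            sleep(wait_seconds)
    return None


def download_input_data(case_dir):
    """Run check_input_data --download in a case directory.

    Assumes a non-zero exit status signals a failed download.
    """
    result = subprocess.run(["./check_input_data", "--download"], cwd=case_dir)
    return result.returncode == 0
```

A tester could then call `retry(lambda: download_input_data(case_dir))` for each affected case directory and resubmit once it returns an attempt number rather than `None`.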
Brief summary of bug
The NEON tests in aux_clm sometimes fail because the network is unreachable.
General bug information
CTSM version you are using: ctsm5.1.dev162
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: NEON tests.
Details of bug
The tests:
@adrifoster and @ekluzek suggest that this may be a result of problems with the NEON server and/or Derecho compute nodes' (in)ability to connect to the outside world.
Important output or errors that show the problem
The failure seems to happen during SHAREDLIB_BUILD or RUN, although sometimes the former stays marked as PEND in TestStatus even after the job has ended; maybe a timeout?