
multiple netcdf_parallel tests fail on hercules #2015

Closed · DeniseWorthen opened this issue Nov 29, 2023 · 122 comments · Fixed by #2044

Labels: bug (Something isn't working)

Comments

@DeniseWorthen
Collaborator

DeniseWorthen commented Nov 29, 2023

Description

The control_wrtGauss_netcdf_parallel_intel test fails on Hercules (intel) with the following error:

2023-11-29 11:42:48.544357 +0000 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-s\
tage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

which results in

 Comparing atmf024.nc ............ALT CHECK......ERROR

To Reproduce:

Attempt to run this test on Hercules, then check the atmf024.nc_nccmp.log file in the run directory for the error message.

Additional context

The issue was first noted in PR #1990.

Output

@BrianCurtis-NOAA
Collaborator

@DeniseWorthen Do the files actually differ, or is NCCMP getting hung up with one of the options we use in the call to NCCMP?

@DeniseWorthen
Collaborator Author

Unknown. It is failing comparison of that single file with "HDF error", as posted.

@junwang-noaa
Collaborator

Can we compare the two files manually using nccmp? I am wondering whether it is a file difference or nccmp itself that causes the issue.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Nov 29, 2023

Transferring the baseline atmf024.nc file and the output from a failed RT case to hera and comparing them with

nccmp -d -S -q -f -B --Attribute=checksum --warn=format

also produces an error:

2023-11-29 15:13:22.378912 +0000 ERROR nccmp_data.c:3677 NetCDF: HDF error

On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:

ncdump atmf024.nc >atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478

I suspect it is the baseline file which is bad.

@BrianCurtis-NOAA
Collaborator

@jkbk2004 Are you using cp or rsync for the RDHPCS machines to copy baselines to the baseline storage area?

@jkbk2004
Collaborator

@jkbk2004 Are you using cp or rsync for the RDHPCS machines to copy baselines to the baseline storage area?

A mix of both. In this case the baseline files are identical to the experiment output, but there is an nccmp issue.

@junwang-noaa
Collaborator

@jkbk2004 Please see the results from Denise.

On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:

ncdump atmf024.nc >atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478

It looks to me like the issue is in the baseline file, not nccmp.

@jkbk2004
Collaborator

I will set up cases to confirm on both Hera and Hercules.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Nov 29, 2023

Just to note: I copied the files from Hercules to Hera, and the report above is from using nccmp and ncdump on Hera. This was in case there was a problem with Hercules' nccmp version.

@junwang-noaa
Collaborator

junwang-noaa commented Nov 29, 2023

Thanks, Denise. That is what we want to confirm: whether the comparison fails because of nccmp or because of a file issue. @jkbk2004 I think you only need to check the baseline on Hercules.

@DusanJovic-NOAA
Collaborator

Which files are we talking about? These two:

/work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_3114426/control_wrtGauss_netcdf_parallel_intel/atmf000.nc

and

/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf000.nc

@jkbk2004
Collaborator

@DusanJovic-NOAA I am checking with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc

@DeniseWorthen
Collaborator Author

@DusanJovic-NOAA The file that fails with the HDF error is atmf024.nc.

@jkbk2004
Collaborator

@zach1221 On Hercules, nccmp compares OK between /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_166684/control_wrtGauss_netcdf_parallel_intel/atmf024.nc and /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf024.nc, but fails with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc.
@DeniseWorthen @junwang-noaa The ncdump conversion to cdl works OK with both the baseline file and the rt_166684 file, but not with rt_175880. It looks like a Hercules system issue.

@zach1221
Collaborator

zach1221 commented Dec 1, 2023

@DeniseWorthen I finally got around to testing your idea regarding disk space. I cleaned out my experiment directories in stmp and re-ran the control_wrtGauss cases again with ecflow. With the space emptied, they pass consistently.

@DeniseWorthen
Collaborator Author

@zach1221 Really interesting, thanks. I wonder why this works!? I did check the file sizes for the atmf files and they weren't that large. Hm.

DeniseWorthen changed the title from "control_wrtGauss_netcdf_parallel fails on hercules" to "multiple netcdf_parallel tests fail on hercules" on Dec 6, 2023
@DeniseWorthen
Collaborator Author

I've updated the title of this Issue. I just ran on Hercules with the CICE PR branch and got multiple failures with "ALT CHECK ERROR" for various tests:

Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
Comparing atmf024.nc ............ALT CHECK......ERROR

Checking test 060 regional_netcdf_parallel_intel results ....
Comparing phyf000.nc ............ALT CHECK......ERROR

Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results
Comparing atmf000.nc ............ALT CHECK......ERROR

Checking test 126 conus13km_debug_qr_intel results ....
Comparing RESTART/20210512.170000.fv_core.res.tile1.nc ............ALT CHECK......ERROR

@DeniseWorthen
Collaborator Author

Running these four tests again gave a pass for the conus13km and the netcdf_parallel_debug tests. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on Hercules.

@DusanJovic-NOAA
Collaborator

Running these four tests again gave a pass for the conus13km and the netcdf_parallel_debug tests. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on Hercules.

In those cases that fail, do you (always) see 'HDF error'?

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Dec 7, 2023

@DusanJovic-NOAA Yes, the failed cases seem to always show

ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-st
age-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

My run directory is

/work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/

@DusanJovic-NOAA
Collaborator

It seems the file /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc is either corrupted, or there's a bug in the HDF5 library, or both.

When I run ncdump I see:

$ ncdump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
ncdump: /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc: NetCDF: HDF error

Same error message we see when we run nccmp. However when I run h5dump, I see:

$ h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

I'm not sure how to interpret this error message: an HDF5 library bug, an MPI/compiler bug, a system/filesystem issue, or something else.

@junwang-noaa
Collaborator

@DeniseWorthen @DusanJovic-NOAA So far I have noticed this error only on Hercules. We have had hdf5 1.14.0 installed on other platforms for a while and I have never heard of this error there. I think we need to report the problem to the Hercules system admins.

@jkbk2004
Collaborator

jkbk2004 commented Dec 7, 2023

@climbfuji @ulmononian FYI: there are some issues with HDF5/nccmp on Hercules, with random failures when writing nc files.

@climbfuji
Collaborator

@ulmononian Can EPIC look into this?

@DeniseWorthen
Collaborator Author

@DusanJovic-NOAA I went back and looked at all the Hercules logs. The first time the control_wrtGauss_netcdf_parallel_intel test has the alt-check error is in your PR #1990

baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA I went back and looked at all the Hercules logs. The first time the control_wrtGauss_netcdf_parallel_intel test has the alt-check error is in your PR #1990

baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel

That's the PR in which we added netcdf quantization. Maybe that is somehow triggering this HDF error, but why only on Hercules?

@DusanJovic-NOAA
Collaborator

@lisa-bengtsson I believe the parallel compression issue is known with hdf5 1.10.6 in the workflow PR you referred to; it happened on several platforms. We have moved to hdf5 1.14.0, which resolved that issue, see https://www.hdfgroup.org/2022/12/release-of-hdf5-1-14-0-newsletter-189/.

The problem here happens only on Hercules; the same tests passed on all other platforms.

@DusanJovic-NOAA is the regional test you ran using lossy compression or lossless compression?

Lossless (deflate).
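
For reference, here is a minimal sketch of what the lossless (deflate-only) and lossy (quantization plus deflate) settings look like in netCDF-Fortran output code. This assumes netcdf-fortran 4.6.0 or later, which provides nf90_def_var_quantize and the nf90_quantize_bitgroom constant; the file, dimension, and variable names are illustrative only, not the write component's actual code.

program compression_sketch
  use netcdf
  implicit none
  integer :: ncid, dimids(2), varid_lossless, varid_lossy, ierr

  ! Create a netCDF-4/HDF5 file (serial here; only the per-variable settings matter).
  ierr = nf90_create("example.nc", ior(NF90_CLOBBER, NF90_NETCDF4), ncid)

  ierr = nf90_def_dim(ncid, "x", 100, dimids(1))
  ierr = nf90_def_dim(ncid, "y",  50, dimids(2))

  ! Lossless: deflate (zlib) only, with shuffle=1, deflate=1, level=1.
  ! The stored bits are preserved exactly.
  ierr = nf90_def_var(ncid, "temp_lossless", NF90_REAL, dimids, varid_lossless)
  ierr = nf90_def_var_deflate(ncid, varid_lossless, 1, 1, 1)

  ! Lossy: quantization (BitGroom, keeping ~3 significant digits) is applied
  ! before deflate, which alters the stored bits to improve the compression ratio.
  ierr = nf90_def_var(ncid, "temp_lossy", NF90_REAL, dimids, varid_lossy)
  ierr = nf90_def_var_quantize(ncid, varid_lossy, nf90_quantize_bitgroom, 3)
  ierr = nf90_def_var_deflate(ncid, varid_lossy, 1, 1, 1)

  ierr = nf90_enddef(ncid)
  ierr = nf90_close(ncid)
end program compression_sketch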

@DusanJovic-NOAA
Collaborator

I'm not sure if it's a long shot, but would it be convenient to try a run where the output files that are being corrupted are written directly to a non-Lustre space, like /tmp or $HOME? EDIT: Just googling "lustre"+"hdf error" I'm seeing a few instances of users running into similar issues, so if it's possible to test writing to a non-Lustre space it could help rule out filesystem issues.

I changed the write routine to create output files in my home directory:

-          ncerr = nf90_create(trim(filename),&
+          ncerr = nf90_create("/home/djovic/hdf_out/"//trim(filename),&

and ran the regional_netcdf_parallel test a couple of times (I think 3 or 4 times). After each run, I ran h5dump on each history output:

$ h5dump dynf000.nc > /dev/null
$ h5dump dynf003.nc > /dev/null
... etc

and all dumps were successful, i.e. no 'HDF error' or 'h5dump error: unable to print data'.

@climbfuji
Collaborator

I think someone needs to bring this up to the sysadmins. Please let me know if anyone wants to do that, otherwise I can take care of it.

@junwang-noaa
Collaborator

junwang-noaa commented Dec 13, 2023

I created a ticket with the Hercules helpdesk.
@DusanJovic-NOAA I want to confirm with you: 1) the model version in your test run under /home is e053209, and 2) the test you are running is regional_netcdf_parallel. Would you please confirm?

@DusanJovic-NOAA
Collaborator

I created a ticket with the Hercules helpdesk. @DusanJovic-NOAA I want to confirm with you: 1) the model version in your test run under /home is e053209, and 2) the test you are running is regional_netcdf_parallel. Would you please confirm?

Yes. For that test, in which the history files were created in my home directory, I used e053209 and ran regional_netcdf_parallel.

@DusanJovic-NOAA
Collaborator

The Hercules sysadmin suggests that we add export I_MPI_EXTRA_FILESYSTEM=ON to the job card.

diff --git a/tests/fv3_conf/fv3_slurm.IN_hercules b/tests/fv3_conf/fv3_slurm.IN_hercules
index 30ea2981..c4853fb5 100644
--- a/tests/fv3_conf/fv3_slurm.IN_hercules
+++ b/tests/fv3_conf/fv3_slurm.IN_hercules
@@ -36,8 +36,10 @@ export OMP_NUM_THREADS=@[THRD]
 export ESMF_RUNTIME_PROFILE=ON
 export ESMF_RUNTIME_PROFILE_OUTPUT="SUMMARY"
 
-# For mvapich2
-if [[ @[RT_COMPILER] == gnu ]]; then
+if [[ @[RT_COMPILER] == intel ]]; then
+  export I_MPI_EXTRA_FILESYSTEM=ON
+elif [[ @[RT_COMPILER] == gnu ]]; then
+  # For mvapich2
   export MV2_SHMEM_COLL_NUM_COMM=128
 fi

With this change I created new baselines for the 4 tests that were frequently failing (control_CubedSphereGrid_parallel, control_wrtGauss_netcdf_parallel, regional_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) and then ran the regression test for these tests against the newly created baselines. All tests passed. I have run the tests twice so far and I'm going to run them a few more times to make sure they consistently pass without nccmp crashing with an HDF error.

@jkbk2004
Collaborator

@DusanJovic-NOAA Thanks for the update. We can follow up in the next PR, #2044.

@jkbk2004
Collaborator

@zach1221 @FernandoAndrade-NOAA FYI

@edwardhartnett
Contributor

It's important that the HDF5 tests are run, and that the netCDF-4 parallel tests are run. If these tests do not pass, then there is a problem with the machine or the install of netCDF. If the HDF5 and netCDF-4 parallel tests pass, then the problem is likely in user code.

Have these tests been run on Hercules? Who did the install there? @AlexanderRichert-NOAA do you know if the netCDF and HDF5 parallel I/O tests were run on Hercules?
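
For context, the kind of operation the netCDF-4 parallel I/O tests (and the model's parallel history writes) exercise looks roughly like the sketch below. This is a self-contained illustration assuming netcdf-fortran built against parallel HDF5; the file and variable names are made up, and it is not the library's actual test code or the UFS write component.

program par_write_sketch
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx_per_rank = 10
  integer :: ierr, rank, nprocs, ncid, dimid, varid
  real :: local(nx_per_rank)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! All ranks create/open the same netCDF-4/HDF5 file for parallel access.
  ierr = nf90_create("par_test.nc", ior(NF90_CLOBBER, NF90_NETCDF4), ncid, &
                     comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)

  ierr = nf90_def_dim(ncid, "x", nx_per_rank * nprocs, dimid)
  ierr = nf90_def_var(ncid, "data", NF90_REAL, (/ dimid /), varid)
  ierr = nf90_enddef(ncid)

  ! Request collective access for this variable (nf90_independent is the
  ! alternative); collective mode is required when compression filters are
  ! used in parallel.
  ierr = nf90_var_par_access(ncid, varid, nf90_collective)

  ! Each rank writes its own contiguous slab of the variable.
  local = real(rank)
  ierr = nf90_put_var(ncid, varid, local, &
                      start=(/ rank*nx_per_rank + 1 /), count=(/ nx_per_rank /))

  ierr = nf90_close(ncid)
  call MPI_Finalize(ierr)
end program par_write_sketch

If a standalone program like this failed intermittently in the Lustre run directories but not under $HOME, that would point at the filesystem/MPI-IO layer rather than the model code, which would be consistent with the I_MPI_EXTRA_FILESYSTEM finding above.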

@climbfuji
Collaborator

The Hercules sysadmin suggests that we add export I_MPI_EXTRA_FILESYSTEM=ON to the job card. [...] With this change I created new baselines for the 4 tests that were frequently failing and ran the regression test against the newly created baselines. All tests passed.

Excellent, thanks! I will add this to the Hercules Intel config in spack-stack, too. So this hopefully solves the Intel problems discussed here. There are still GNU problems discussed elsewhere, correct?


zach1221 pushed a commit that referenced this issue Dec 21, 2023
…nodi + #2047, #2053, and #2056 (#2044)

FV3 diagnostic fixes, CCPP fixes for model crashes, new PR template

- UFS:
    - commit message in PR template (#2053)
    - fix hercules crashes (#2015)
- CMEPS & FV3: Bad data from the CCPP CLM Lake physics scheme caused model crashes
    - Communicate changes to lake ice (Closes #2055, NOAA-EMC/CMEPS#105, NOAA-EMC/fv3atm#741) 
    - unit mismatch (NOAA-EMC/fv3atm#736)
- FV3: correct errors in diagnostic calculations
    - snodi had weasdi data in it (NOAA-EMC/fv3atm#736)
    - revisions to RUC LSM snowfall melting and accumulation (NOAA-EMC/fv3atm#739)
@TingLei-NOAA

@edwardhartnett do you have any suggestions on further investigating this issue? The problem can occur with either nf90_get_var or nf90_put_var (in the context of parallel I/O, i.e. netCDF-4 on top of HDF5).

@climbfuji
Collaborator

I thought this was resolved for Intel with the additional env variable (see the lengthy discussion above). For GNU, please try the alternative spack-stack-1.6.0 [email protected][email protected] installation.

@TingLei-NOAA

Sorry, I asked in the wrong issue. Thanks @climbfuji. The issue I was asking about is NOAA-EMC/GSI#684.

@edwardhartnett
Contributor

edwardhartnett commented Feb 2, 2024 via email
