multiple netcdf_parallel tests fail on hercules #2015
Comments
@DeniseWorthen Do the files actually differ, or is nccmp getting hung up on one of the options we use in the call to nccmp?
Unknown. It is failing comparison of that single file with "HDF error", as posted.
Can we compare the two files manually using nccmp? I am wondering whether it is a file difference or nccmp itself that causes the issue.
Transferring the baseline atmf024.nc file and the output from a failed RT case to hera and comparing them with nccmp also produces an error:
On Hera, trying to simply dump each of these files to a cdl file (ncdump atmf024.nc > atmf024.cdl) produces an error for the baseline file, but not for the RT test file. I suspect it is the baseline file that is bad.
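For anyone repeating this kind of check, a minimal manual comparison might look like the sketch below; the paths are placeholders and the nccmp options are illustrative, not necessarily the exact flags the RT harness passes.

```bash
# Placeholder paths: substitute the actual baseline and run-directory files.
baseline=$BASELINE_DIR/atmf024.nc
testfile=$RUN_DIR/atmf024.nc

# Compare data values; -d compares data, -f keeps comparing after a
# difference is found (illustrative option set).
nccmp -d -f "$baseline" "$testfile"

# Dump each file to CDL; a corrupted file typically fails here with "HDF error".
ncdump "$baseline" > baseline_atmf024.cdl
ncdump "$testfile" > test_atmf024.cdl
```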
@jkbk2004 Are you using
Mixed use. In this case the files are identical in the experiment, but nccmp has an issue.
@jkbk2004 Please see the results from Denise.
It looks to me like the issue is in the baseline file, not nccmp.
I will set up cases to confirm on both hera and hercules.
Just to note--I copied the files from hercules to hera and the report above is for using nccmp or ncdump on Hera. This was in case there was a problem w/ Hercules' nccmp version.
Thanks, Denise. That is what we want to confirm: whether the comparison fails because of nccmp or because of a problem with the file. @jkbk2004 I think you only need to check the baseline on Hercules.
Which files are we talking about? These two: /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_3114426/control_wrtGauss_netcdf_parallel_intel/atmf000.nc and /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
@DusanJovic-NOAA I am checking with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc
@DusanJovic-NOAA The file that fails w/ the hdf error is the atmf024.nc file.
@zach1221 nccmp on hercules compares ok: /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_166684/control_wrtGauss_netcdf_parallel_intel/atmf024.nc and /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf024.nc
@DeniseWorthen I finally got around to testing your idea regarding disk space. I cleaned out my experiment directories in stmp and re-ran the control_wrtGauss cases again with ecflow. With an emptied space they pass consistently.
@zach1221 Really interesting, thanks. I wonder why this works!? I did check the file sizes for the atmf files and they weren't that large. Hm.
I've updated the title of this Issue. I just ran on Hercules with the CICE PR branch and got multiple failures with "ALT CHECK ERROR" for various tests:
Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
Checking test 060 regional_netcdf_parallel_intel results ....
Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results
Checking test 126 conus13km_debug_qr_intel results ....
Running these four tests again gave a pass for the conus13 and the netcdf_parallel_debug. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on hercules.
In those cases that fail, do you (always) see 'HDF error'?
@DusanJovic-NOAA Yes, the failed cases seem to always show
My run directory is
It seems the file /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc is either corrupted, or there's a bug in the HDF5 library, or both. When I run ncdump I see:
Same error message we see when we run nccmp. However, when I run h5dump, I see:
I'm not sure how to interpret this error message: an HDF5 library bug, an MPI/compiler bug, a system/filesystem issue, or something else.
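As an illustration of that kind of triage, one way to separate metadata problems from data-read failures is to dump the header and the full contents separately (the file name below is a placeholder):

```bash
# Header (metadata) only; this often succeeds even when data blocks are unreadable.
h5dump -H suspect_atmf000.nc > /dev/null && echo "header OK"

# A full dump also reads every dataset; corruption usually surfaces here.
h5dump suspect_atmf000.nc > /dev/null || echo "data read failed"
```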
@DeniseWorthen @DusanJovic-NOAA So far I have noticed this error only on Hercules. We have had hdf5 1.14.0 installed on other platforms for a while and I have never seen this error. I think we need to report the problem to the Hercules system admins.
@climbfuji @ulmononian FYI: some issues with HDF/nccmp on hercules. Random failures when it writes nc files.
@ulmononian Can EPIC look into this?
@DusanJovic-NOAA I went back and looked at all the hercules logs. The first time that the control_wrtGauss_netcdf_parallel_intel has the alt-check error is in your PR #1990.
That's the PR in which we added netcdf quantization. Maybe that is somehow triggering this HDF error, but why only on Hercules?
Lossless (deflate).
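For context, the per-variable deflate and quantization settings can be checked directly in a history file; the sketch below uses ncdump's special-attributes flag, and the exact _Quantize* attribute names depend on the netcdf-c version.

```bash
# -h: header only; -s: also show special attributes such as _DeflateLevel,
# _Shuffle and _Storage. Quantized variables additionally carry _Quantize*
# attributes (names vary with netcdf-c version).
ncdump -h -s atmf024.nc | grep -E '_DeflateLevel|_Shuffle|_Quantize'
```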
I changed the write routine to create output files in my home directory:
and ran the regional_netcdf_parallel test a couple of times (I think 3 or 4 times). After each run, I ran h5dump on each history output:
and all dumps were successful, i.e. no 'HDF error' or 'h5dump error: unable to print data'.
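A batch version of that check, assuming the history files land in a single directory (the path and file pattern are placeholders), might be:

```bash
# Run h5dump over every history file and report any that cannot be read.
for f in "$HOME"/rt_out/*.nc; do
  h5dump "$f" > /dev/null || echo "h5dump failed: $f"
done
```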
I think someone needs to bring this up to the sysadmins. Please let me know if anyone wants to do that, otherwise I can take care of it.
I created a ticket with the Hercules helpdesk.
Yes. For that test in which the history files were created in my home directory I used e053209 and I ran regional_netcdf_parallel.
Hercules sysadmin suggests that we add the following:
diff --git a/tests/fv3_conf/fv3_slurm.IN_hercules b/tests/fv3_conf/fv3_slurm.IN_hercules
index 30ea2981..c4853fb5 100644
--- a/tests/fv3_conf/fv3_slurm.IN_hercules
+++ b/tests/fv3_conf/fv3_slurm.IN_hercules
@@ -36,8 +36,10 @@ export OMP_NUM_THREADS=@[THRD]
 export ESMF_RUNTIME_PROFILE=ON
 export ESMF_RUNTIME_PROFILE_OUTPUT="SUMMARY"
-# For mvapich2
-if [[ @[RT_COMPILER] == gnu ]]; then
+if [[ @[RT_COMPILER] == intel ]]; then
+  export I_MPI_EXTRA_FILESYSTEM=ON
+elif [[ @[RT_COMPILER] == gnu ]]; then
+  # For mvapich2
   export MV2_SHMEM_COLL_NUM_COMM=128
 fi
With this change I created new baselines for the 4 tests that were frequently failing (control_CubedSphereGrid_parallel, control_wrtGauss_netcdf_parallel, regional_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) and then ran the regression test for these tests against the newly created baselines. All tests passed. I have run the test 2 times so far and I'm going to run it a few more times to make sure it consistently passes without crashing nccmp with an HDF error.
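A rough sketch of that repeat-run check, assuming the usual rt.sh workflow and log layout (the configuration file name, run-directory path, and log pattern are placeholders), could look like:

```bash
# Rerun the selected regression tests several times and scan the per-test
# nccmp logs for the intermittent HDF error.
for i in 1 2 3 4 5; do
  ./rt.sh -l my_tests.conf
  grep -l "HDF error" "$RUN_DIR"/*/*_nccmp.log && echo "iteration $i: HDF error seen"
done
```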
@DusanJovic-NOAA thanks for the update. We can follow up with the next PR, #2044.
It's important that the HDF5 tests are run, and that the netCDF-4 parallel tests are run. If these tests do not pass, then there is a problem with the machine or the install of netCDF. If the HDF5 and netCDF-4 parallel tests pass, then the problem is likely in user code. Have these tests been run on hercules? Who did the install there? @AlexanderRichert-NOAA do you know if the netCDF and HDF5 parallel I/O tests were run on hercules?
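For reference, those library-level checks are the ones run from the HDF5 and netcdf-c source trees; a rough outline, using the configure options named below in this thread, with MPI compiler wrappers assumed available and placeholder source-directory names, is:

```bash
# HDF5: build with parallel I/O support and run its test suite.
cd hdf5-src
CC=mpicc ./configure --enable-parallel
make -j && make check

# netcdf-c: enable the parallel I/O tests and run them.
cd ../netcdf-c-src
CC=mpicc ./configure --enable-parallel-tests
make -j && make check
```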
Excellent, thanks! I will add this to the hercules intel config in spack-stack, too. So this hopefully solves the Intel problems discussed here. There are still gnu problems discussed elsewhere, correct?
…nodi + #2047, #2053, and #2056 (#2044): FV3 diagnostic fixes, CCPP fixes for model crashes, new PR template
- UFS:
  - commit message in PR template (#2053)
  - fix hercules crashes (#2015)
- CMEPS & FV3: bad data from the CCPP CLM Lake physics scheme caused model crashes
  - communicate changes to lake ice (Closes #2055, NOAA-EMC/CMEPS#105, NOAA-EMC/fv3atm#741)
  - unit mismatch (NOAA-EMC/fv3atm#736)
- FV3: correct errors in diagnostic calculations
  - snodi had weasdi data in it (NOAA-EMC/fv3atm#736)
  - revisions to RUC LSM snowfall melting and accumulation (NOAA-EMC/fv3atm#739)
@edwardhartnett do you have any suggestions for further investigating this issue? The problem could occur for either nf90_get_var or nf90_put_var, in the context of parallel I/O (netCDF-4 on top of HDF5).
I thought this was resolved for Intel with the additional env variable (see the lengthy discussion above). For GNU, please try the alternative spack-stack-1.6.0 [email protected][email protected] installation.
Sorry, I asked in the wrong issue. Thanks @climbfuji . The issue I asked about is NOAA-EMC/GSI#684.
Did HDF5 pass tests with --enable-parallel?
Did netCDF pass tests with --enable-parallel-tests?
Ed
Description
The control_wrtGauss_netcdf_parallel_intel fails on hercules (intel) with the following error:
which results in
To Reproduce:
Attempt to run this test on Hercules, then check the atmf024.nc_nccmp.log file in the run directory for the error message.
Additional context
The issue was first noted in PR #1990.
Output