g-w CI C96C48_hybatmaerosnowDA fails on WCOSS2 #1336
Comments
10/18/2024 update Examine code in Cory found a unidata/netcdf issue reporting illegal characters which appeared to be related to the netcdf version. Beginning to think this may be the issue on WCOSS2. All other platforms build GDASApp with
WCOSS2 uses
Find
on WCOSS2 but attempts to build with these have not yet been successful. Still working through various combinations of module versions to see if we can build GDASApp on WCOSS2 using newer netcdf versions. It would be nice if WCOSS2 had available the same spack-stack used on NOAA RDHPCS machines. |
10/20/2024 update Unable to find combination of hpc-stack modules to successfully build and/or run In the interim modify
FMS2_IO is the default build option. Adding Do the following:
Modified If we are OK with the modified |
WCOSS2 test Install Run g-w CI on Cactus for
with results as follows
The WCDA failure is the same as before
This failure is not related to changes in g-w PR . This failure also occurs when using GDASApp |
spack-stack update Commit
|
@CoryMartin-NOAA , @guillaumevernieres , @danholdaway , @DavidNew-NOAA : Are we OK with the following incremental approach? First,
Second, If we are OK with the items under First, I'll get to work and make it so. |
@RussTreadon-NOAA Fine by me, but FYI NOAA-EMC/global-workflow#2949 will not work on WCOSS when that PR is merged. The FMS2 IO module in FV3-JEDI also includes non-restart read/write capability, which is needed for native grid increments in that PR. Hopefully we sort the FMS2 IO issue out before it goes into review. This PR won't hold that up, because FMS2 IO isn't working anyway on WCOSS. Like I said, just an FYI. |
@DavidNew-NOAA , does your comment
refer to the fact that ... |
I don't have a WCOSS2 We face a decision for WCOSS2 GDASApp builds:
Of course, if we can find a combination of existing WCOSS2 modules that work with FMS2_IO, choices 1 and 2 become moot. Thus far, I have not been able to find this combination. |
My preference, while not ideal, is 2, as we have deadlines coming up relatively soon for aero/snow but not for atm cycling. Do we know for sure it's a library issue? |
Can't say for sure, but I studied the fv3-jedi fms2 code in depth on Thu-Fri with lots of prints added. Nothing jumps out as being wrong. The code as is works fine on Hera, Hercules, and Orion. These machines build GDASApp with newer intel compilers and spack-stack. Hence the hypothesis that the Cactus failures are due to the older intel compiler and/or the hpc-stack modules we load. Once Acorn queues are opened I can run a build of g-w PR #2978 with GDASApp using spack-stack/1.6.0 (same version we use on NOAA RDHPCS) and see if the failing Cactus jobs run OK. |
@RussTreadon-NOAA I assume atmospheric cycling will not work on WCOSS, because g-w PR #2949 will reintroduce FMS (2). Currently atmospheric cycling uses cubed sphere histories to write increments. |
@RussTreadon-NOAA I was able to get the convertstate job to run to completion using Please note that the compile/build step will fail on the IMS Snow Proc linking, not sure why yet, but I think this is progress. |
@CoryMartin-NOAA , this is good news. How did you get the convertstate executable if I traveled down this road earlier and, like you, got a bunch of undefined netcdf references
I found a way to get past this by adding netcdf to the modulefile, but then the executable failed with run time errors. When my build above failed, I didn't find |
My build included gdas.x before it failed; perhaps that is/was the luck of the draw? I'll dig in to see why the IMS code is not linking to netCDF properly. |
@RussTreadon-NOAA try the latest commit to that GDASApp branch. I'm able to get IMS to compile now within the DA-Utils repo. If this works, I can clean it all up. I'll try it from scratch now. |
Clone GDASApp Rewind and reboot 20211220 12Z gdas_aeroanlgenb. All executables in this job run to completion. Rewind and reboot 20211220 18Z gdas_snowanl. The job failed with
This is not surprising. |
@RussTreadon-NOAA this is encouraging. I'm looking into the CMake, I think there's an issue with the static libraries and it is only linking the netCDF fortran (or in bufr-query's case, netCDF-C++), and not the other required dependencies. |
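For readers less familiar with the static linking problem described above, here is a minimal Python sketch of why naming only the top-level library is not enough. It is an illustration only, not the GDASApp or DA-Utils CMake logic. The dependency graph below is an assumption based on how netCDF-Fortran, netCDF-C, HDF5, and zlib are commonly layered; the actual libraries required on Cactus depend on how each module was built, and the function name `static_link_order` is hypothetical.

```python
from typing import Dict, List

# Assumed dependency graph for illustration only. The real requirements
# depend on how each library was configured on the target machine.
DEPS: Dict[str, List[str]] = {
    "netcdff": ["netcdf"],          # netCDF-Fortran calls into netCDF-C
    "netcdf": ["hdf5_hl", "hdf5"],  # netCDF-4 format support uses HDF5
    "hdf5_hl": ["hdf5"],
    "hdf5": ["z"],                  # compression support
    "z": [],
}


def static_link_order(roots: List[str], deps: Dict[str, List[str]]) -> List[str]:
    """Return every library reachable from roots, ordered so that each
    library appears before the libraries it depends on.

    Assumes the graph has no cycles. With static archives, a linker that
    resolves symbols left to right needs this ordering, and it needs the
    transitive dependencies to be listed explicitly.
    """
    visited = set()
    postorder: List[str] = []

    def visit(lib: str) -> None:
        if lib in visited:
            return
        visited.add(lib)
        for dep in deps.get(lib, []):
            visit(dep)
        postorder.append(lib)  # appended only after all of its dependencies

    for root in roots:
        visit(root)
    # Reversing the post-order puts dependents ahead of their dependencies.
    return postorder[::-1]


if __name__ == "__main__":
    order = static_link_order(["netcdff"], DEPS)
    print(" ".join(f"-l{lib}" for lib in order))
```

Linking only the first entry of that list resolves the Fortran interface but leaves the C, HDF5, and zlib symbols undefined, which is consistent with the undefined netcdf references reported earlier in this thread.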
@CoryMartin-NOAA : Do we need to add a missing |
I think all of the above + HDF5 , but I'm looking into seeing if there is a simpler fix. I thought |
Cactus @DavidHuber-NOAA pointed me at a test Given this success, set up g-w CI for four DA configurations
Note: g-w used for these tests is All four g-w CI streams are still running. JEDI-based DA streams have completed first cycle JEDI DA jobs. Of particular interest is the successful completion of
An unexpected bonus is that
This issue will be updated once all four g-w CI DA streams complete. |
Cactus g-w CI
WCDA CI remains active because 20210324 18Z gdas_prep is waiting for |
This issue can be closed once |
When g-w CI C96C48_hybatmaerosnowDA is run using g-w PR #2978, the following jobs abort on WCOSS2 (Cactus):
gdas_aeroanlgenb aborts while executing gdas.x fv3jedi convertstate using chem_convertstate.yaml
gdas_snowanl aborts while executing gdas.x fv3jedi localensembleda using letkfoi.yaml
The error message is the same for both failures.
These jobs successfully run to completion on Hera, Hercules, and Orion. GDASApp is built with newer intel compilers and different modules on these machines. It is not clear whether the older intel/19 compiler or the modulefiles used on Cactus are the issue, or whether there is an actual bug in the JEDI code which needs to be fixed.
This issue is opened to document the WCOSS2 failure and its resolution.