Test spack-stack provided on Orion #1310
Opened JCSDA/spack-stack#471 to get the following modules added to the
Also need gempak, but it's not currently within spack-stack. Will load it separately for now and look at submitting a spack-stack issue to get it added.
Status summary: I am able to run most of the system using the provided spack-stack install on Orion; however, there are a few stumbling blocks to resolve:
Working with @climbfuji and @ulmononian I was able to get past hanging python scripts by switching to a different spack-stack environment put together by @ulmononian:
This helped with issue 1. Issue 2 now occurs (once able to get to that step). Last reply to email thread:
This is where I am currently stuck.
Status summary: I was able to get past the hanging python utilities by using a different spack-stack install made by @ulmononian (without MPI):
The python in the analcalc jobs completed without hanging...however, the
Log (starting line 1075): /work/noaa/stmp/kfriedma/comrot/spackcyc192/logs/2022010200/gfsanalcalc.log
Word from @GeorgeVandenberghe-NOAA is that the GSI fails with
Currently stuck here...need to retest or rebuild the GSI with
Test on Orion:
Note: the clone is now several months old and does not have all GFSv17-dev updates since then, particularly the COM reorg updates that went in this week. Likely need to redo the spack-stack updates in all components in a new clone using global-workflow
@KateFriedman-NOAA for what it's worth, looking at your UFS build logs, it looks like it's still using hpc-stack's parallelio somehow. The one in spack-stack is a shared library, so I think your UFS CMakeLists.txt and CMakeModules will need to be updated (see the current UFS develop branch; basically, use a more recent commit of CMakeModules, and remove the STATIC from the PIO line in ufs_model.fd/CMakeLists.txt).
Ah, yes, let me double-check this; I now remember having an issue with the UFS build. Let me get back to you on this...
Remembering the good old days when we just integrated or numerically solved partial differential equations. :-)
One concern I have with the shared library is that we can't run an executable on different platforms, e.g. run an executable compiled on wcoss2 on gaea C5, or a hera executable on Orion (we recently confirmed the baselines on hera/orion reproduce).
@junwang-noaa have you done the former (build on WCOSS2->run on C5) with executables built with hpc-stack? With WCOSS2, I think there are some inevitable shared dependencies because of Cray, but I haven't tried it (I guess those same libraries exist on C5, just different versions potentially, like cray-mpich).
Okie dokie...so I had changed
...but I remember trying to make that change and having issues, so, like you said, using a more recent version would help. I will see if I can find time to try that; I'd probably want to update to newer versions of all components and remake the module/stack changes too. Will see if I can find time to do that before I go on leave this month. I have a bunch of higher-priority tasks to complete first, though. Thanks!
@KateFriedman-NOAA looking at your gfsanalcalc.log, I just noticed you're getting an issue that I was running into: It looks like this variable (I_MPI_PMI_LIBRARY) gets set by the impi/2022.1.2 module and the value looks correct, so I don't know why it's not getting propagated...
By setting I_MPI_PMI_LIBRARY, I'm able to get to the point where chgres_inc.x just hangs indefinitely. Switching to hdf5/1.12.2 at runtime does not fix it, but I have not yet tried building with it. I have also tried building and running the executable using intel 2022 to avoid the mismatch between modules (it was building with old intel but running with new intel), but no luck. It appears that none of the processes are making it through the
No luck with [email protected], nor with [email protected]... |
Can you try building with hdf5 1.12.2 and the default netcdf combination in the spack-stack UE?
Yeah, I tried building and running that executable with spack-stack-1.3.1 ([email protected]/[email protected]/[email protected]) and it was still hanging in the same place.
Does the executable built with the older hpc-stack stuff run to completion, or does that also hang?
I just tested that: it does run to completion building with the existing hpc-stack libraries, so it's not a matter of runtime misconfiguration of MPI settings, etc. I'm building my own stack with intel 2018 and I'll just keep changing things until it works (probably starting with the hdf5 version)... For my own memory: I've now tried hdf5 builds with and without thread safety enabled, so that doesn't seem to be the issue.
I built a spack-stack-based stack with [email protected] and, for better or for worse, that fixed it. I'm going to see if I can narrow down the version where it breaks. It seems odd that it would break in the context of a high-level netcdf function (n*_get_var), so I wonder if it has to do with how it's being used in the context of MPI.
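For context, the kind of usage in question looks roughly like the sketch below. This is a hypothetical minimal example, not the actual interp_inc.fd code: the file name, variable name, array sizes, and the division of labor between ranks are all made up. The only point it illustrates is a file opened for parallel access by the whole communicator while rank 0 sits out the reads, which is the pattern that appeared to hang under the newer hdf5 releases.

```fortran
! Hypothetical sketch, NOT the real interp_inc.fd code. A file is opened for
! parallel access on MPI_COMM_WORLD, but rank 0 never calls nf90_get_var; this
! is the kind of usage that appeared to hang with newer hdf5 releases.
program par_read_sketch
  use mpi
  use netcdf
  implicit none
  integer :: ierr, mype, npes, ncid, varid
  real    :: work(100)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mype, ierr)
  call mpi_comm_size(mpi_comm_world, npes, ierr)

  ! All ranks open the (made-up) file for parallel access.
  call check( nf90_open('increment.nc', ior(nf90_nowrite, nf90_mpiio), ncid, &
                        comm=mpi_comm_world, info=mpi_info_null) )
  call check( nf90_inq_varid(ncid, 'delp_inc', varid) )

  if (mype /= 0) then
     ! Only the non-root ranks read a slice; rank 0 is reserved for other work.
     call check( nf90_get_var(ncid, varid, work, start=(/1, mype/), count=(/100, 1/)) )
  end if

  call check( nf90_close(ncid) )
  call mpi_finalize(ierr)

contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      call mpi_abort(mpi_comm_world, 1, ierr)
    end if
  end subroutine check
end program par_read_sketch
```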
I ran into GSI hang issues in the benchmark package with hdf5/1.14.0, and the GSI people have not run it down, so I had to give up and provide and build a separate set of hdf5/1.10.6 dependencies for the GSI. To avoid dependency clashes (mistakes happen) I also separated the benchmarks, so they're now GSI and... everything else. Running this down and backtracking delayed the benchmark release by ten days.
A separate set of issues occurred with netcdf-fortran/4.6.0 in the GFDL MOM6 ocean model coupled with UFS. This appears to be a MOM6 problem, but it is masked by using netcdf-fortran/4.5.3, so I backtracked and used that.
Here's a summary of my findings on Orion in terms of hdf5 versions and interp_inc.x:
So whatever the issue is, it appears to have emerged in the 1.10.8->1.10.9 transition. I'll see if I can pin down the culprit.
To document this from an email thread with @climbfuji:
Before I go digging deeper into the hdf5 code, @KateFriedman-NOAA, if we had an intel 18-based stack with [email protected], would that at least take care of things in the short-to-mid term, especially in terms of transitioning to spack-stack?
For supporting the GSI moving to spack-stack, yes, probably. We would need to decide if and how we mix intel versions across the GFS...meaning, GSI builds with 2018 but all other pieces build with 2022, and the whole system (from the workflow level) runs with 2022 loaded. Our move to the EPIC stack plans to move everything to 2022, with the GSI being a question right now as well.
Is there any path forward for the GSI to address its issues with intel/2022.x and with HDF5/1.12 and 1.14?
Okay, I at least have a fix for the gfsanalcalc/interp_inc.x issue. For whatever reason, as of [email protected], certain parallel operations require the involvement of the root process (MPI_RANK==0), and if it's not available for work, it hangs. So my workaround is to make [MPI world size]-1 the root process (replace mype == 0 with mype == npes-1 and so on in netcdf_io/interp_inc.fd/driver.f90), such that the parallel work is done on ranks 0 through 8 and rank 9 is the "root" MPI process. See /work/noaa/nems/arichert/develop_spack/sorc/gsi_utils.fd/src/netcdf_io/interp_inc.fd/driver.f90_WORKING. I'll submit a bug report to HDF5 and see what they say, and I'll cross-reference the issue here. @GeorgeVandenberghe-NOAA any chance this could explain other issues you've encountered? And if so, is this kind of workaround viable as far as level of effort required?
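For reference, a stripped-down sketch of the kind of rank reassignment described above. This is illustrative only and not the actual netcdf_io/interp_inc.fd/driver.f90; the subroutine name and the task bookkeeping are invented.

```fortran
! Hypothetical sketch of the workaround described above, NOT the actual
! netcdf_io/interp_inc.fd/driver.f90. The idea: instead of treating rank 0 as
! the idle "root" (which, under the newer hdf5, leaves the parallel reads
! hanging), make the highest rank the root so that ranks 0..npes-2 do the
! parallel work and rank 0 therefore participates in the reads.
subroutine assign_roles(mype, npes, is_root, my_task)
  implicit none
  integer, intent(in)  :: mype, npes   ! this rank and the MPI world size
  logical, intent(out) :: is_root      ! does this rank play the "root" role?
  integer, intent(out) :: my_task      ! which variable/slice this rank handles

  ! Original layout (hangs in this code path with newer hdf5):
  !   is_root = (mype == 0)            ! rank 0 idle root; ranks 1..npes-1 work
  ! Workaround layout (e.g. npes = 10: ranks 0-8 work, rank 9 is the root):
  is_root = (mype == npes - 1)

  if (is_root) then
     my_task = -1                      ! root only coordinates/collects
  else
     my_task = mype + 1                ! workers take tasks 1..npes-1
  end if
end subroutine assign_roles
```

The only point that matters here is that rank 0 ends up among the ranks that actually issue the parallel reads, while the bookkeeping role moves to the highest rank.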
I will have to think about this one, but my first thought is that MPI-IO operations are collective, and all ranks in a collective must complete or the collective hangs: standard behavior. If so, this is an application design flaw or error.
Being continued in issue #1868. Closing.
Description
This issue documents work to test an install of spack-stack on Orion provided by @climbfuji and others. Details provided by Dom:
Issues encountered during testing and requests to add modules will be submitted via the spack-stack repo: https://github.com/NOAA-EMC/spack-stack
Primary relevant spack-stack issue: JCSDA/spack-stack#454