Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in w3iors use NSEA instead of NSEAL in serial write/read of VA #954

Merged

Conversation

JessicaMeixner-NOAA
Copy link
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA commented Mar 21, 2023

Pull Request Summary

Change NSEAL-> NSEA in w3iors in serial section of the code for writing/reading VA

Description

In w3iors in the serial section, NSEAL is used and while this "works" NSEAL is a variable defined for MPI and if you use w3iors in a non ww3_shel or ww3_multi program and this is not set, it could break the program.

Please also include the following information:

Issue(s) addressed

fixes #940

Commit Message

in w3iors use NSEA instead of NSEAL in serial write/read of VA

Check list

Testing

  • How were these changes tested? with matrix plus one extra serial restart test
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) yes
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? intel, hera
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.) No expected changes in answers
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)
ww3_ta1/./work_UPD0F_U                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_a                     (12 files differ)
ww3_tp2.17/./work_c                     (12 files differ)
ww3_tp2.17/./work_b                     (12 files differ)
ww3_tp2.6/./work_ST0                     (1 files differ)
ww3_tp2.6/./work_ST4                     (1 files differ)
ww3_tp2.6/./work_pdlib                     (1 files differ)
ww3_ts4/./work_ug_MPI                     (1 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

@JessicaMeixner-NOAA JessicaMeixner-NOAA added the bug Something isn't working label Mar 21, 2023
@JessicaMeixner-NOAA
Copy link
Collaborator Author

@MatthewMasarik-NOAA the regtests were run with an earlier develop. I'm leaving that as is for now unless you want me to run additional tests in addition to the ones you've made. If someone else wants to do a check to insure I haven't missed other NSEAL in w3iors that's probably a good idea too.

@JessicaMeixner-NOAA
Copy link
Collaborator Author

Thanks to @MatthewMasarik-NOAA who caught an error I did not:

There is an error when running ww3_shel for:
Running now options: run_test -b slurm -o all -S -T -s MPI -s NO_PDLIB -w work_a -g a -f -p srun -n 24 ../model ww3_tp2.17

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           0000000000675A1A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B0DDE84E630  Unknown               Unknown  Unknown
ww3_shel           000000000057C5F2  w3iorsmd_mp_w3ior         636  w3iorsmd.F90
ww3_shel           00000000005D1D47  w3wavemd_mp_w3wav        2610  w3wavemd.F90
ww3_shel           0000000000410083  MAIN__                   2634  ww3_shel.F90
ww3_shel           00000000004069E2  Unknown               Unknown  Unknown
libc-2.17.so       00002B0DDEA7D555  __libc_start_main     Unknown  Unknown
ww3_shel           00000000004068E9  Unknown               Unknown  Unknown

I'm going to mark this as a draft for now until I can figure out why this is failing. In the meantime I'll move forward with the gint update that sets NSEAL=NSEA.

FYI @ukmo-ccbunney since I know you're running regtests.

@JessicaMeixner-NOAA JessicaMeixner-NOAA marked this pull request as draft March 22, 2023 19:39
@ukmo-ccbunney
Copy link
Collaborator

Thanks to @MatthewMasarik-NOAA who caught an error I did not:

There is an error when running ww3_shel for: Running now options: run_test -b slurm -o all -S -T -s MPI -s NO_PDLIB -w work_a -g a -f -p srun -n 24 ../model ww3_tp2.17

Hi @JessicaMeixner-NOAA
Yes - I am getting a segfault there too. It looks like it is ocuring during the generation of the restart file (last file touched is restartNNN.ww3.

It looks like it is failing around line 636 in w3iorsmd, probably because of a out-of-range JSEA value:

DO JSEA=1, NSEA
            CALL INIT_GET_ISEA(ISEA, JSEA)
            NREC   = ISEA + 2
            RPOS  = 1_8 + LRECL*(NREC-1_8)
            WRITEBUFF(:) = 0.
            WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)   # <- failing here...
            WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
END DO

I think this is happening because this block of code is entered for both MPI and SHRD use cases, so the loop would have to be over NSEAL to work when run under MPI (as in this case)

As far as I can tell, the change you made at line 736 is safe as that bit of code only ever gets executed if the model is run in SHRD mode thanks to the #ifdef on line 738 and the conditional on line 740.

I guess there are two options here:

  1. Make the loop variable NSEA for SHRD mode and NSEAL for MPI mode on line 631 (with #ifdefs? You could ditch the INIT_GET_ISEA call in SHRD mode too...)
  2. Stick to the original code and make the assumption that the calling routine always sets NSEAL appropriately even in SHRD mode

@ukmo-ccbunney
Copy link
Collaborator

Just to confirm, modifying w3iorsmd as per below fixes the issue (and the restarts are b4b with develop).
It's a bit ugly though - perhaps it would look nicer with a separate loop variable for the DO loop that is set to NSEA or NSEAL respectively.

! w3iorsmd.f90: Line 631
#ifdef W3_MPI        
          DO JSEA=1, NSEAL
            CALL INIT_GET_ISEA(ISEA, JSEA)
#else            
          DO JSEA=1, NSEA
            ISEA = JSEA
#endif          
            NREC   = ISEA + 2
            RPOS  = 1_8 + LRECL*(NREC-1_8)
            WRITEBUFF(:) = 0.
            WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
            WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
          END DO

@JessicaMeixner-NOAA
Copy link
Collaborator Author

Okay - I finally updated the code with @ukmo-ccbunney suggestion and am re-running regtests.

@JessicaMeixner-NOAA JessicaMeixner-NOAA marked this pull request as ready for review April 25, 2023 13:28
@JessicaMeixner-NOAA
Copy link
Collaborator Author

A re-run of the regtests provide expected results:

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (9 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
ww3_ta1/./work_UPD0F_U                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

@ukmo-ccbunney
Copy link
Collaborator

Changes look good to me. Just running regression tests now.

@MatthewMasarik-NOAA
Copy link
Collaborator

Changes look good to me. Just running regression tests now.

Awesome. Thanks for testing @ukmo-ccbunney

Copy link
Collaborator

@MatthewMasarik-NOAA MatthewMasarik-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review Pass

Testing Pass

Regressions tests show just the known non-b4b's as non-identical.

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (10 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

**********************************************************************
************************ identical cases *****************************
**********************************************************************

matrixCompSummary.txt
matrixCompFull.txt
matrixDiff.txt

@MatthewMasarik-NOAA
Copy link
Collaborator

@JessicaMeixner-NOAA, thank you for this clarification of loop limit for serial IO.

@MatthewMasarik-NOAA MatthewMasarik-NOAA merged commit 7229a4c into NOAA-EMC:develop May 3, 2023
MatthewMasarik-NOAA added a commit to MatthewMasarik-NOAA/WW3 that referenced this pull request Jun 27, 2023
* origin/develop:
  handle NaN air-sea temperatures from nearest land points (NOAA-EMC#869)
  Increase valid_max for f in ounp (NOAA-EMC#1014)
  Bugfix deallocation of invalid memory in ww3_prnc (NOAA-EMC#1016)
  Update to orion intel module path and two typo corrections. (NOAA-EMC#1011)
  Bugfix to out of bounds array write in w3profsmd_pdlib.f90 (NOAA-EMC#1013)
  Simple logic fix for time interpolation of boundary nodes at the end of W3XYPFSNIMP. (NOAA-EMC#1005)
  in w3iors use NSEA instead of NSEAL in serial write/read of VA (NOAA-EMC#954)
  Update documentation for UNST namelist (NOAA-EMC#986)
  In certain coupled configurations, the piece of code testing the coupling frequency to check if 'receive' coupling exchanges need to take place fail, resulting in an infinite loop causing the integration between time zero and the first time step to repeat indefinitely. This check needs to be rewritten, which fixes also issue NOAA-EMC#816 in a simpler way.  (NOAA-EMC#999)
@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the bug/nsealw3iors branch December 12, 2023 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

w3iors NSEAL vs NSEA in non-MPI portions of read/write
3 participants