
ww3_multi hangs when creating restart file with IOSTYP=2 or =3 #290

Closed
mickaelaccensi opened this issue Jan 7, 2021 · 7 comments · Fixed by #327
Comments

@mickaelaccensi (Collaborator)

The bug appears when using IOSTYP=2 or IOSTYP=3; everything works fine with IOSTYP=1.

By tracking where the model keeps waiting, it appears that some processors are stuck in w3wavemd, in CALL MPI_WAITALL, because of a positive value of NRQSG2,

and that the dedicated output processor is stuck in w3iorsmd:

!/MPI                        CALL MPI_WAITALL                         &
!/MPI                           ( 1, IRQRSS(IB), STAT1, IERR_MPI )
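
For context, here is a minimal, self-contained sketch (not WW3 code; all names in it are hypothetical) of the kind of handshake that produces this symptom: the rank that writes the restart posts non-blocking receives and then blocks in MPI_WAITALL until every matching send has arrived, so a send that is never posted, or a request handle that gets corrupted, leaves it waiting forever.

      PROGRAM WAITALL_SKETCH
!     Hypothetical sketch, not WW3 code: a writer rank blocks in
!     MPI_WAITALL until every compute rank's send arrives.
      USE MPI
      IMPLICIT NONE
      INTEGER :: IERR, RANK, NPROC, I
      INTEGER, ALLOCATABLE :: IRQ(:), STATS(:,:)
      REAL, ALLOCATABLE :: RCV(:)
      REAL :: BUF
!
      CALL MPI_INIT ( IERR )
      CALL MPI_COMM_RANK ( MPI_COMM_WORLD, RANK, IERR )
      CALL MPI_COMM_SIZE ( MPI_COMM_WORLD, NPROC, IERR )
!
      IF ( RANK .EQ. 0 ) THEN
!        Writer rank: one receive per compute rank, then wait on all of them.
         ALLOCATE ( IRQ(NPROC-1), STATS(MPI_STATUS_SIZE,NPROC-1),       &
                    RCV(NPROC-1) )
         DO I = 1, NPROC-1
            CALL MPI_IRECV ( RCV(I), 1, MPI_REAL, I, 1,                 &
                             MPI_COMM_WORLD, IRQ(I), IERR )
         END DO
!        Blocks here until every compute rank has sent its message.
         CALL MPI_WAITALL ( NPROC-1, IRQ, STATS, IERR )
      ELSE
!        Compute rank: if this send were skipped on any rank, rank 0
!        would hang in MPI_WAITALL above.
         BUF = REAL(RANK)
         CALL MPI_SEND ( BUF, 1, MPI_REAL, 0, 1, MPI_COMM_WORLD, IERR )
      END IF
!
      CALL MPI_FINALIZE ( IERR )
      END PROGRAM WAITALL_SKETCH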

I'll look for a regtest that highlights the bug

@mickaelaccensi added the bug label on Jan 7, 2021
@mickaelaccensi (Collaborator, Author) commented Jan 8, 2021

This happens only when:
IOSTYP=2 or =3
PSHARE = T
and an output restart file is requested

To reproduce it with a regtest, add a restart output in mww3_test_04/input/ww3_multi_grdset_d.nml:

&OUTPUT_DATE_NML
  ALLDATE%FIELD          = '19680606 000000' '1200' '19680608 000000'
  ALLDATE%POINT          = '19680606 000000' '3600' '19680608 000000'
  ALLDATE%RESTART        = '19680606 020000' '3600' '19680606 020000'
/

Then run the regtest:

./bin/run_test -o both -N -f -S -T -c datarmor_intel_debug -s PR1_MPI -w work_PR1_MPI_d -m grdset_d -f -p $MPI_LAUNCH -n 28 ../model mww3_test_04

It will never finish.

When you kill it, the traceback shows where it is stuck:

ww3_multi          00000000012BC688  w3iorsmd_mp_w3ior         542  w3iorsmd.F90
ww3_multi          0000000000DAA809  w3wavemd_mp_w3wav        1413  w3wavemd.F90
ww3_multi          00000000008BA7CD  wmwavemd_mp_wmwav         871  wmwavemd.F90
ww3_multi          0000000000405363  MAIN__                    150  ww3_multi.F90

Line 542 of w3iorsmd.F90 is the MPI_WAITALL call:

                    IF ( IAPROC .EQ. NAPRST ) THEN
!
                        IH     = 1 + NRQ * (IB-1)
                        CALL MPI_WAITALL                         &
                           ( NRQ, IRQRSS(IH), STAT1, IERR_MPI )
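
To make the indexing easier to follow, here is a small, runnable sketch of the same block-wise wait pattern (each rank just talks to itself; apart from NRQ, IB, IH, IRQRSS and STAT1, every name is made up for the sketch). The requests sit in one flat array with NRQ handles per block, so block IB starts at IH = 1 + NRQ*(IB-1); if that offset or count ever disagrees with what was actually posted, MPI_WAITALL ends up waiting on handles that will never complete.

      PROGRAM BLOCK_WAIT_SKETCH
!     Hypothetical sketch, not WW3 code: wait on a flat request array
!     block by block, NRQ handles per block.
      USE MPI
      IMPLICIT NONE
      INTEGER, PARAMETER :: NRQ = 2, NBLK = 3
      INTEGER :: IRQRSS(NRQ*NBLK), STAT1(MPI_STATUS_SIZE,NRQ)
      INTEGER :: IERR_MPI, RANK, IB, IH, J
      REAL    :: SBUF(NRQ*NBLK), RBUF(NRQ*NBLK)
!
      CALL MPI_INIT ( IERR_MPI )
      CALL MPI_COMM_RANK ( MPI_COMM_WORLD, RANK, IERR_MPI )
      SBUF = 1.
!
!     Post one receive per (block, request) slot; each rank sends to
!     itself purely to keep the sketch self-contained.
      DO J = 1, NRQ*NBLK
         CALL MPI_IRECV ( RBUF(J), 1, MPI_REAL, RANK, J,                &
                          MPI_COMM_WORLD, IRQRSS(J), IERR_MPI )
      END DO
      DO J = 1, NRQ*NBLK
         CALL MPI_SEND ( SBUF(J), 1, MPI_REAL, RANK, J,                 &
                         MPI_COMM_WORLD, IERR_MPI )
      END DO
!
!     Wait block by block: exactly NRQ handles starting at offset IH.
      DO IB = 1, NBLK
         IH = 1 + NRQ * (IB-1)
         CALL MPI_WAITALL ( NRQ, IRQRSS(IH), STAT1, IERR_MPI )
      END DO
!
      CALL MPI_FINALIZE ( IERR_MPI )
      END PROGRAM BLOCK_WAIT_SKETCH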

The bug was introduced by commit e756361.

@ukmo-ccbunney, @ukmo-juan-castillo, @ukmo-ansaulter, could you correct this bug?

@mickaelaccensi (Collaborator, Author)

Hi guys,

Could you please look at this bug? I'm not able to upgrade my forecast system to the latest version of WW3 because of it. Thanks.

@ukmo-ccbunney (Collaborator)

Hi @mickaelaccensi
Apologies for the delay - I am really struggling for time at the moment!
I can confirm that this is hanging for me too on the GNU compiler when writing out the restart file.
I'll chat with @ukmo-juan-castillo today and see if we can get to the bottom of it.
Chris.

@ukmo-juan-castillo (Collaborator)

Sorry for the delay; I have been quite busy this past week. I will start working on this now and give it top priority. I think I know where the problem is, and it should be easy to fix.

@ukmo-juan-castillo (Collaborator)

I ran some tests, and it looks like this bug was present before the new coupling changes were merged. In any case, since these particular lines of code were already on my list of things to look at for the optimization issue, I am working on a fix.

I have narrowed the problem down to the communication request handles, which are somehow being overwritten. This points to an out-of-bounds access or something similar. When I compiled in debug mode I got several errors; I reckon that fixing those will probably fix the problem.
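
As a general MPI note (not necessarily the fix that will be adopted here, just a defensive pattern): pre-filling a request array with MPI_REQUEST_NULL makes MPI_WAITALL treat any slot that was never filled by an MPI_ISEND/MPI_IRECV as already complete, instead of blocking on an undefined handle. A minimal sketch, with all names hypothetical:

      PROGRAM REQUEST_NULL_SKETCH
!     Hypothetical sketch, not WW3 code: null requests are treated as
!     already complete by MPI_WAITALL.
      USE MPI
      IMPLICIT NONE
      INTEGER, PARAMETER :: NRQ = 8
      INTEGER :: IRQ(NRQ), STATS(MPI_STATUS_SIZE,NRQ), IERR
!
      CALL MPI_INIT ( IERR )
!
!     Every slot starts as a null request; only the slots actually used
!     for communication would later be overwritten with real handles.
      IRQ = MPI_REQUEST_NULL
!
!     With no real requests posted, this returns immediately instead of
!     hanging on undefined handle values.
      CALL MPI_WAITALL ( NRQ, IRQ, STATS, IERR )
!
      CALL MPI_FINALIZE ( IERR )
      END PROGRAM REQUEST_NULL_SKETCH

Run-time bounds checking (gfortran -fcheck=bounds, ifort -check bounds) is also handy for catching the kind of out-of-bounds write that can clobber these handles.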

@JessicaMeixner-NOAA (Collaborator)

So I just noticed that the test
run_test -s PR1_MPI -w work_PR1_MPI_e -m grdset_e -f -p mpirun -n 4 ../model mww3_test_03
hangs. It does not hang with other numbers of tasks, but with 4 tasks it hangs. The code in PR ukmo-waves#18 solves this problem.

ukmo-ccbunney pushed a commit to ukmo-waves/WW3 that referenced this issue on Mar 4, 2021:
* Fixes NOAA-EMC#290 (ww3_multi hanging when generating restart with IOSTYP >= 2)
* Also fixes out-of-bounds array access error.
* Includes some MPI optimizations
@ukmo-juan-castillo (Collaborator) commented Mar 11, 2021

@ukmo-ccbunney found that this bugfix also affects the oasis regtests. After careful examination, I have found a more satisfactory solution that solves the problems in both the 'multi' and 'oasis' regtests. This bugfix will affect those configurations, and in particular it will change the restart file of multi configurations.

The changes will be made in the staging branch and tested there.

aliabdolali pushed a commit that referenced this issue on Mar 19, 2021:
* First set of changes intended to fix the bug (#19)

Fixes: #314
* Interpolation weights now correctly calculated on points next to land and BC locations.
* Changes to improve the code: the possibility of reading zero values from the input is considered, points that should not be taken into account in the interpolation are identified by the netCDF fill value, and a subroutine is created to avoid code duplication

* Bug fix and small simplification/optimization change (#18)

* Fixes #290 (ww3_multi hanging when generating restart with IOSTYP >= 2)
* Also fixes out-of-bounds array access error.
* Includes some MPI optimizations

* Correction to the bug fix in branch bf_multi_hang to take into account the coupled configurations, which are also affected

* Small correction to the multi_hang branch: revert changes to JSEA index in w3iorsmd

Co-authored-by: Juan Manuel Castillo Sanchez <[email protected]>
Co-authored-by: ukmo-juan.castillo <[email protected]>