Floating-point divide by zero exception in ssnow%smp_hys computation #396

Open
SeanBryan51 opened this issue Sep 10, 2024 · 3 comments

@SeanBryan51
Collaborator

SeanBryan51 commented Sep 10, 2024

After applying a temporary fix for #395, running CABLE-MPI offline (main branch, commit 95b9b5e) with the crujra_accessN96_1h configuration raises the following divide by zero exception:

[gadi-cpu-clx-2663:671237:0:671237] Caught signal 8 (Floating point exception: floating-point divide by zero)
==== backtrace (tid: 671237) ====
 0 0x0000000000012d20 __funlockfile()  :0
 1 0x00000000006c2338 cable_param_module_mp_derived_parameters_()  /home/189/sb8430/cable/src/offline/cable_parameters.F90:2320
 2 0x00000000006228a2 cable_input_module_mp_load_parameters_()  /home/189/sb8430/cable/src/offline/cable_input.F90:2829
 3 0x0000000000416cd2 cable_mpimaster_mp_mpidrv_master_()  /home/189/sb8430/cable/src/offline/cable_mpimaster.F90:601
 4 0x000000000040e5fc MAIN__()  /home/189/sb8430/cable/src/offline/cable_mpidrv.F90:54
 5 0x000000000040da22 main()  ???:0
 6 0x000000000003a7e5 __libc_start_main()  ???:0
 7 0x000000000040d92e _start()  ???:0
=================================
forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source             
cable-mpi          0000000000C7E474  Unknown               Unknown  Unknown
libpthread-2.28.s  0000155548397D20  Unknown               Unknown  Unknown
cable-mpi          00000000006C2338  cable_param_modul        2320  cable_parameters.F90
cable-mpi          00000000006228A2  cable_input_modul        2829  cable_input.F90
cable-mpi          0000000000416CD2  cable_mpimaster_m         601  cable_mpimaster.F90
cable-mpi          000000000040E5FC  MAIN__                     54  cable_mpidrv.F90
cable-mpi          000000000040DA22  Unknown               Unknown  Unknown
libc-2.28.so       0000155547C677E5  __libc_start_main     Unknown  Unknown
cable-mpi          000000000040D92E  Unknown               Unknown  Unknown

The exception occurs on this line of the code:

(ssnow%ssat_hys(i,k)-ssnow%watr_hys(i,k)) )**&

It looks like ssnow%ssat_hys(i,k) and ssnow%watr_hys(i,k) are both uninitialised and contain the same garbage value, so their difference is zero. That difference is the denominator of the expression, hence the divide by zero.
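To make the failure mode concrete, here is a minimal standalone sketch (not CABLE code) that sets the two values equal and guards the denominator before dividing. The scalar names are hypothetical stand-ins for the array elements in the failing expression, the numeric values are made up, and the fallback of -sucs is an assumption borrowed from the value GWspatialParameters assigns to smp_hys. A guard like this only avoids the exception; it would hide the underlying uninitialised data rather than fix it.

```fortran
program smp_hys_guard_sketch
  ! Standalone illustration; not CABLE code. The scalar names below are
  ! hypothetical stand-ins for the ssnow%*_hys and soil%* array elements.
  implicit none
  integer, parameter :: r_2 = selected_real_kind(12, 50)
  real(r_2) :: wb_hys, watr_hys, ssat_hys, sucs, bch, denom, smp_hys

  ! Reproduce the failure condition: both fields hold the same value,
  ! as they might when neither has been initialised. Values are made up.
  watr_hys = 0.123_r_2
  ssat_hys = 0.123_r_2
  wb_hys   = 0.2_r_2
  sucs     = 0.1_r_2
  bch      = 4.0_r_2

  denom = ssat_hys - watr_hys   ! exactly zero here

  if (denom > 0.0_r_2) then
     smp_hys = -sucs * ((wb_hys - watr_hys) / denom)**(-1.0_r_2 / bch)
  else
     ! Assumed fallback: the value GWspatialParameters assigns to smp_hys.
     smp_hys = -sucs
     print *, 'non-positive denominator, using fallback smp_hys =', smp_hys
  end if
end program smp_hys_guard_sketch
```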

Steps to reproduce (Gadi)

Apply the following patch to fix the error described in #395 (WARNING - this patch is untested and should not be used for work other than reproducing this issue):

diff --git a/src/offline/cable_parameters.F90 b/src/offline/cable_parameters.F90
index b6133f6..c741eaf 100644
--- a/src/offline/cable_parameters.F90
+++ b/src/offline/cable_parameters.F90
@@ -3340,11 +3340,11 @@ CONTAINS
     totdepth = 0.0
     DO is = 1, ms-1
        totdepth = totdepth + soil_zse(is) * 100.0  ! unit in centimetres
-       veg%froot(:, is) = MIN( 1.0, 1.0-veg%rootbeta(:)**totdepth )
+       veg%froot(ifmp:fmp, is) = MIN( 1.0, 1.0-veg%rootbeta(ifmp:fmp)**totdepth )
     END DO
-    veg%froot(:, ms) = 1.0 - veg%froot(:, ms-1)
+    veg%froot(ifmp:fmp, ms) = 1.0 - veg%froot(ifmp:fmp, ms-1)
     DO is = ms-1, 2, -1
-       veg%froot(:, is) = veg%froot(:, is)-veg%froot(:,is-1)
+    veg%froot(ifmp:fmp, is) = veg%froot(ifmp:fmp, is)-veg%froot(ifmp:fmp,is-1)
     END DO

   END SUBROUTINE init_veg_from_vegin

The steps to reproduce the error are the same as those described in #395.

@SeanBryan51
Collaborator Author

@rkutteh @ccarouge FYI this issue looks like it is related to the GW work.

Currently all ssnow%*_hys variables are uninitialised, which causes the exception. It looks like initialisation of some ssnow%*_hys variables occurs in the subroutine GWspatialParameters here:

ssnow%smp_hys(:,:) = -soil%sucs_vec(:,:)
ssnow%hys_fac(:,:) = 1.0
ssnow%watr_hys(:,:) = soil%watr(:,:)
ssnow%ssat_hys(:,:) = soil%ssat_vec(:,:)

Note: GWspatialParameters does not seem to initialise the ssnow%sucs_hys or ssnow%wb_hys variables.

For the next GW changes, are there plans to remove the problematic code, i.e.:

! ____________________ MMY comment out as we don't use hys ___________________________
do k=1,ms
   do i=1,mp
      if (ssnow%wb_hys(i,k) .lt. 0._r_2) then
         ssnow%wb_hys(i,k) = ssnow%wb(i,k)
      end if
      ssnow%wb_hys(i,k) = max(soil%watr(i,k) ,min(soil%ssat_vec(i,k), ssnow%wb_hys(i,k)))
      if (ssnow%smp_hys(i,k) .lt. -1.0e+30_r_2) then !set to missing, calc
         ssnow%smp_hys(i,k) = -soil%sucs_vec(i,k)* &
              ( (ssnow%wb_hys(i,k)-ssnow%watr_hys(i,k))/&
              (ssnow%ssat_hys(i,k)-ssnow%watr_hys(i,k)) )**&
              (-1._r_2/soil%bch_vec(i,k) )
      end if
      ssnow%smp_hys(i,k) = max(-1.0e10,min(-soil%sucs_vec(i,k),ssnow%smp_hys(i,k) ))
   end do
end do

or ensure all ssnow%*_hys variables are initialised?
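As a rough sketch of the second option, the standalone program below (not CABLE code) gives every *_hys field a defined value immediately after allocation. hys_type is a hypothetical cut-down stand-in for the ssnow type, and the array extents and soil values are toy assumptions. The first four assignments mirror the lines quoted from GWspatialParameters. The values chosen for sucs_hys and wb_hys are assumptions, since the thread does not say what they should be.

```fortran
program hys_init_sketch
  ! Standalone illustration; not CABLE code. hys_type is a hypothetical
  ! cut-down stand-in for the ssnow derived type.
  implicit none
  integer, parameter :: r_2 = selected_real_kind(12, 50)
  integer, parameter :: mp = 2, ms = 3            ! toy array extents
  type :: hys_type
     real(r_2), allocatable :: smp_hys(:,:), hys_fac(:,:)
     real(r_2), allocatable :: watr_hys(:,:), ssat_hys(:,:)
     real(r_2), allocatable :: sucs_hys(:,:), wb_hys(:,:)
  end type hys_type
  type(hys_type) :: ssnow
  real(r_2) :: watr(mp,ms), ssat_vec(mp,ms), sucs_vec(mp,ms), wb(mp,ms)

  ! Toy soil values (assumptions, chosen only so that ssat > watr).
  watr     = 0.05_r_2
  ssat_vec = 0.45_r_2
  sucs_vec = 0.1_r_2
  wb       = 0.3_r_2

  allocate(ssnow%smp_hys(mp,ms), ssnow%hys_fac(mp,ms), ssnow%watr_hys(mp,ms), &
           ssnow%ssat_hys(mp,ms), ssnow%sucs_hys(mp,ms), ssnow%wb_hys(mp,ms))

  ! The four assignments quoted from GWspatialParameters.
  ssnow%smp_hys  = -sucs_vec
  ssnow%hys_fac  = 1.0_r_2
  ssnow%watr_hys = watr
  ssnow%ssat_hys = ssat_vec
  ! The two fields noted as left unset; these values are assumptions.
  ssnow%sucs_hys = sucs_vec
  ssnow%wb_hys   = wb

  ! With every field defined, the denominator from the failing line is positive.
  print *, 'min(ssat_hys - watr_hys) =', minval(ssnow%ssat_hys - ssnow%watr_hys)
end program hys_init_sketch
```

Initialising at allocation time, rather than in a later parameter routine, means the fields are defined regardless of which code path reads them first.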

@rkutteh
Collaborator

rkutteh commented Nov 11, 2024

@SeanBryan51 @ccarouge As Claire already knows, I have fixed all these bugs in my GW branch that is now in the process of making its way into the trunk. My own view is to wait a bit until this process is finished (this month I think) so as to avoid reinventing the wheel. Just for the record, I had compiled my GW branch with "check all" and fixed every bug it flagged.

@SeanBryan51
Collaborator Author

@rkutteh -check and -ftrapuv are not 100% reliable at finding uninitialised variables (see this talk for more info). Runtime memory checking tools are more robust. I have been using ddt with memory debugging enabled, which I recommend. It is easy to run CABLE with ddt on Gadi using offline debugging:

module load linaro-forge/24.0.2
ddt --offline --mem-debug=balanced mpiexec -n <NCPUS> ./cable-mpi

Happy to share more details if you are interested.
