cicecore: fix some use of uninitialized variables #560
Conversation
When CICE is compiled with the 'CICE_IN_NEMO' CPP macro and 'calc_Tsfc' is false, ice_step::coupling_prep calls CICE_RunMod::sfcflux_to_ocn to transfer heat fluxes on ice-free grid points to the ocean. The calculation in sfcflux_to_ocn uses the Icepack parameter 'Lsub', but without declaring this variable or querying it from Icepack, which results in a compile-time failure when using the standalone driver with 'CICE_IN_NEMO' defined. Fix that by correctly declaring 'Lsub' and querying it through the Icepack interface.
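The fix follows the standard query idiom used throughout CICE; a minimal sketch, assuming the usual surrounding declarations (subname, nu_diag) are in scope:

```fortran
      use icepack_intfc, only: icepack_query_parameters, &
          icepack_warnings_flush, icepack_warnings_aborted

      real (kind=dbl_kind) :: Lsub   ! latent heat of sublimation (J/kg)

      ! query the parameter from Icepack instead of assuming it is in scope
      call icepack_query_parameters(Lsub_out=Lsub)
      call icepack_warnings_flush(nu_diag)
      if (icepack_warnings_aborted(subname)) &
         call abort_ice(error_message=subname, file=__FILE__, line=__LINE__)
```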
In subroutine 'init_grid2', the array 'tarea' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in subroutine 'makemask' in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1693). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it. [1] CICE-Consortium#180
In subroutine 'makemask', the array 'uvm' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1672). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it. [1] CICE-Consortium#180
In subroutine 'init_coupler_flux', the initialization of the 'sst' array to 'Tf' is protected by a '#ifndef CICE_IN_NEMO' preprocessor directive. This has been the case since CICE-Consortium/CICE-svn-trunk@151b9af (cice-4.0~17, 2008-04-28), though that commit does not explain why. This is probably because in some conditions (depending on active CPPs at NEMO compilation time, as well as NEMO namelist parameters), the SST needs to be passed early to CICE, before calling CICE_Initialize (see the CICE coupling interface in NEMO [1]), and initializing it to 'Tf' in 'init_coupler_flux' would erase the values already passed from the ocean. If however the SST is *not* passed to CICE before CICE_Initialize is called, and 'CICE_IN_NEMO' is in use, then 'ice_init::set_state_var' reads from the uninitialized 'sst' array when placing ice where the ocean surface is cold. In ECCC's in-house version of the coupling interface (sbc_ice_cice), extensive changes have been made to work around bugs in the original version and limitations due to the sequence in which CICE and NEMO fields are initialized, such that we manually call 'ice_init::init_state' a second time at the end of subroutine sbc_ice_cice::cice_sbc_init. To avoid using an uninitialized array, remove the '#ifndef CICE_IN_NEMO' around the initialization of the 'sst' array to 'Tf', so that it is always initialized. These values will anyway be replaced by the "correct" ones when init_state is called a second time in our in-house version. [1] http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/releases/release-3.6/NEMOGCM/NEMO/OPA_SRC/SBC/sbcice_cice.F90#L176
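The change itself amounts to dropping the preprocessor guard; a sketch (the exact declarations and comments in ice_flux.F90 may differ):

```fortran
! before: skipped entirely in CICE_IN_NEMO builds
#ifndef CICE_IN_NEMO
      sst(:,:,:) = Tf(:,:,:)   ! initialize SST to the freezing temperature
#endif

! after: always initialized; coupled runs overwrite these values later
      sst(:,:,:) = Tf(:,:,:)
```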
Generally fine. A few comments.
Thanks for taking a look Till!
I think I agree. I did not look into it further when I saw that the initialization was commented, I just uncommented it. What you suggest seems more sane.
You mean that the initialization should not be necessary? Well, at least on our "CREG025-Pacific extension" grid, some values were uninitialized in the condition at line 1673 if I did not initialize the array first.
Yes, I looked at the changes to sbc_ice_cice in your nemo4cice project and they were mostly the same as mine, although our version has a lot of in-house modifications that I don't yet fully understand. Sometimes stuff was implemented in a not-so-clean way because of lack of time, and it's one of the mid-term goals of my project to clean that up. We definitely will let you know.
```diff
@@ -384,7 +384,7 @@ subroutine init_grid2
       ! T-grid cell and U-grid cell quantities
       !-----------------------------------------------------------------

-      ! tarea(:,:,:) = c0
+      tarea(:,:,:) = c0
```
I think this is unnecessary as tarea is in a HaloUpdate shortly after it is computed. I don't think you want any zero values in tarea at all.
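For reference, the update referred to here looks roughly like this (a sketch from memory of ice_grid.F90; the exact argument list may differ):

```fortran
      ! fill ghost cells of tarea from neighboring blocks; cells with no
      ! neighbor are set to the optional fillValue
      call ice_HaloUpdate (tarea, halo_info, &
                           field_loc_center, field_type_scalar, &
                           fillValue=c1)
```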
Thanks for the review. I agree that it should not be necessary, but as I wrote, if I do not initialize here, an uninitialized value gets used... So that would mean that not all halo cells are initialized by the halo update. I'll have to investigate that further on the specific grid we are using.
It could be that your compiler is being very picky. What line does it abort on exactly with the uninitialized variable? I've done the check with NAG and Intel and it runs fine. As you say, it might be a particular feature of your grid. There was an issue awhile ago from one of the European groups that trapped this same error. I am fine with initializing uvm to zero, but tarea should never be zero.
It aborts at line 1694 in makemask
Ok. A cleaner initialization would be as follows:
```fortran
do j = 1, ny_block
do i = 1, nx_block
   im = max(i,ilo)
   jm = max(j,jlo)
   ip = min(i,ihi)
   jp = min(j,jhi)
```
Then replace all of the i,j indexes on the right with ip/jp. This will extrapolate the tarea values to the ghost cells including the four corners. Thoughts?
Sorry, these should all be im/jm or ip/jp.
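Combining the two comments, the intended loop is presumably something like the following sketch; it assumes tarea is computed as dxt*dyt as in init_grid2, and uses a single clamped index per dimension to cover both the im/jm and ip/jp cases:

```fortran
      do j = 1, ny_block
      do i = 1, nx_block
         im = min(max(i,ilo),ihi)   ! clamp i into the physical range
         jm = min(max(j,jlo),jhi)   ! clamp j into the physical range
         ! boundary values are extrapolated into ghost cells and corners
         tarea(i,j,iblk) = dxt(im,jm,iblk)*dyt(im,jm,iblk)
      enddo
      enddo
```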
If the maskhalo is turned on, it's possible that it is also masking some part of the initialization. Just speculation on my part; we'd have to look into it further.
Actually, the maskhalo is not being used for the halo update of tarea, which is correct, so forget I said that. I assume tarea is not initialized on the entire halo because of the land mask? We could also fix the loop in makemask, but that's likely to affect other loops in other parts of the code in the same way. This is almost certainly a case of variables not being initialized in places they are not needed, but where they are referenced at some point later. The proper fix, of consistently implementing all loops, is probably not worth the effort, but if we want to update a loop here or there, that's probably OK too. I'm not opposed to this type of fix in this case.
```diff
@@ -1636,7 +1636,7 @@ subroutine makemask
       !-----------------------------------------------------------------

       bm = c0
-      ! uvm = c0
+      uvm = c0
```
Similarly here. The HaloUpdate should take care of uvm values in the ghost cells.
It aborts at line 1673 in makemask
No, I meant the Lsub initialization. This is done in ice_forcing.F90.
The European group may have been DMI. I should look at the report Jacob wrote some time ago.
It was Issue #61. I have reopened it and linked it here.
@phil-blain, is testing complete? Is this ready to merge?
Not ready to merge at all. I've yet to go back into the code and come up with a better solution for that.
Ok, I'm beginning to understand a little bit more. It seems it is due to the fact that the domain decomposition we are using has some padded blocks. I was able to plot the arrays in question (plots not shown here). So my guess at this point is that the halo updates are not correctly updating padded blocks? The darker line around the edge of the plots has value 1.0; I don't know yet where this value is coming from...
I think that the 1.0 originates from the ice halo update.
Good catch. Cell padding has always been a headache for me. I try to avoid it. I think that the cells in the padded area are not even treated as ghost cells. So, that means halo updates do not touch them. However, the model should never reach in there. In my opinion, the compiler is just being pedantic about uninitialized variables here.
If I understand the code correctly, then the block sizes (nx_block and ny_block) are constant, whereas jhi and ihi are adjusted to the global domain size. Then in the last block towards the east and north, the loops may run out of the global domain when they go from 1 to ny_block or 1 to nx_block (padding from e.g. ihi to nx_block).
I believe that is all correct. Each block has its own ilo, ihi, jlo, jhi defined. That's what this is all about.
The padded areas should not be used in general. If they are, we might want to try to fix that. I assume the halos do work correctly on the padded blocks. We do have some bit-for-bit checks for different decomps, some padded. Is the padded cell halo index ihi+1 or is it still nx+1? I actually don't know.
Not looking at the code, so someone needs to confirm this, but what I recall is that the padded cells start outside of the halo cells, and they should be treated as 'missing'. Entire missing (land) blocks are never accessed, but it's possible that 'missing' padded cells still get noticed by compilers, even if they are never accessed. If they are being accessed, then there's definitely a bug.
Very good points. Don't we use nghost as we might have more than one ghost cell in each direction? The ilo, ihi, jlo, jhi and nx_block, ny_block are unique for each block, including the padded areas. There could be loops that only go to +/- nghost relative to the lo/hi indices. The padded areas should still be "interior" cells, but just masked out, right? This has always confused me.
The padded cells are assigned a global index of zero, which can never occur for physical cells. This is how they are identified and avoided for boundary conditions, parallelization, etc. ilo:ihi, jlo:jhi index only the physical domain, and nx_block, ny_block are those physical cells plus ONE row or column of ghost cells [EDIT: on each side of the physical domain]. The code was written in a general way to allow more ghost cells, but this has never been tried, and there are comments in the code saying that some routines will need to be updated to use more than one. Padded cells lie beyond nx_block, ny_block for the eastern- and northern-most blocks, and they should never be accessed in the main code once the decomposition has been done. It's possible that the haloes along those edges aren't right (especially the two tripole forms), but the code has been exercised enough that I trust those pretty well.
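To make the 'global index of zero' point concrete, a hypothetical guard could look like this (i_glob/j_glob are the block-structure fields from ice_blocks.F90; this exact check is not claimed to exist in the code):

```fortran
      this_block = get_block(blocks_ice(iblk), iblk)
      do j = 1, ny_block
      do i = 1, nx_block
         ! padded cells carry a global index of zero; skip them
         if (this_block%i_glob(i) == 0 .or. &
             this_block%j_glob(j) == 0) cycle
         ! ... only physical and ghost cells are touched here ...
      enddo
      enddo
```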
I agree. There are a lot of problems or inconsistencies in the code that we are gradually fixing as we come across them. Let's definitely fix the loops that are causing the crashes in this case, and initialize stuff better if that's what's needed. Otherwise we can keep our eyes open for similar issues as new code modifications go in. As an aside, it would be interesting to know how much of the code has 'aged' versus getting recoded or modified somehow. Probably a lot of it turns over, so to speak, every 10 years or so. Although the 'blame' button gives some info, it doesn't necessarily capture white space changes... I'm not proposing we figure this out! But I've also wondered how many lines of code we actually have (as opposed to comments and blank lines). Just curious.
[I thought I had hit the "Comment" button on Friday, but it seems not!] Yeah, after thinking about this since I posted my response above, I also think that fixing the loop indices where they need to be fixed is a better approach. This way, we should easily find other places in the code that need to be fixed, whereas initializing the padded cells would "hide" possible loop index "errors" in those other places. I will also have a closer look at the halo update code to understand what's going on with the padded cells.
I just ran the following test on cheyenne with three compilers, smoke_gx3_8x2x8x10x20_debug_diag1_droundrobin_run2day, and all pass. There is a long-standing issue to update the debug flags to check more things and initialize data to NaNs. @phil-blain, can you tell me what compiler flags you're using for your testing? This is with an Intel compiler, correct? It might be worthwhile to expand this PR to extend the debug flags, add a debug padded test into the base suite, test with multiple compilers, and fix loops. I think that requires some comprehensive effort, and I'm willing to take that on. Maybe we can include it by the end-of-the-month release? I'm also happy to just merge this PR as a starting point to fix the issues that @phil-blain has run into. We can then continue addressing the issue moving forward as I've suggested.
[@apcraig I was writing this as I received your answer, so I'll read it and answer separately] OK, some updates now: I ran the test suite (results table omitted), so 5 tests are failing. Apart from the first one ("dspacecurve" - see below), they all abort at the same place. Next, I tried to change the loop indices for that loop in makemask:

```diff
diff --git i/cicecore/cicedynB/infrastructure/ice_grid.F90 w/cicecore/cicedynB/infrastructure/ice_grid.F90
index 5308236..33cae7f 100644
--- i/cicecore/cicedynB/infrastructure/ice_grid.F90
+++ w/cicecore/cicedynB/infrastructure/ice_grid.F90
@@ -1684,8 +1684,14 @@ subroutine makemask
       tarean(:,:,iblk) = c0
       tareas(:,:,iblk) = c0

-      do j = 1, ny_block
-      do i = 1, nx_block
+      this_block = get_block(blocks_ice(iblk),iblk)
+      ilo = this_block%ilo
+      ihi = this_block%ihi
+      jlo = this_block%jlo
+      jhi = this_block%jhi
+
+      do j = jlo, jhi
+      do i = ilo, ihi
       if (ULAT(i,j,iblk) >= -puny) lmask_n(i,j,iblk) = .true. ! N. Hem.
       if (ULAT(i,j,iblk) < -puny)  lmask_s(i,j,iblk) = .true. ! S. Hem.
```

Here I used the same block indices as in the rest of the file. I then reran the tests; they now abort in to_ugrid (cicecore/cicedynB/infrastructure/ice_grid.F90, lines 1922 to 1941 at c136267), so uninitialized values are still being accessed there.

One last thing, about the "dspacecurve" test: it's failing in a different way, so this might be a different bug, but just to document it here (we might want to open a separate issue for that): [error output omitted]
@apcraig yes, this is with Intel. The flags that I use include -init=snan,arrays (full list omitted).
GCC/GFortran should have equivalent flags that I do not remember off the top of my head. I think we should fix the loops (or the halo updates, if that is the problem - although, as @JFLemieux73 and I were discussing offline last Friday, if the decomp suite is bit-for-bit, I have trouble understanding how they could be wrong... meaning that if we are accessing uninitialized locations and still getting bit-for-bit results, even between padded and non-padded decompositions, then maybe it does not influence the outputs in the end somehow...). Anyway, I think it's better not to do a band-aid fix.
I think we're converging on a plan. Let's try not to duplicate efforts; do you want me to look into the halo update? Or how can I help best? The spacecurve thing has always been a little dicey in its implementation. It might be better as a separate issue; I think it depends what tests we plan to add. I also agree that the fact that we're getting bit-for-bit answers with different decompositions strongly suggests these problems are not influencing model results. I think that's allowed us to get away with some of this.
spacecurve failed the same way when I used machine freya and debug options. On topic: I think that it is dangerous to initialize to 0. If anything, then to a huge number, in order to make the simulation fail in case someone creates a loop that enters the padding zone or other areas that are not initialized. As you have agreed upon, I much prefer not to initialize, and then to fix the problem when issues are found.
I have duplicated @phil-blain's results. I have also looked at the halo implementation, at least initially. The halo points are always next to the ilo/ihi/jlo/jhi indices. One thing is that the fill in the halo seems to operate assuming no padding; the bounds used there (snippets omitted) should probably be adjusted. I'm looking at the error in the to_ugrid subroutine now.
I think that the 1:ny_block should also be changed to jlo-nghost:jhi+nghost, and the same in the other dimension.
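A hypothetical sketch of that change for one halo band (the array name, fill value, and exact bounds are invented for illustration; the real code in ice_boundary.F90 is organized differently):

```fortran
      ! current form (roughly): assumes the halo sits at the block edge
      work(1:nx_block, ny_block-nghost+1:ny_block) = fill

      ! suggested form: keep the fill adjacent to the physical domain,
      ! which also holds on padded blocks
      work(ilo-nghost:ihi+nghost, jhi+1:jhi+nghost) = fill
```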
@TillRasmussen, that's correct. The protocode above is wrong. What's even more difficult is that we don't know ilo/ihi/jlo/jhi in ice_boundary.F90, where the halo update is done and prefilled, and I'm not sure we can query it because we'll end up with circular logic. The halo does a bunch of precomputation based on neighbor info and global info, but I'm not sure there is an awareness of its local list of blocks. The problem is that you want to fill the entire halo with the fill value, not just the part of the halo that's going to be filled by the haloupdate.
So, I'm pretty sure that the bug in the fill setting in ice_boundary is causing the problems in to_ugrid. Unfortunately, I haven't found a way to fix that bug yet because we don't have access to the list of local blocks, so that will take a bit more work. I can create an issue. I have duplicated all of @phil-blain's issues and fixes. I think we should proceed as follows: [list omitted]
@phil-blain, I can do a bunch of this and then PR a change onto your branch. Or if you prefer to delete this PR, I can create a new one. We could also merge this PR with what you like, and then I can create a new PR with some additional stuff for cheyenne and maybe some new tests. I do want to test across the full test suite on cheyenne, and cheyenne is going to be down the next 2 days, so that will slow me down a bit. But I'll start to expand my testing and see how that goes.
It looks like some of these changes in makemask are changing answers. I need to dig a bit further.
I've extended this PR in a branch in my fork and added some new tests turning on debug with various decompositions, updated cheyenne debug to add signaling NaNs, and fixed the spacecurve bug. I also did a bunch of validation testing for bit-for-bit. There are several subtle issues. I will run a full test suite soon, and then maybe we can merge my changes to this branch or create a new PR. I am going to look at #572 again to see if I can get that implemented properly on the branch. If I can figure it out quickly, I'll include it. Otherwise, we can defer it a bit.
Thanks Tony. Sorry for not being very responsive - I'm only working half days and have a lot going on both at and outside work... Might be simpler in a new PR. Let's see what you come up with and then we can decide which commits we keep in which PR. If you update the machines files, could you add the signaling NaNs also for our machines? Nice to have the spacecurve bug also fixed!
@phil-blain, sounds good and will do.
I ran a full test suite with the "full" debug options and there are a bunch of other failing tests now. I have fixed a couple things, but can't take everything on at the moment. I will create a new PR and we need to continue to work on this over time. I am going to back off on the debug compiler options on cheyenne until all the tests can pass. I will update the issue related to extending the debug compiler options as well.
Closing for #579
PR checklist
Short description: cicecore: fix some use of uninitialized variables
Developer(s): P.Blain
Testing: in progress
This is the first one of a series of small PRs I'm preparing to upstream some changes I've had to make to CICE6 during my work on coupling with NEMO3.6. This one deals with use of uninitialized values, which I've detected by compiling with -init=snan,arrays (Intel). I will start the base_suite right away and update the description with the test results once they are finished running. See the commit messages for complete details.