LRU Caching for NetCDF lumped format timeslice chunks #421

mattw-nws · 2022-06-21T14:09:07Z

Caches slices of data in NetCDFPerFeatureDataProvider for performance improvement.

Additions

Test param files (not real params!) for running cat-52 and cat-67 in addition to cat-27, plus a realization config for testing all three with the NetCDF lumped forcing sample file.

Removals

Changes

A Boost-supplied LRU cache implementation caches whole slices of NetCDF data: all catchments for a single timestep of each variable get one cache entry. This dramatically improves performance for forcing reads.
The slightly improved lumped NetCDF generation script allows more flexibility in CSV files (capitalization of the Time field) and extracts the catchment ID more intelligently.
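
For readers unfamiliar with the approach, the slice caching described above can be sketched roughly as follows. This is a minimal Python illustration, not the Boost-based C++ code in this PR: `SliceCache`, `fake_loader`, and the `(variable, time_index)` key are hypothetical names chosen for the example, and the loader stands in for a real NetCDF read.

```python
from collections import OrderedDict


class SliceCache:
    """Tiny LRU cache keyed by (variable, time_index)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical: reads one whole slice
        self.entries = OrderedDict()  # least recently used entry first
        self.misses = 0

    def get(self, variable, time_index):
        key = (variable, time_index)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        values = self.loader(variable, time_index)
        self.entries[key] = values
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return values


def fake_loader(variable, time_index):
    # Stand-in for a NetCDF read: one value per catchment for this timestep.
    return [float(time_index * 10 + c) for c in range(3)]


cache = SliceCache(capacity=2, loader=fake_loader)
# Three catchments asking for the same variable and timestep share one read.
values = [cache.get("T2D", 0)[c] for c in range(3)]
```

Because every catchment typically asks for the same timestep before the model advances, one miss per variable per timestep serves all of them. Capacity matters: if more distinct slices are in flight than the cache can hold, entries are evicted before they are reused.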

Testing

Validated existing NetCDFPerFeatureDataProvider tests.

Screenshots

Notes

Todos

Checklist

Testing checklist (automated report can be put here)

Target Environment support

Linux

stcui007 · 2022-07-19T20:03:18Z

I have only done a superficial reading of the code so far. One note on the commit "Tweaks to allow creation from either "standard" CSV format": on line 29, should the second parameter be 1 instead of 2, as noted somewhere in the C++ code?
chunksizes=(num_catchments,2)

mattw-nws · 2022-07-19T20:39:26Z

@stcui007 Yes, correct--this is adjusted in 440d572.
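
To see why the time length of the chunk matters for a reader that fetches one timestep at a time, here is a rough sketch. It assumes chunks of shape `(num_catchments, chunk_time_len)` and that storage is read in whole chunks, and it ignores any chunk caching inside the NetCDF/HDF5 library. The helper names are hypothetical.

```python
def chunk_for_timestep(t, chunk_time_len):
    """Index of the chunk (along the time axis) that holds timestep t."""
    return t // chunk_time_len


def timesteps_in_chunk(chunk_index, chunk_time_len):
    """All timesteps that come along when that chunk is read."""
    start = chunk_index * chunk_time_len
    return list(range(start, start + chunk_time_len))


# Reading timestep 3 with a chunk time length of 2 also pulls in timestep 2.
with_two = timesteps_in_chunk(chunk_for_timestep(3, 2), 2)
# With a chunk time length of 1, only the requested timestep is read.
with_one = timesteps_in_chunk(chunk_for_timestep(3, 1), 1)
```

With a time length of 1, each chunk lines up with exactly one cached slice (all catchments, one timestep).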

stcui007 · 2022-07-21T16:02:10Z

One comment I have involves NetCDFPerFeatureDataProvider.hpp lines 401 - 424. In particular, I don't fully understand line 422. I assume i iterates through cache slices, while j varies within a slice? Then for raw_values[i+j] the indices i and j would be interchangeable, i.e., (i=0, j=1) and (i=1, j=0) would map to the same element. Would this potentially mix up different slices? In the current setup, since cache_slice_t_size = 1, this won't have any effect. Is this going to be generalized?

Are the comment lines between lines 67 - 69, and 74 - 75 still necessary?

One other question doesn't have anything to do with the current codes, but with the format of the netcdf files. The original format of the AORC netcdf forcing files' attributes contain the scale_factor and offset. If we deal with that in pre-processing, then we don't need to deal with that here. Otherwise, it needs to be considered.

mattw-nws · 2022-07-21T19:54:15Z

> One comment I have involves NetCDFPerFeatureDataProvider.hpp lines 401 - 424. In particular, I don't fully understand line 422. I assume i iterates through cache slices, while j varies within a slice? Then for raw_values[i+j] the indices i and j would be interchangeable, i.e., (i=0, j=1) and (i=1, j=0) would map to the same element. Would this potentially mix up different slices? In the current setup, since cache_slice_t_size = 1, this won't have any effect. Is this going to be generalized?

Yes, this is probably flawed right now for any case where cache_slice_t_size != 1. That should probably be something like [(i*cache_slice_t_size)+j] but that's not quite right either I don't think. More work will be required to correctly support time slices greater than 1... I think catchment slices greater than 1 will be more likely needed, if this implementation survives long enough.
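
The collision raised above can be demonstrated with a small sketch. It assumes a flat buffer in which each slice's values are stored contiguously with a fixed length `slice_len`; the real layout of `raw_values` may differ, which is presumably why the suggested formula is described as not quite right. All names here are hypothetical.

```python
from itertools import product


def naive_index(i, j):
    # Mirrors raw_values[i + j]: slice index and in-slice offset are added.
    return i + j


def row_major_index(i, j, slice_len):
    # Assumes slice i occupies positions [i * slice_len, (i + 1) * slice_len).
    return i * slice_len + j


def count_collisions(index_fn, num_slices, slice_len):
    """Number of (i, j) pairs that land on an already-used flat index."""
    pairs = list(product(range(num_slices), range(slice_len)))
    return len(pairs) - len({index_fn(i, j) for i, j in pairs})


naive_collisions = count_collisions(naive_index, 3, 2)
row_major_collisions = count_collisions(
    lambda i, j: row_major_index(i, j, 2), 3, 2
)
# With slice_len == 1, j is always 0, so the naive form happens to be safe.
naive_collisions_len1 = count_collisions(naive_index, 3, 1)
```

This matches the observation in the thread: the `i + j` form is harmless only while the slice length is 1.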

mattw-nws · 2022-07-21T19:56:52Z

> Are the comment lines between lines 67 - 69, and 74 - 75 still necessary?

No, but they might be needed at larger scales. Left in for reference.

mattw-nws · 2022-07-21T19:58:57Z

> One other question doesn't have anything to do with the current codes, but with the format of the netcdf files. The original format of the AORC netcdf forcing files' attributes contain the scale_factor and offset. If we deal with that in pre-processing, then we don't need to deal with that here. Otherwise, it needs to be considered.

Good question... my assumption was that the library handles these attributes, but that may be incorrect. If we support those in these files, we would need to test that... however these files have float values so those metadata scale values are far less useful.

stcui007 · 2022-07-21T20:05:05Z

This question comes up because in my ESMPY based codes the output netcdf files copy the original attributes. If we don't need this, I'll need to modify the output netcdf file attributes.

stcui007

Comment already provided in PR.

mattw-nws · 2022-07-22T13:18:42Z

> The original format of the AORC netcdf forcing files' attributes contain the scale_factor and offset... This question comes up because in my ESMPY based codes the output netcdf files copy the original attributes

Those attributes should only be copied if the values in the NetCDF need to be scaled and offset... do they? Are the values copied into the new NetCDF (floats) already scaled and offset? Or are they the same values from the original NetCDF before they are scaled and offset?

donaldwj · 2022-07-22T14:01:01Z

I believe scale and offset are used to convert integer-encoded floats back into floats automatically. If we are getting floating point values for those fields, then these values have already been applied. We would only need to duplicate the values if we were also storing as integer-encoded floats. Also, the correct scale and offset values would be different for each field in any case.
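
As background for the integer encoding mentioned above, the usual packing convention is `unpacked = packed * scale_factor + add_offset` (the thread calls the second attribute "offset"). The sketch below is a plain-Python illustration of that arithmetic, not code from this PR. Whether a given reader applies it automatically depends on the library and its settings, and the `pack` and `unpack` helpers are hypothetical.

```python
def pack(value, scale_factor, add_offset=0.0):
    """Encode a float as an integer, losing precision below scale_factor."""
    return round((value - add_offset) / scale_factor)


def unpack(packed, scale_factor, add_offset=0.0):
    """Recover an approximate float from its integer encoding."""
    return packed * scale_factor + add_offset


# A temperature in kelvin stored with a scale factor of 0.1.
packed = pack(293.04, 0.1)
restored = unpack(packed, 0.1)
```

The round trip recovers the value to within half a scale step, which is the point of the encoding: a small integer type can stand in for a float. When the variable is already stored as a float, the packing step saves nothing.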

stcui007 · 2022-07-22T14:40:31Z

If you just read the netcdf file, the scale factor and offset are handled automatically, based on my reading. So you should have the correct values when you read in the file, unless you do anything else.

stcui007 · 2022-07-22T14:50:35Z

Also, if you copy the attributes including scale_factor and offset, the output values in netcdf file also take into account these automatically.

mattw-nws · 2022-07-22T15:14:05Z

@donaldwj @stcui007 Here's what's in the latest file example, via ncdump:

        float T2D(catchment-id, time) ;
                T2D:_FillValue = -99999.f ;
                T2D:missing_value = -32767.f ;
                T2D:long_name = "Temperature" ;
                T2D:short_name = "TMP_2maboveground" ;
                T2D:scale_factor = 0.1f ;
                T2D:units = "K" ;
                T2D:level = "2 m above ground" ;

and

 T2D =
  {2930.423, 2920.427, 2910.339, 2900.349, 2894.427, 2888.816, 2882.97, 
    2879.969, 2876.969, 2873.97},
  {2951.988, 2938.251, 2924.16, 2910.149, 2898.901, 2887.482, 2875.898, 
    2871.486, 2866.829, 2862.471},

First, is this valid, then? Will these floats be rescaled by the library when read? Second, I don't think that's what we want, if they're floats. There's no cost to storing the values without scaling if they're floats, I don't think--it's only a source of confusion.

stcui007 · 2022-07-22T15:20:43Z

It is correct. If you look at the attributes, T2D has a scale factor 0.1. If I copy the original attributes where there is a scale factor, it automatically does that. If we don't want that, we need to change the attributes.
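
A quick sanity check on the dumped numbers, as a sketch: the first three T2D values and the 0.1 scale factor are taken from the ncdump output above, while the plausibility bounds of 180 K to 340 K are an arbitrary assumption made for illustration.

```python
# First three T2D values from the ncdump output, exactly as stored (floats).
stored = [2930.423, 2920.427, 2910.339]
scale_factor = 0.1  # T2D:scale_factor from the same dump


def plausible_kelvin(value, low=180.0, high=340.0):
    # Assumed, deliberately loose bounds for near-surface air temperature.
    return low <= value <= high


# What a reader would see if it applies the scale factor on read.
scaled_on_read = [v * scale_factor for v in stored]

raw_ok = all(plausible_kelvin(v) for v in stored)
scaled_ok = all(plausible_kelvin(v) for v in scaled_on_read)
```

Under these assumptions the stored floats only make sense as temperatures once the scale factor is applied, so a reader that does not apply it would see values ten times too large.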

mattw-nws force-pushed the netcdf-lumped-optimized branch from 4daa471 to 637cc16 Compare June 23, 2022 15:26

mattw-nws pushed a commit to mattw-nws/ngen that referenced this pull request Jul 12, 2022

Fix build errors... this will conflict with NOAA-OWP#421

5c7bf1e

mattw-nws pushed a commit to mattw-nws/ngen that referenced this pull request Jul 12, 2022

Fix test builds. This may conflict with NOAA-OWP#421 and 7df05f3 / 4b2d5f8.

c678337

mattw-nws force-pushed the netcdf-lumped-optimized branch 2 times, most recently from d71ffdb to 22c4974 Compare July 15, 2022 17:09

mattw-nws requested a review from donaldwj July 15, 2022 19:00

mattw-nws force-pushed the netcdf-lumped-optimized branch from 80f1654 to a5e594e Compare July 15, 2022 19:12

mattw-nws pushed a commit that referenced this pull request Jul 19, 2022

Fix build errors... this will conflict with #421

53fa141

mattw-nws pushed a commit that referenced this pull request Jul 19, 2022

Fix test builds. This may conflict with #421 and 7df05f3 / 4b2d5f8 .

ade90bd

mattw-nws force-pushed the netcdf-lumped-optimized branch from 5b9f13e to c7db0e7 Compare July 19, 2022 13:27

Matt Williamson and others added 10 commits July 19, 2022 09:33

Shared DataProvider per file path per process

5588ef7

update cmake file to remove build errors on test code.

23f306c

Fixes for test builds

10c178d

Tweaks to allow creation from either "standard" CSV format

b307989

Draft files for test development

8879bc9

Initial LRU cache for time slice chunks

f05ae52

Tuned cache size to 20, which seems sufficient. Strangely, it seems like 10 should be sufficient but this causes horrible thrashing.

64a1944

bugfix for failing test--index vectors were not reset in loop.

b08bf5f

Optimized: was retrieving 2x necessary slices in common case.

4f52e92

Cleanup

2e9ed48

mattw-nws force-pushed the netcdf-lumped-optimized branch from c7db0e7 to 2e9ed48 Compare July 19, 2022 14:36

Default chunksize should be 1 timestep not 2

440d572

mattw-nws marked this pull request as ready for review July 19, 2022 14:50

mattw-nws requested review from hellkite500 and stcui007 July 19, 2022 14:50

stcui007 approved these changes Jul 21, 2022

donaldwj approved these changes Jul 28, 2022

mattw-nws merged commit f7b361f into NOAA-OWP:master Jul 28, 2022
