
Performance of subsetting perturbation timeseries dataset #129

Closed · SorooshMani-NOAA opened this issue Jan 10, 2024 · 6 comments

@SorooshMani-NOAA (Collaborator) commented Jan 10, 2024

The combined dataset for all perturbation runs is a 3D dataset with dimensions (run, member no., and time). To process it using the existing KLPC method, we first stack the data into 2D (as shown in the borrowed snapshot below).

[Image: snapshot of the perturbation dataset stacked from 3D into 2D]
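For reference, a minimal sketch of the shape transformation the snapshot illustrates (dimension sizes and the elevation variable are illustrative, not the real data):

import numpy as np
import xarray as xr

# Toy combined dataset: 3 runs, 4 time steps, 5 mesh nodes
ds3d = xr.Dataset(
    {'elevation': (('run', 'time', 'nSCHISM_hgrid_node'),
                   np.random.rand(3, 4, 5))}
)

# Fold (time, node) into a single dimension so each run becomes a flat sample
ds2d = ds3d.stack(node=('time', 'nSCHISM_hgrid_node'))
print(ds2d['elevation'].shape)  # (3, 20)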

Passing this 2D dataset to the subsetting function fails to generate results, even if we take only two time steps.

Ticket created from https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1876292717
Test code at https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1877314968

@SorooshMani-NOAA (Collaborator, Author) commented:
The issue seems to be related to how the stacking operation works. This is how we currently stack:

subset_time_stack = subset_time_chunk.stack(node=('time', 'nSCHISM_hgrid_node'))

By default, stacking the two dims creates a multi-index dimension whose labels refer back to the original stacked dimensions: in this case, tuples of (time, schism_index).
https://docs.xarray.dev/en/latest/generated/xarray.DataArray.stack.html
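A small sketch of that default behavior (toy data; the coordinate names mirror the SCHISM output):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'v': (('time', 'nSCHISM_hgrid_node'), np.arange(6).reshape(2, 3))},
    coords={'time': [10, 20], 'nSCHISM_hgrid_node': [0, 1, 2]},
)
stacked = ds.stack(node=('time', 'nSCHISM_hgrid_node'))
# The stacked dim is indexed by a pandas MultiIndex of (time, node) tuples:
print(stacked.indexes['node'][:3])  # [(10, 0), (10, 1), (10, 2)]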

During the subsetting call on this stacked dataset, operations on the node dimension behave differently from the normal case, where the labels are plain numbers (they are now tuples of dates and numbers).

Specifically, the performance bottleneck seems to be in processing the element variable of the dataset, due to the isin check. The right fix (needs testing) is to set the stacked coordinate to simply equal the repeated nodes array. We can also explore what happens if we just remove the element from the dataset (i.e., is this the only bottleneck?), as sketched below.
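A quick way to test that second idea, assuming the connectivity variable is named element as in typical SCHISM outputs (hypothetical name):

# Hypothetical: drop the mesh connectivity before subsetting to check
# whether the isin-based element filtering is the only bottleneck
subset_no_element = subset_time_stack.drop_vars('element')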

@SorooshMani-NOAA (Collaborator, Author) commented Jan 10, 2024

Example of how to do the stacking:

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds = ds.assign_coords(node=np.array([1, 2, 3, 1, 2, 3]), time=np.arange(7))
ds['v'] = (('node', 'time'), np.arange(42).reshape(6, 7))
# Stack without a MultiIndex, then promote the plain node coord to the dim:
ds = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')

Still, since we want to reshape the whole thing back at some point, maybe we need to make sure we don't drop any of the times(?). Either that, or when we reshape back, we can use the still-existing time vector, which is now dependent on node (see the sketch below).
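A sketch of that reshape, continuing from the snippet above (variable names are illustrative):

# `time` survives as a plain coordinate, repeated along the node dimension,
# so the flat array can be reshaped back without an unstack:
n_time = np.unique(ds['time'].values).size  # 7 distinct time steps
v2d = ds['v'].values.reshape(n_time, -1)    # back to (time, node): (7, 6)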

Current working code:

subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)

@SorooshMani-NOAA changed the title from "Subset perturbation timeseries dataset" to "Performance of subsetting perturbation timeseries dataset" on Jan 11, 2024
@FariborzDaneshvar-NOAA (Collaborator) commented:
Thanks @SorooshMani-NOAA, it looks like this approach is working.
Here are the execution times of the subset_dataset() function for a subset of the timeseries based on nodes that will be wet at least once (223,482 out of 701,988):

  • two timesteps stacked: ~1 min
  • entire runtime (468 timesteps) stacked: ~11 min

Execution of the subset_dataset() function for all 701,988 nodes (instead of the 223,482 nodes from the max-elevation subset) completed in ~26 min, but with warning messages about large graph sizes.

Do you recommend conducting two-step subsetting (first based on max elevation, then on the timeseries), or a single step?

@SorooshMani-NOAA (Collaborator, Author) commented:
@FariborzDaneshvar-NOAA based on our discussion yesterday, do you think we should keep this open or can we close it?

@FariborzDaneshvar-NOAA (Collaborator) commented:
@SorooshMani-NOAA The subsetting part is working fine, so I guess you can close this ticket. The memory issue for surrogate expansion can be addressed separately.

@SorooshMani-NOAA (Collaborator, Author) commented:
Let's address that separately: this ticket was about being able to subset at all, while that one is about analyzing the data.
