
Performance of subsetting perturbation timeseries dataset #129

Closed · SorooshMani-NOAA opened this issue Jan 10, 2024 · 6 comments

@SorooshMani-NOAA (Collaborator) commented Jan 10, 2024

The combined dataset for all perturbation runs is a 3D dataset with dimensions (run, member no., and time). To process it using the existing KLPC method, we first stack the data into 2D (as shown in the borrowed snapshot below).

[Image: snapshot of the perturbation dataset stacked from 3D into 2D]
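For reference, a minimal sketch of the shape transformation the snapshot illustrates (dimension sizes and the elevation variable are illustrative, not the real data):

import numpy as np
import xarray as xr

# Toy combined dataset: 3 runs, 4 time steps, 5 mesh nodes
ds3d = xr.Dataset(
    {'elevation': (('run', 'time', 'nSCHISM_hgrid_node'),
                   np.random.rand(3, 4, 5))}
)

# Fold (time, node) into a single dimension so each run becomes a flat sample
ds2d = ds3d.stack(node=('time', 'nSCHISM_hgrid_node'))
print(ds2d['elevation'].shape)  # (3, 20)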

Passing this 2D dataset to the subsetting function fails to generate results, even if we take only two time steps.

Ticket created from https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1876292717
Test code at https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1877314968

@SorooshMani-NOAA (Collaborator, Author) commented:
The issue seems to be related to how the stacking operation works. This is how we currently stack:

subset_time_stack = subset_time_chunk.stack(node=('time', 'nSCHISM_hgrid_node'))

By default, stacking the two dims creates a multi-index dimension whose labels refer back to the original stacked dimensions: in this case, tuples of (time, schism_index).
https://docs.xarray.dev/en/latest/generated/xarray.DataArray.stack.html
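A small sketch of that default behavior (toy data; the coordinate names mirror the SCHISM output):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'v': (('time', 'nSCHISM_hgrid_node'), np.arange(6).reshape(2, 3))},
    coords={'time': [10, 20], 'nSCHISM_hgrid_node': [0, 1, 2]},
)
stacked = ds.stack(node=('time', 'nSCHISM_hgrid_node'))
# The stacked dim is indexed by a pandas MultiIndex of (time, node) tuples:
print(stacked.indexes['node'][:3])  # [(10, 0), (10, 1), (10, 2)]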

During the subsetting call on this stacked dataset, operations on the node dimension behave differently from the normal case, where the labels are plain numbers (they are now tuples of dates and numbers).

Specifically, the performance bottleneck seems to be in processing the element variable of the dataset, due to the isin check. The right fix (needs testing) is to set the stacked coordinate to simply equal the repeated nodes array. We can also explore what happens if we just remove the element from the dataset (i.e., is this the only bottleneck?), as sketched below.
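A quick way to test that second idea, assuming the connectivity variable is named element as in typical SCHISM outputs (hypothetical name):

# Hypothetical: drop the mesh connectivity before subsetting to check
# whether the isin-based element filtering is the only bottleneck
subset_no_element = subset_time_stack.drop_vars('element')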

@SorooshMani-NOAA (Collaborator, Author) commented Jan 10, 2024

Example of how to do the stacking:

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds = ds.assign_coords(node=np.array([1, 2, 3, 1, 2, 3]), time=np.arange(7))
ds['v'] = (('node', 'time'), np.arange(42).reshape(6, 7))
# Stack without a MultiIndex, then promote the plain node coord to the dim:
ds = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')

Still, since we want to reshape the whole thing back at some point, maybe we need to make sure we don't drop any of the times(?). Either that, or when we reshape back, we can use the still-existing time vector, which is now dependent on node (see the sketch below).
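A sketch of that reshape, continuing from the snippet above (variable names are illustrative):

# `time` survives as a plain coordinate, repeated along the node dimension,
# so the flat array can be reshaped back without an unstack:
n_time = np.unique(ds['time'].values).size  # 7 distinct time steps
v2d = ds['v'].values.reshape(n_time, -1)    # back to (time, node): (7, 6)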

Current working code:

subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)

@SorooshMani-NOAA changed the title from "Subset perturbation timeseries dataset" to "Performance of subsetting perturbation timeseries dataset" on Jan 11, 2024
@FariborzDaneshvar-NOAA (Collaborator) commented:
Thanks @SorooshMani-NOAA, it looks like this approach is working.
Here are the execution times of the subset_dataset() function for a subset of the timeseries based on nodes that will be wet at least once (223,482 out of 701,988):

  • two timesteps stacked: ~1 min
  • entire runtime (468 timesteps) stacked: ~11 min

Execution of the subset_dataset() function for all 701,988 nodes (instead of the 223,482 nodes from the max-elevation subset) completed in ~26 min, but with warning messages about large graph sizes.

Do you recommend conducting two-step subsetting (first based on max elevation, then on the timeseries), or a single step?

@SorooshMani-NOAA (Collaborator, Author) commented:
@FariborzDaneshvar-NOAA based on our discussion yesterday, do you think we should keep this open or can we close it?

@FariborzDaneshvar-NOAA (Collaborator) commented:
@SorooshMani-NOAA The subsetting part is working fine, so I guess you can close this ticket. The memory issue for surrogate expansion can be addressed separately.

@SorooshMani-NOAA (Collaborator, Author) commented:
Let's address that separately: this ticket was about being able to subset at all, while that one is about analyzing the data.
