Performance of subsetting perturbation timeseries dataset #129
Comments
The issue seems to be related to how the stacking operation works. This is how we currently stack:

```python
subset_time_stack = subset_time_chunk.stack(node=('time', 'nSCHISM_hgrid_node'))
```

The stacking of the two dims, by default, creates a multi-index dimension referring to the original stacked dimensions: in this case tuples of `time` and `nSCHISM_hgrid_node`. During the subsetting call on this stacked dataset, that multi-index has to be processed as well. Specifically, the performance bottleneck seems to be during processing of the multi-index.
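For reference, a minimal sketch of the difference between the default stack, which builds a pandas MultiIndex over the stacked dimensions, and stacking with `create_index=False`. The toy data and the variable name `elev` are assumptions for illustration, not the actual SCHISM output:

```python
import numpy as np
import xarray as xr

# Small stand-in for the real dataset; sizes and values are made up.
ds = xr.Dataset(
    {'elev': (('time', 'nSCHISM_hgrid_node'), np.random.rand(4, 5))},
    coords={'time': np.arange(4), 'nSCHISM_hgrid_node': np.arange(5)},
)

# Default: a pandas MultiIndex of (time, node) tuples is created and
# has to be handled again during any later selection along 'node'.
stacked_default = ds.stack(node=('time', 'nSCHISM_hgrid_node'))
print(type(stacked_default.indexes['node']))  # pandas MultiIndex

# With create_index=False no multi-index is built; 'time' and
# 'nSCHISM_hgrid_node' stay as plain 1-D coordinates along 'node'.
stacked_no_index = ds.stack(node=('time', 'nSCHISM_hgrid_node'), create_index=False)
print('node' in stacked_no_index.indexes)  # False
```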
Example of how to do the stacking:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds = ds.assign_coords(node=np.array([1, 2, 3, 1, 2, 3]), time=np.arange(7))
ds['v'] = (('node', 'time'), np.arange(42).reshape(6, 7))
ds = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')
```

Still, since we want to reshape the whole thing back at some point, maybe we need to make sure we don't drop any of the times(?). Either that, or when we reshape back, we can use the still existing `time` coordinate.

Current working code:

```python
subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)
```
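As a minimal, self-contained sketch (toy data with unique node IDs, not the actual model output) of how the `time` and `node` coordinates kept by `create_index=False` could be used to reshape back later: rebuild the multi-index only at that point and unstack. The dataset and names below are hypothetical.

```python
import numpy as np
import xarray as xr

# Toy 2D dataset; unique node IDs so the (time, node) pairs are unique,
# which unstack requires.
ds = xr.Dataset(
    {'v': (('time', 'node'), np.arange(12).reshape(4, 3))},
    coords={'time': np.arange(4), 'node': np.array([10, 20, 30])},
)

# Stack without building a multi-index; 'time' and 'node' remain as
# plain 1-D coordinates along the new 'stacked' dimension.
stacked = ds.stack(stacked=('time', 'node'), create_index=False)

# Reshape back: rebuild the (time, node) multi-index from the kept
# coordinates, then unstack into separate 'time' and 'node' dimensions.
restored = stacked.set_index(stacked=('time', 'node')).unstack('stacked')
print(dict(restored['v'].sizes))  # {'time': 4, 'node': 3}
```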
Thanks @SorooshMani-NOAA, looks like this approach is working.
Execution of … Do you recommend conducting two-step subsetting (first based on max elevation, and then the timeseries), or a single step?
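For illustration, a minimal, self-contained sketch of what the two-step option could look like: first pick the nodes whose maximum elevation exceeds a threshold, then pull the timeseries only for those nodes. The dataset construction, the variable name `elevation`, and the 2.0 threshold are all assumptions, not taken from the actual workflow.

```python
import numpy as np
import xarray as xr

# Hypothetical example data; names, sizes, and threshold are made up.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {'elevation': (('time', 'nSCHISM_hgrid_node'), rng.random((4, 6)) * 3.0)},
    coords={'time': np.arange(4), 'nSCHISM_hgrid_node': np.arange(6)},
)

# Step 1: keep only nodes whose maximum elevation over time exceeds the threshold.
keep_idx = np.flatnonzero((ds['elevation'].max(dim='time') > 2.0).values)

# Step 2: extract the full timeseries only for the kept nodes.
subset_ts = ds['elevation'].isel(nSCHISM_hgrid_node=keep_idx)
print(subset_ts.sizes)
```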
@FariborzDaneshvar-NOAA based on our discussion yesterday, do you think we should keep this open, or can we close it?
@SorooshMani-NOAA The subsetting part is working fine, so I guess you can close this ticket. The memory issue for surrogate expansion can be addressed separately.
Let's address that separately: this issue was about being able to subset, while that one is about analyzing the data.
The combined dataset for all perturbation runs is a 3D dataset with dimensions (run, member no., and time). In order to process it using the existing KLPC method, we first stack the data into 2D (as shown in the borrowed snapshot below).
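A rough, self-contained sketch of the kind of 3D-to-2D stacking described above, assuming the three dimensions correspond to the perturbation run, the grid node, and time; variable names and sizes here are made up:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the combined perturbation output:
# dims (run, time, nSCHISM_hgrid_node); sizes are made up.
ds3d = xr.Dataset(
    {'elevation': (('run', 'time', 'nSCHISM_hgrid_node'), np.random.rand(3, 4, 5))},
    coords={'run': np.arange(3), 'time': np.arange(4), 'nSCHISM_hgrid_node': np.arange(5)},
)

# Collapse time and node into one dimension so each run becomes a single
# row of a 2D (run x time*node) array for the KLPC processing.
ds2d = ds3d.stack(node=('time', 'nSCHISM_hgrid_node'))
print(ds2d['elevation'].shape)  # (3, 20)
```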
Passing this 2D dataset to the subsetting function fails to generate results, even if we just take 2 steps.
Ticket created from https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1876292717
Test code at https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1877314968