Support datasets with differently chunked variables in DatasetToChunks #50


@copybara-service copybara-service bot commented Aug 31, 2022

Support datasets with differently chunked variables in DatasetToChunks

There are two major internal changes:

  1. Key objects from DatasetToChunks can now include different dimensions for different variables when using split_vars=True. This makes it easier to handle large datasets with many variables and different chunking per variable.
  2. Inputs inside the DatasetToChunks pipeline can now be sharded across many tasks. This is important for scalability to large datasets, especially with this change, because the above refactor increases the number of inputs by the number of variables when split_vars=True. Otherwise, the machine launching the pipeline can run into performance issues (e.g., slow speed or running out of memory) when the number of inputs goes into the millions.
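
To illustrate the two changes above, here is a minimal pure-Python sketch. It is not the code in this change. The helper `iter_chunk_keys`, the `(variable, offsets)` tuple standing in for a Key object, and the toy sizes are all hypothetical. It shows how per-variable keys can carry only the dimensions each variable has, and how the resulting inputs might be split into shards instead of being held as one large list.

```python
import itertools
from typing import Iterator, Mapping, Tuple

# Hypothetical stand-in for a Key: (variable name, ((dim, offset), ...)).
ChunkKey = Tuple[str, Tuple[Tuple[str, int], ...]]


def iter_chunk_keys(
    var_dims: Mapping[str, Tuple[str, ...]],
    sizes: Mapping[str, int],
    chunks: Mapping[str, int],
) -> Iterator[ChunkKey]:
    """Yield one key per (variable, chunk), using only that variable's dims."""
    for name, dims in var_dims.items():
        # Chunk start offsets along each dimension this variable has.
        offset_ranges = [range(0, sizes[d], chunks[d]) for d in dims]
        for offsets in itertools.product(*offset_ranges):
            yield name, tuple(zip(dims, offsets))


# Toy dataset: one variable with a "level" dimension and one without.
var_dims = {
    "temperature": ("time", "level"),
    "surface_pressure": ("time",),
}
sizes = {"time": 4, "level": 2}
chunks = {"time": 2, "level": 1}

keys = list(iter_chunk_keys(var_dims, sizes, chunks))

# Round-robin split of the inputs into shards (assumed scheme, for
# illustration only), so no single task has to enumerate every key.
num_shards = 2
shards = [keys[i::num_shards] for i in range(num_shards)]
```

In this sketch the keys for `surface_pressure` never mention `level`, and the total number of keys is the sum of the per-variable chunk counts, which is why the input count grows with the number of variables.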

See the new integration test for a concrete use-case, resembling real model output.

Also revise the warning message in the README to be a bit friendlier.

Fixes #43

@copybara-service copybara-service bot force-pushed the test_471347485 branch 3 times, most recently from b2940ec to b607f99 Compare September 1, 2022 21:14
@copybara-service copybara-service bot changed the title Omit unchunked dimensions from Key objects created with DatasetToChunks Support datasets with differently chunked variables in DatasetToChunks Sep 1, 2022
@copybara-service copybara-service bot force-pushed the test_471347485 branch 6 times, most recently from dd24059 to 53e33d2 Compare September 3, 2022 04:39
@copybara-service copybara-service bot closed this Sep 3, 2022
@copybara-service copybara-service bot deleted the test_471347485 branch September 3, 2022 04:40
Successfully merging this pull request may close these issues.

Consider omitting unchunked dimensions from Key objects created with DatasetToChunks