Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_icechunk for xarray #357

Merged
merged 43 commits into from
Nov 26, 2024
Merged

Add to_icechunk for xarray #357

merged 43 commits into from
Nov 26, 2024

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Oct 29, 2024

Adds icechunk.xarray.to_icechunk.

TODO:

I chose to not update dask_write.py because it does show a different pattern that uses dask as an orchestrator and not dask.array.

@dcherian
Copy link
Contributor Author

dcherian commented Oct 31, 2024

This adds two APIs. One is "easy" and does everything in one step, the other allows extra control over initializing the store, and subsequent writes.

  1. icechunk.xarray.to_icechunk(dataset, store=store, ...) which is like ds.to_zarr without the compute kwarg.
  2. To "initialize" the repo, and execute a distributed writer in a separate step, the API is:
from icechunk.xarray import XarrayDatasetWriter

writer = XarrayDatasetWriter(ds, store=store)
writer.write_metadata(group="new2", mode="w") # write metadata
writer.write_eager() # write in-memory arrays
writer.write_lazy() # eagerly write dask arrays

The write_eager method could execute an async write of all the eager in-memory coordinate arrays, but does not do so right now.


Write metadata for arrays in one step. This "initializes" the store but has not written an real values yet.
```python
writer.write_metadata(group="new2", mode="w")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about appends & region writes here.

@TomNicholas
Copy link
Contributor

Naive question: if icechunk.xarray.to_icechunk is functionally the same as ds.to_zarr(icechunkstore) then why does it need to exist?

Also should we call icechunk.xarray.to_icechunk from inside virtualizarr's .virtualize.to_icechunk accessor method? Since that can also write non-virtual variables.

cc @mpiannucci

@dcherian
Copy link
Contributor Author

dcherian commented Nov 5, 2024

is functionally the same as

It is not. to_zarr will not work for distributed writes. It would write the chunks, but the commit wouldn't be meaningful.

@TomNicholas
Copy link
Contributor

.to_zarr will not work for distributed writes. It would write the chunks, but the commit wouldn't be meaningful.

Can you expand on what this means in practice? If I have a big distributed dask array, I call to .to_zarr, then commit, what exactly would go wrong?

@dcherian
Copy link
Contributor Author

Can you expand on what this means in practice?

The store object you are holding has no record of the distributed writes because those changes were never communicated back. So you'll receive all fill values when trying to read: #383

@dcherian dcherian marked this pull request as ready for review November 22, 2024 22:06
latest_layer = f"{aggprefix}-{depth}-{token}"

layers[latest_layer] = partial_reduce(
aggregate, keys, layer_name=latest_layer, split_every=split_every
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥵

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a masterpiece Deepak!

Reviewed the code and tests and didn't spot any red flags. I think I get the gist of how it works, but I'm not confident I'd be able to identify any bugs there anyway. The tests inspire great confidence.

I left a bunch of suggestions on the docs.

docs/docs/icechunk-python/dask.md Outdated Show resolved Hide resolved
docs/docs/icechunk-python/dask.md Show resolved Hide resolved
docs/docs/icechunk-python/dask.md Outdated Show resolved Hide resolved
docs/docs/icechunk-python/dask.md Outdated Show resolved Hide resolved
```
Note that the chunks in the store are a divisor of the dask chunks. This means each individual
write task is independent, and will not conflict. It is your responsibility to ensure that such
conflicts are avoided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've done so much work in Xarray around safe chunks, region writes, etc. It's a pity that we can't reuse any of that here.

docs/docs/icechunk-python/dask.md Outdated Show resolved Hide resolved
docs/docs/icechunk-python/dask.md Outdated Show resolved Hide resolved
icechunk-python/python/icechunk/dask.py Outdated Show resolved Hide resolved
icechunk-python/tests/test_xarray.py Outdated Show resolved Hide resolved
@dcherian dcherian merged commit 674864d into main Nov 26, 2024
2 checks passed
@dcherian dcherian deleted the dask-distributed branch November 26, 2024 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants