chunks management with datetime64 and timedelta64 datatype #8230

Closed
1 of 4 tasks
effeminati opened this issue Sep 25, 2023 · 6 comments · Fixed by #8575
Labels
bug topic-backends topic-zarr Related to zarr storage library

Comments

@effeminati

What happened?

I need to perform operations with coordinates or data variables of datetime64[ns] or timedelta64[ns] data types.

When I save the dataset or data array in Zarr format, to_zarr() changes the chunk size, even if I explicitly specify it in the encoding. I must use the same chunk size on disk and in memory because I write each portion of the file with the region option of xarray.Dataset.to_zarr().

In addition, when I try to exploit parallelism, I encounter the error message "inconsistent chunk size".
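To illustrate why matching chunk sizes matter for region writes, here is a minimal sketch. It assumes that writing regions in parallel is only safe when every region covers whole on-disk chunks; the helper name `region_is_chunk_aligned` is hypothetical and not part of xarray or zarr.

```python
def region_is_chunk_aligned(start, stop, chunk_size, dim_length):
    """Return True if the half-open region [start, stop) covers whole chunks only.

    Hypothetical helper for illustration; not an xarray or zarr API.
    """
    starts_on_boundary = start % chunk_size == 0
    # The last chunk of a dimension may be shorter than chunk_size.
    ends_on_boundary = stop % chunk_size == 0 or stop == dim_length
    return starts_on_boundary and ends_on_boundary


# Intended layout: 1024 elements along y, stored in chunks of 512.
print(region_is_chunk_aligned(512, 1024, chunk_size=512, dim_length=1024))

# Layout actually found on disk for the datetime64 coordinate: one chunk of 1024.
print(region_is_chunk_aligned(512, 1024, chunk_size=1024, dim_length=1024))
```

With the intended chunk size of 512 the second half of the dimension is a self-contained region, whereas with a single chunk of 1024 the same region shares its chunk with the first half.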

What did you expect to happen?

I expect the input chunk size to be preserved when writing to and reading from Zarr.

Minimal Complete Verifiable Example

import xarray as xr
import dask.array as da
import numpy as np

# define an empty dataarray
ds = xr.DataArray(da.empty(shape=(1_024,2_048), dtype='float64', chunks=512), dims=['y','x'])

# define coordinates (y is datetime64)
ds = ds.assign_coords({'azimuth_time': (['y'], np.arange(1024).astype('datetime64[ns]')), 'slant_range_time': (['x'], np.arange(2048).astype('float64'))})

# define chunking
ds = ds.chunk({'x': 512, 'y': 512})

# save dataarray
ds.to_dataset(name='aaa').to_zarr('test.zarr')

# re-read dataarray
ds1 = xr.open_dataset('test.zarr', engine='zarr', chunks={})

# the chunk sizes of the coordinate azimuth time differ
print(ds1.aaa.azimuth_time.chunks, ds.azimuth_time.chunks)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

In [11]: print(ds1.aaa.azimuth_time.chunks, ds.azimuth_time.chunks)
((1024,),) ((512, 512),)
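
For readers unfamiliar with the dask chunk notation in this output: each dimension gets one tuple of block lengths. The sketch below uses only the two values printed above to show that both layouts cover the same 1024 elements but partition them differently. The helper names are hypothetical.

```python
def describe_chunks(chunks):
    """Summarise dask-style chunks: one tuple of block lengths per dimension."""
    return [
        {"length": sum(blocks), "n_blocks": len(blocks), "sizes": tuple(blocks)}
        for blocks in chunks
    ]


def same_partition(expected, actual):
    """Return True only if both layouts split every dimension identically."""
    return tuple(map(tuple, expected)) == tuple(map(tuple, actual))


in_memory = ((512, 512),)  # ds.azimuth_time.chunks before writing
re_read = ((1024,),)       # ds1.aaa.azimuth_time.chunks after re-reading

print(describe_chunks(in_memory))
print(describe_chunks(re_read))
print(same_partition(in_memory, re_read))
```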

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1045-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: 1.4.1
IPython: 8.14.0
sphinx: 7.0.1

@effeminati effeminati added bug needs triage Issue that has not been reviewed by xarray team member labels Sep 25, 2023
@welcome

welcome bot commented Sep 25, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@malmans2
Contributor

malmans2 commented Sep 28, 2023

I think the problem is here:

var = conventions.encode_cf_variable(var, name=name)

The CFDatetimeCoder returns unchunked variables.

Not sure if there's an easy solution though.
I tried to restore the chunking of var in the line above.
It fixes the MRE, but it breaks some tests.

Here is a similar MRE I've been using:

import xarray as xr
import tempfile

da_expected = xr.DataArray(range(2), name="foo").astype("datetime64[ns]").chunk(1)
with tempfile.TemporaryDirectory() as tmpdir:
    da_expected.to_zarr(tmpdir)
    da_actual = xr.open_dataarray(tmpdir, engine="zarr", chunks={})
    assert da_expected.chunks == da_actual.chunks == ((1, 1),), da_actual.chunks

@dcherian dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 28, 2023
@dcherian
Contributor

Is it different if you specify chunks in the encoding kwarg of to_zarr?

@malmans2
Contributor

This is the syntax, right?

     da_expected.to_zarr(tmpdir, encoding={"foo": {"chunks": (1, 1)}})

Nope, same error.

@dcherian dcherian added the topic-zarr Related to zarr storage library label Sep 28, 2023
@malmans2
Contributor

My bad! I specified the wrong encoding. Explicitly passing the chunks through encoding works:

import xarray as xr
import tempfile

da_expected = xr.DataArray(range(2), name="foo").astype("datetime64[ns]").chunk(1)
with tempfile.TemporaryDirectory() as tmpdir:
    da_expected.to_zarr(tmpdir, encoding={"foo": {"chunks": (1, )}})
    da_actual = xr.open_dataarray(tmpdir, engine="zarr", chunks={})
    assert da_expected.chunks == da_actual.chunks == ((1, 1),), da_actual.chunks

I think I know how to fix it in the zarr backend, I'll take a look tomorrow.
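
Until a fix lands, the workaround above could be generalised to several variables along these lines. This is only a sketch: `encoding_from_chunks` is a hypothetical helper, the chunk values are illustrative, and it assumes each variable has regular chunks so that the first block length per dimension is the intended on-disk chunk size.

```python
def encoding_from_chunks(chunks_by_name):
    """Build an encoding mapping that pins on-disk chunks to the in-memory ones.

    chunks_by_name maps a variable name to its dask-style chunks (one tuple of
    block lengths per dimension), or to None for variables that are not chunked.
    Hypothetical helper for illustration; not an xarray API.
    """
    encoding = {}
    for name, chunks in chunks_by_name.items():
        if chunks is None:
            # Not a chunked variable: leave its encoding untouched.
            continue
        # Assumes regular chunking, so the first block sets the chunk size.
        encoding[name] = {"chunks": tuple(blocks[0] for blocks in chunks)}
    return encoding


# Illustrative values modelled on the example in the original report.
chunks_by_name = {
    "aaa": ((512, 512), (512, 512, 512, 512)),
    "azimuth_time": ((512, 512),),
    "unchunked_var": None,  # hypothetical variable without dask chunks
}
encoding = encoding_from_chunks(chunks_by_name)
print(encoding)
```

With xarray, the input mapping could presumably be built as `{name: var.chunks for name, var in ds.variables.items()}` and the result passed as `encoding=` to `to_zarr()`. Note that this thread only confirms the explicit-encoding pattern for the single-variable example above.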

@dcherian
Contributor

> The CFDatetimeCoder returns unchunked variables.

This is this longstanding issue: #7132 (comment)
