chunks management with datetime64 and timedelta64 datatype #8230

effeminati · 2023-09-25T09:26:30Z

What happened?

I need to perform operations with coordinates or data variables of datetime64[ns] or timedelta64[ns] data types.

Once I save the dataset or data array into Zarr format, the chunk size is arbitrarily modified by the to_zarr() function, even if I explicitly specify the encoding. It is mandatory to use the same chunk size for both disk and memory because I save each portion of the file using the region option of xarray.Dataset.to_zarr().

In addition, when I try to exploit parallelism, I encounter the error message "inconsistent chunk size".
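To illustrate why matching chunk sizes matter for region writes, here is a minimal sketch that derives one write region per chunk. It assumes a 1024-long `y` dimension split into 512-row chunks; `region_slices` is a hypothetical helper name, not part of xarray.

```python
def region_slices(chunk_sizes):
    """Turn chunk sizes along one dimension, e.g. (512, 512), into contiguous slices."""
    slices = []
    start = 0
    for size in chunk_sizes:
        # Each region covers exactly one in-memory chunk.
        slices.append(slice(start, start + size))
        start += size
    return slices


# Assumed chunking along "y": two chunks of 512 rows each.
y_slices = region_slices((512, 512))
```

Each slice `sl` could then be passed as `region={"y": sl}` to `to_zarr()` when writing the matching `isel(y=sl)` subset into an already initialised store. If the on-disk chunk spans 1024 rows while each region covers only 512, two writers would touch the same Zarr chunk, which is presumably why the chunk sizes need to agree.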

What did you expect to happen?

I expect the input chunk size to be preserved when writing to and reading from Zarr.

Minimal Complete Verifiable Example

import xarray as xr
import dask.array as da
import numpy as np

# define an empty dataarray
ds = xr.DataArray(da.empty(shape=(1_024,2_048), dtype='float64', chunks=512), dims=['y','x'])

# define coordinates (y is datetime64)
ds = ds.assign_coords({'azimuth_time': (['y'], np.arange(1024).astype('datetime64[ns]')), 'slant_range_time': (['x'], np.arange(2048).astype('float64'))})

# define chunking
ds = ds.chunk({'x': 512, 'y': 512})

# save dataarray
ds.to_dataset(name='aaa').to_zarr('test.zarr')

# re-read dataarray
ds1 = xr.open_dataset('test.zarr', engine='zarr', chunks={})

# the chunk sizes of the coordinate azimuth time differ
print(ds1.aaa.azimuth_time.chunks, ds.azimuth_time.chunks)

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

In [11]: print(ds1.aaa.azimuth_time.chunks, ds.azimuth_time.chunks)
((1024,),) ((512, 512),)

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1045-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: 1.4.1
IPython: 8.14.0
sphinx: 7.0.1

welcome · 2023-09-25T09:26:32Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

malmans2 · 2023-09-28T15:22:49Z

I think the problem is here:

xarray/xarray/backends/zarr.py

Line 309 in c3b5ead

var = conventions.encode_cf_variable(var, name=name)

The CFDatetimeCoder returns unchunked variables.

Not sure if there's an easy solution though.
I tried to restore the chunking of var in the line above.
It fixes the MRE, but it breaks some tests.

Here is a similar MRE I've been using:

import xarray as xr
import tempfile

da_expected = xr.DataArray(range(2), name="foo").astype("datetime64[ns]").chunk(1)
with tempfile.TemporaryDirectory() as tmpdir:
    da_expected.to_zarr(tmpdir)
    da_actual = xr.open_dataarray(tmpdir, engine="zarr", chunks={})
    assert da_expected.chunks == da_actual.chunks == ((1, 1),), da_actual.chunks

dcherian · 2023-09-28T15:27:56Z

Is it different if you specify chunks in the encoding kwarg of to_zarr?

malmans2 · 2023-09-28T15:32:49Z

This is the syntax, right?

     da_expected.to_zarr(tmpdir, encoding={"foo": {"chunks": (1, 1)}})

Nope, same error.

malmans2 · 2023-09-28T20:49:36Z

My bad! I specified the wrong encoding. Explicitly passing the chunks through encoding works:

import xarray as xr
import tempfile

da_expected = xr.DataArray(range(2), name="foo").astype("datetime64[ns]").chunk(1)
with tempfile.TemporaryDirectory() as tmpdir:
    da_expected.to_zarr(tmpdir, encoding={"foo": {"chunks": (1, )}})
    da_actual = xr.open_dataarray(tmpdir, engine="zarr", chunks={})
    assert da_expected.chunks == da_actual.chunks == ((1, 1),), da_actual.chunks

I think I know how to fix it in the zarr backend, I'll take a look tomorrow.
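Until a backend fix lands, the explicit-encoding workaround could be applied to several variables at once. The following is a minimal sketch, not the fix itself: `zarr_chunk_encoding` is a hypothetical helper name, and it assumes the first chunk along each dimension is the intended on-disk chunk size (Zarr chunks are uniform, apart from a possibly smaller last one).

```python
def zarr_chunk_encoding(dask_chunks_by_name):
    """Map {name: dask-style chunks} to {name: {"chunks": on-disk chunk shape}}.

    Dask-style chunks look like ((512, 512), (512, 512, 512, 512)):
    one tuple of chunk sizes per dimension.
    """
    encoding = {}
    for name, chunks in dask_chunks_by_name.items():
        if chunks is None:
            # Not dask-backed: leave the chunking to the backend.
            continue
        # Assumption: the first chunk along each dimension is the intended size.
        encoding[name] = {"chunks": tuple(dim_chunks[0] for dim_chunks in chunks)}
    return encoding


# Chunks as reported by `.chunks` for a 1-D coordinate split in two;
# the unchunked entry is skipped.
example = zarr_chunk_encoding({"azimuth_time": ((512, 512),), "unchunked": None})
```

With a dataset `ds`, the input could be built as `{name: var.chunks for name, var in ds.variables.items()}` and the result passed as `encoding=` to `to_zarr()`.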

dcherian · 2023-11-12T04:58:43Z

"The CFDatetimeCoder returns unchunked variables."

This is a longstanding issue: #7132 (comment)

effeminati added the bug and needs triage labels Sep 25, 2023

dcherian added the topic-backends label and removed the needs triage label Sep 28, 2023

dcherian added the topic-zarr label Sep 28, 2023

malmans2 mentioned this issue Sep 28, 2023: fix zarr datetime64 chunks #8253 (closed)

spencerkclark mentioned this issue Dec 31, 2023: Add chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta #8575 (merged)

dcherian closed this as completed in #8575 Jan 29, 2024
