
Writing a datetime coord ignores chunks #8432

Closed
5 tasks done
max-sixty opened this issue Nov 9, 2023 · 5 comments · Fixed by #8575
Labels
bug topic-zarr Related to zarr storage library

Comments

max-sixty (Collaborator) commented Nov 9, 2023

What happened?

When writing a coord with a datetime dtype, the chunking on the coord is ignored and the whole coord is written as a single chunk (or at least it can be; I haven't done enough to confirm whether it always is).

This can be quite inconvenient. Any attempt to write to that dataset from a distributed process will fail, since each process will attempt to write data belonging to another process's chunk rather than only its own region. Less severely, the chunks won't be unified.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset('air_temperature')

(
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=10)
    .to_zarr("foo.zarr", mode="w")
)


xr.open_zarr('foo.zarr')


# Note the `chunksize=(2920,)` vs `chunksize=(10,)`!


<xarray.Dataset>
Dimensions:   (a: 1000, time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>  # here
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

xr.open_zarr('foo.zarr').chunks
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 xr.open_zarr('foo.zarr').chunks

File /opt/homebrew/lib/python3.9/site-packages/xarray/core/dataset.py:2567, in Dataset.chunks(self)
   2552 @property
   2553 def chunks(self) -> Mapping[Hashable, tuple[int, ...]]:
   2554     """
   2555     Mapping from dimension names to block lengths for this dataset's data, or None if
   2556     the underlying data is not a dask array.
   (...)
   2565     xarray.unify_chunks
   2566     """
-> 2567     return get_chunksizes(self.variables.values())

File /opt/homebrew/lib/python3.9/site-packages/xarray/core/common.py:2013, in get_chunksizes(variables)
   2011         for dim, c in v.chunksizes.items():
   2012             if dim in chunks and c != chunks[dim]:
-> 2013                 raise ValueError(
   2014                     f"Object has inconsistent chunks along dimension {dim}. "
   2015                     "This can be fixed by calling unify_chunks()."
   2016                 )
   2017             chunks[dim] = c
   2018 return Frozen(chunks)

ValueError: Object has inconsistent chunks along dimension time. This can be fixed by calling unify_chunks().
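As the error message suggests, the in-memory inconsistency can be resolved with unify_chunks(). A minimal synthetic sketch (independent of the zarr store above) that reproduces the inconsistent-chunks state and fixes it:

```python
import numpy as np
import xarray as xr

# Two variables along "time", deliberately chunked differently.
ds = xr.Dataset(
    {"a": ("time", np.arange(20.0)), "b": ("time", np.arange(20.0))}
)
ds["a"] = ds.a.chunk(time=10)
ds["b"] = ds.b.chunk(time=5)

try:
    ds.chunks  # raises: inconsistent chunks along 'time'
except ValueError as e:
    print(e)

# unify_chunks() rechunks to a common refinement per dimension.
unified = ds.unify_chunks()
print(unified.chunks)
```

This only papers over the symptom on the read side, of course; the underlying issue is that the coord was written with the wrong zarr chunks in the first place.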

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.18 (main, Nov 2 2023, 16:51:22)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.5.0
distributed: 2023.5.0
matplotlib: 3.6.0
cartopy: None
seaborn: 0.12.2
numbagg: 0.6.0
fsspec: 2022.8.2
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.9.22
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.0
mypy: 1.6.1
IPython: 8.14.0
sphinx: 5.2.1

max-sixty added the bug and topic-zarr labels Nov 9, 2023
kmuehlbauer (Contributor) commented

@max-sixty: time and time2 are not chunked before you write out to zarr.

ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.chunk().expand_dims(a=1000).assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
ds = ds.chunk(time=10)
print(ds)
<xarray.Dataset>
Dimensions:   (lat: 25, a: 1000, time: 2920, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

rabernat (Contributor) commented Nov 9, 2023

@kmuehlbauer is right. I think the bug here (is it really a bug?) is that ds.chunk(time=10) doesn't actually chunk the time2 variable. I think this is because it is referencing the same underlying data as time.
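One quick way to probe this hypothesis (a minimal synthetic sketch, not the tutorial dataset) is to check whether the coord actually picked up chunks after chunk():

```python
import numpy as np
import xarray as xr

times = np.array(["2013-01-01", "2013-01-02"], dtype="datetime64[ns]")
ds = (
    xr.Dataset({"a": ("time", [1.0, 2.0])}, coords={"time": times})
    .assign_coords(time2=lambda x: x.time)  # time2 shares time's data
    .chunk(time=1)
)

print(ds.a.chunks)      # the data variable is chunked
print(ds.time2.chunks)  # None here would mean time2 was left unchunked
```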

A more general way to configure Zarr chunks, which is not dependent on Dask, is to use encoding, e.g.

store = zarr.storage.MemoryStore()

ds = xr.tutorial.load_dataset('air_temperature')

ds_to_save = (
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
)

chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding
}

ds_to_save.to_zarr(store, mode="w", encoding=encoding)

Unfortunately, when I do this, I get an obscure blosc error 😱

File /srv/conda/envs/notebook/lib/python3.11/site-packages/numcodecs/compat.py:121, in ensure_contiguous_ndarray_like(buf, max_buffer_size, flatten)
    119 if max_buffer_size is not None and arr.nbytes > max_buffer_size:
    120     msg = "Codec does not support buffers of > {} bytes".format(max_buffer_size)
--> 121     raise ValueError(msg)
    123 return arr

ValueError: Codec does not support buffers of > 2147483647 bytes

rabernat (Contributor) commented Nov 9, 2023

OK, the blosc error was a red herring; it occurred only because the chunks were too big. I made the example a little smaller (and used different Dask and Zarr chunks) and it works great.

import xarray as xr
import zarr
import numpy as np

import tempfile

#store = tempfile.TemporaryDirectory().name
store = zarr.storage.MemoryStore()

ds = xr.tutorial.load_dataset('air_temperature')

ds_to_save = (
    ds.chunk()
    .expand_dims(a=10)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=100)
)

# use a divisor of the dask chunks or you will get an error
chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding
}


ds_to_save.to_zarr(store, mode="w", encoding=encoding)
xr.open_zarr(store)

dcherian (Contributor) commented Nov 9, 2023

Fixed by #8253?

max-sixty (Collaborator, Author) commented

time and time2 are not chunked before your write out to zarr.

I think the bug here (is it really a bug?) is that ds.chunk(time=10) doesn't actually chunk the time2 variable. I think this is because it is referencing the same underlying data as time.

I don't think either of these is quite right — if I make a version with time3 that doesn't reference time and is chunked, we still get the bug:

(
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time3=lambda x: (("time"), np.full(ds.sizes["time"], x.time[0])),
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=10)
    .to_zarr("foo.zarr", mode="w")
)

xr.open_zarr('foo.zarr')
<xarray.Dataset>
Dimensions:   (a: 1000, time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time3     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

The object being written is:

<xarray.Dataset>
Dimensions:   (lat: 25, a: 1000, time: 2920, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time3     (time) datetime64[ns] dask.array<chunksize=(10,), meta=np.ndarray>  # chunked
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
