
Writing a datetime coord ignores chunks #8432

Closed
5 tasks done
max-sixty opened this issue Nov 9, 2023 · 5 comments · Fixed by #8575
Labels
bug topic-zarr Related to zarr storage library

Comments

max-sixty (Collaborator) commented Nov 9, 2023

What happened?

When writing a coord with a datetime dtype, the chunking on the coord is ignored and the whole coord is written as a single chunk (or at least it can be; I haven't done enough to confirm whether it always is).

This can be quite inconvenient. Any attempt to write to that dataset from a distributed process will fail, since each process will attempt to write data belonging to another process's chunk rather than only its own region. Less severely, the chunks won't be unified.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset('air_temperature')

(
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=10)
    .to_zarr("foo.zarr", mode="w")
)


xr.open_zarr('foo.zarr')


# Note the `chunksize=(2920,)` vs `chunksize=(10,)`!


<xarray.Dataset>
Dimensions:   (a: 1000, time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>  # here
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

xr.open_zarr('foo.zarr').chunks
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 xr.open_zarr('foo.zarr').chunks

File /opt/homebrew/lib/python3.9/site-packages/xarray/core/dataset.py:2567, in Dataset.chunks(self)
   2552 @property
   2553 def chunks(self) -> Mapping[Hashable, tuple[int, ...]]:
   2554     """
   2555     Mapping from dimension names to block lengths for this dataset's data, or None if
   2556     the underlying data is not a dask array.
   (...)
   2565     xarray.unify_chunks
   2566     """
-> 2567     return get_chunksizes(self.variables.values())

File /opt/homebrew/lib/python3.9/site-packages/xarray/core/common.py:2013, in get_chunksizes(variables)
   2011         for dim, c in v.chunksizes.items():
   2012             if dim in chunks and c != chunks[dim]:
-> 2013                 raise ValueError(
   2014                     f"Object has inconsistent chunks along dimension {dim}. "
   2015                     "This can be fixed by calling unify_chunks()."
   2016                 )
   2017             chunks[dim] = c
   2018 return Frozen(chunks)

ValueError: Object has inconsistent chunks along dimension time. This can be fixed by calling unify_chunks().
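As the error message suggests, the in-memory inconsistency can be resolved with unify_chunks(). A minimal synthetic sketch (independent of the zarr store above) that reproduces the inconsistent-chunks state and fixes it:

```python
import numpy as np
import xarray as xr

# Two variables along "time", deliberately chunked differently.
ds = xr.Dataset(
    {"a": ("time", np.arange(20.0)), "b": ("time", np.arange(20.0))}
)
ds["a"] = ds.a.chunk(time=10)
ds["b"] = ds.b.chunk(time=5)

try:
    ds.chunks  # raises: inconsistent chunks along 'time'
except ValueError as e:
    print(e)

# unify_chunks() rechunks to a common refinement per dimension.
unified = ds.unify_chunks()
print(unified.chunks)
```

This only papers over the symptom on the read side, of course; the underlying issue is that the coord was written with the wrong zarr chunks in the first place.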

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.18 (main, Nov 2 2023, 16:51:22)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.5.0
distributed: 2023.5.0
matplotlib: 3.6.0
cartopy: None
seaborn: 0.12.2
numbagg: 0.6.0
fsspec: 2022.8.2
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.9.22
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.0
mypy: 1.6.1
IPython: 8.14.0
sphinx: 5.2.1

max-sixty added the bug and topic-zarr labels Nov 9, 2023
kmuehlbauer (Contributor) commented

@max-sixty: time and time2 are not chunked before you write out to zarr.

ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.chunk().expand_dims(a=1000).assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
ds = ds.chunk(time=10)
print(ds)
<xarray.Dataset>
Dimensions:   (lat: 25, a: 1000, time: 2920, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

rabernat (Contributor) commented Nov 9, 2023

@kmuehlbauer is right. I think the bug here (is it really a bug?) is that ds.chunk(time=10) doesn't actually chunk the time2 variable. I think this is because it is referencing the same underlying data as time.
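One quick way to probe this hypothesis (a minimal synthetic sketch, not the tutorial dataset) is to check whether the coord actually picked up chunks after chunk():

```python
import numpy as np
import xarray as xr

times = np.array(["2013-01-01", "2013-01-02"], dtype="datetime64[ns]")
ds = (
    xr.Dataset({"a": ("time", [1.0, 2.0])}, coords={"time": times})
    .assign_coords(time2=lambda x: x.time)  # time2 shares time's data
    .chunk(time=1)
)

print(ds.a.chunks)      # the data variable is chunked
print(ds.time2.chunks)  # None here would mean time2 was left unchunked
```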

A more general way to configure Zarr chunks, which is not dependent on Dask, is to use encoding, e.g.

store = zarr.storage.MemoryStore()

ds = xr.tutorial.load_dataset('air_temperature')

ds_to_save = (
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
)

chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding
}

ds_to_save.to_zarr(store, mode="w", encoding=encoding)

Unfortunately, when I do this, I get an obscure blosc error 😱

File /srv/conda/envs/notebook/lib/python3.11/site-packages/numcodecs/compat.py:121, in ensure_contiguous_ndarray_like(buf, max_buffer_size, flatten)
    119 if max_buffer_size is not None and arr.nbytes > max_buffer_size:
    120     msg = "Codec does not support buffers of > {} bytes".format(max_buffer_size)
--> 121     raise ValueError(msg)
    123 return arr

ValueError: Codec does not support buffers of > 2147483647 bytes

rabernat (Contributor) commented Nov 9, 2023

OK, the blosc error was a red herring; it occurred only because the chunks were too big. I made the example a little smaller (and used different Dask and Zarr chunks) and it works great.

import xarray as xr
import zarr
import numpy as np

import tempfile

#store = tempfile.TemporaryDirectory().name
store = zarr.storage.MemoryStore()

ds = xr.tutorial.load_dataset('air_temperature')

ds_to_save = (
    ds.chunk()
    .expand_dims(a=10)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=100)
)

# use a divisor of the dask chunks or you will get an error
chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding
}


ds_to_save.to_zarr(store, mode="w", encoding=encoding)
xr.open_zarr(store)

dcherian (Contributor) commented Nov 9, 2023

Fixed by #8253?

max-sixty (Collaborator, Author) commented

time and time2 are not chunked before your write out to zarr.

I think the bug here (is it really a bug?) is that ds.chunk(time=10) doesn't actually chunk the time2 variable. I think this is because it is referencing the same underlying data as time.

I don't think either of these is quite right — if I make a version with time3 that doesn't reference time and is chunked, we still get the bug:

(
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time3=lambda x: (("time"), np.full(ds.sizes["time"], x.time[0])),
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=10)
    .to_zarr("foo.zarr", mode="w")
)

xr.open_zarr('foo.zarr')
<xarray.Dataset>
Dimensions:   (a: 1000, time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time3     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

The object being written is:

<xarray.Dataset>
Dimensions:   (lat: 25, a: 1000, time: 2920, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time3     (time) datetime64[ns] dask.array<chunksize=(10,), meta=np.ndarray>  # chunked
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
