Writing a datetime coord ignores chunks #8432
@max-sixty:

import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.chunk().expand_dims(a=1000).assign_coords(
    time2=lambda x: x.time,
    time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
)
ds = ds.chunk(time=10)
print(ds)

<xarray.Dataset>
Dimensions:   (lat: 25, a: 1000, time: 2920, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
@kmuehlbauer is right. I think the bug here (is it really a bug?) is that […]

A more general way to configure Zarr chunks, one that does not depend on Dask, is to use encoding, e.g.:

import numpy as np
import xarray as xr
import zarr

store = zarr.storage.MemoryStore()
ds = xr.tutorial.load_dataset('air_temperature')
ds_to_save = (
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
)
chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding,
}
ds_to_save.to_zarr(store, mode="w", encoding=encoding)

Unfortunately, when I do this, I get an obscure Blosc error 😱
OK, the Blosc error was a red herring: it happened only because the chunks were too big. I made the example a little smaller (and used different Dask and Zarr chunks) and it works great.

import xarray as xr
import zarr
import numpy as np
import tempfile

# store = tempfile.TemporaryDirectory().name
store = zarr.storage.MemoryStore()
ds = xr.tutorial.load_dataset('air_temperature')
ds_to_save = (
    ds.chunk()
    .expand_dims(a=10)
    .assign_coords(
        time2=lambda x: x.time,
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=100)
)
# use a divisor of the Dask chunks or you will get an error
chunk_encoding = {"chunks": 10}
encoding = {
    "time": chunk_encoding,
    "time2": chunk_encoding,
    "time_int": chunk_encoding,
}
ds_to_save.to_zarr(store, mode="w", encoding=encoding)
xr.open_zarr(store)
Fixed by #8253?
I don't think either of these is quite right — if I make a version with

(
    ds.chunk()
    .expand_dims(a=1000)
    .assign_coords(
        time2=lambda x: x.time,
        time3=lambda x: (("time"), np.full(ds.sizes["time"], x.time[0])),
        time_int=lambda x: (("time"), np.full(ds.sizes["time"], 1)),
    )
    .chunk(time=10)
    .to_zarr("foo.zarr", mode="w")
)

then opening it back up shows the datetime coords stored as a single chunk while the int coord keeps its chunking:

xr.open_zarr('foo.zarr')

<xarray.Dataset>
Dimensions:   (a: 1000, time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat       (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    time2     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time3     (time) datetime64[ns] dask.array<chunksize=(2920,), meta=np.ndarray>  # here
    time_int  (time) int64 dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: a
Data variables:
    air       (a, time, lat, lon) float32 dask.array<chunksize=(1000, 10, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

The object being written is: […]
What happened?
When writing a coord with a datetime dtype, the chunking on the coord is ignored and the whole coord is written as a single chunk (or at least it can be; I haven't done enough to confirm whether it always will be).
This can be quite inconvenient. Any attempt to write to that dataset from a distributed process will raise errors, since each process will be attempting to write another process's data rather than only its own region. And, less severely, the chunks won't be unified.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.18 (main, Nov 2 2023, 16:51:22)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None
xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.5.0
distributed: 2023.5.0
matplotlib: 3.6.0
cartopy: None
seaborn: 0.12.2
numbagg: 0.6.0
fsspec: 2022.8.2
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.9.22
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.0
mypy: 1.6.1
IPython: 8.14.0
sphinx: 5.2.1