Rezarring an opened dataset with object dtype fails due to added filter #7576

Closed
4 tasks done
saschahofmann opened this issue Mar 2, 2023 · 2 comments
Labels: bug, needs triage (issue that has not been reviewed by an xarray team member)

saschahofmann commented Mar 2, 2023

What happened?

I am trying to save an xr.Dataset that I read and processed from another saved zarr file, but it fails with this error:

numcodecs/vlen.pyx in numcodecs.vlen.VLenUTF8.encode()
TypeError: expected unicode string, found 3

It seems that the first time the dataset is saved, xarray/zarr adds a VLenUTF8 filter to the encoding of one of the dimensions. If I pop the filters key from that variable's encoding on the opened dataset, I can resave the file.
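
The workaround described above can be generalised to every variable in a dataset. The following is a minimal sketch, not part of xarray: `drop_encoding_keys` is a hypothetical helper name, and it assumes only that the object exposes a `variables` mapping whose values carry a dict-like `encoding`, as `xr.Dataset` does. It mutates the encodings in place, just like the `pop` call in the example below.

```python
def drop_encoding_keys(ds, keys=("filters",)):
    """Remove the given keys from every variable's encoding, in place.

    Hypothetical helper (not an xarray API). `ds` is assumed to expose
    `ds.variables`, a mapping of name -> object with a dict `encoding`.
    Returns a dict of what was removed, so it can be inspected or restored.
    """
    removed = {}
    for name, var in ds.variables.items():
        for key in keys:
            if key in var.encoding:
                # pop() mutates the variable's encoding dict directly
                removed.setdefault(name, {})[key] = var.encoding.pop(key)
    return removed
```

Under these assumptions, calling `drop_encoding_keys(opened)` before `opened.to_zarr(...)` would play the same role as popping `filters` from a single variable by hand.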

I can also safely save to netcdf (which makes sense since this encoding is probably ignored then).

What did you expect to happen?

I should be able to open and resave a file to zarr.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
da = xr.DataArray(np.array(['126469-423', '130042-0-10046', '120259-10343'], dtype='object'), dims=['asset'], name='asset')

# First write succeeds
da.to_dataset().to_zarr('~/Downloads/test.zarr', mode='w')

# Re-saving the opened dataset fails with the error below
opened = xr.open_zarr('~/Downloads/test.zarr')
opened.to_zarr('~/Downloads/test2.zarr', mode='w')

# Saves successfully once the filters key is removed from the encoding
opened.asset.encoding.pop('filters')
opened.to_zarr('~/Downloads/test2.zarr', mode='w')
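
To check which variables picked up a filter after the first write, something like the sketch below could be run on the opened dataset before re-saving. `variables_with_filters` is a hypothetical helper name, not an xarray function. It relies only on each variable having a dict-like `encoding`, and assumes that `encoding['filters']`, when present, is an iterable of codec objects.

```python
def variables_with_filters(ds):
    """Map variable name -> list of filter reprs found in its encoding.

    Hypothetical diagnostic helper (not an xarray API). Variables whose
    encoding has no 'filters' entry, or an empty/None one, are skipped.
    """
    found = {}
    for name, var in ds.variables.items():
        filters = var.encoding.get("filters")
        if filters:
            # repr() keeps the report readable regardless of codec type
            found[name] = [repr(f) for f in filters]
    return found
```

Going by this report, the `asset` variable of `opened` would be expected to appear in the result with a VLenUTF8 entry.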

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

TypeError                                 Traceback (most recent call last)
<ipython-input-16-b1f2f1d2b5a0> in <module>
      6 opened = xr.open_zarr('~/Downloads/test.zarr')
      7 
----> 8 opened.to_zarr('~/Downloads/test2.zarr', mode='w')

~/micromamba/envs/xr/lib/python3.8/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version)
   2097         from xarray.backends.api import to_zarr
   2098 
-> 2099         return to_zarr(  # type: ignore
   2100             self,
   2101             store=store,

~/micromamba/envs/xr/lib/python3.8/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version)
   1668     writer = ArrayWriter()
   1669     # TODO: figure out how to properly handle unlimited_dims
-> 1670     dump_to_store(dataset, zstore, writer, encoding=encoding)
   1671     writes = writer.sync(compute=compute)
   1672 

~/micromamba/envs/xr/lib/python3.8/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1277         variables, attrs = encoder(variables, attrs)
   1278 
-> 1279     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
...
   2112         # check object encoding

numcodecs/vlen.pyx in numcodecs.vlen.VLenUTF8.encode()

TypeError: expected unicode string, found 3

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 2023.1.0
pandas: 1.5.3
numpy: 1.22.4
scipy: 1.4.1
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.11.0
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.8.5
iris: None
bottleneck: 1.3.2
dask: 2022.01.1
distributed: 2022.01.1
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
fsspec: 0.8.4
cupy: None
pint: 0.16.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: None
pytest: 7.0.1
mypy: None
IPython: 7.18.1
sphinx: None

saschahofmann added the bug and needs triage labels on Mar 2, 2023
@saschahofmann
Contributor Author

Just stumbled over this issue; it seems the cause is known.

@dcherian
Contributor

Thanks @saschahofmann, closing as a duplicate.
