Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when writing string coordinate variables to zarr #3476

Open
jsadler2 opened this issue Nov 1, 2019 · 17 comments
Open

Error when writing string coordinate variables to zarr #3476

jsadler2 opened this issue Nov 1, 2019 · 17 comments
Labels
bug topic-backends topic-zarr Related to zarr storage library

Comments

@jsadler2
Copy link

jsadler2 commented Nov 1, 2019

I saved an xarray dataset to zarr using to_zarr. I then later tried to read that dataset from the original zarr, re-chunk it, and then write to a new zarr. When I did that I get a strange error. I attached a zip of minimal version of the zarr dataset that I am using.
test_sm_zarr.zip

MCVE Code Sample

import xarray as xr

sm_from_zarr = xr.open_zarr('test_sm_zarr')
sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')

Expected Output

No error

Problem Description

I get this error:

C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\merge.py:18: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  PANDAS_TYPES = (pd.Series, pd.DataFrame, pd.Panel)
C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\dataarray.py:1829: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  'DataArray', pd.Series, pd.DataFrame, pd.Panel]:
Traceback (most recent call last):
  File "rechunk_test.py", line 38, in <module>
    sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\dataset.py", line 1414, in to_zarr
    consolidated=consolidated, append_dim=append_dim)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\api.py", line 1101, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\api.py", line 929, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\zarr.py", line 366, in store
    unlimited_dims=unlimited_dims)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\zarr.py", line 432, in set_variables
    writer.add(v.data, zarr_array)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\common.py", line 173, in add
    target[...] = source
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1115, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1210, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1501, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1550, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1659, in _chunk_setitem
    fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1723, in _chunk_setitem_nosync
    cdata = self._encode_chunk(chunk)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1769, in _encode_chunk
    chunk = f.encode(chunk)
  File "numcodecs/vlen.pyx", line 108, in numcodecs.vlen.VLenUTF8.encode
TypeError: expected unicode string, found 20

BUT
I think it has something to do with the datatype of one of my coordinates, site_code. Because, if it do this I get no error:

import xarray as xr

sm_from_zarr = xr.open_zarr('test_sm_zarr')
sm_from_zarr['site_code'] = sm_from_zarr.site_code.astype('str')
sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')

Before converting the datatype of the site_code coordinate is object

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.2
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: 1.5.1.2
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.5.1
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.1.2
IPython: 7.8.0
sphinx: None

@jhamman
Copy link
Member

jhamman commented Nov 1, 2019

Hi @jsadler2 - Can you show us what sm_from_zarr looks like (print(sm_from_zarr)) will do.

@jsadler2
Copy link
Author

jsadler2 commented Nov 5, 2019

Sure

In [5]: print(ds)
<xarray.Dataset>
Dimensions:     (datetime: 20, site_code: 20)
Coordinates:
  * datetime    (datetime) datetime64[ns] 1970-01-01 ... 1970-01-01T19:00:00
  * site_code   (site_code) object '01302020' '01303000' ... '01315000'
Data variables:
    streamflow  (datetime, site_code) float64 dask.array<shape=(20, 20), chunksize=(20, 20)>

@jhamman jhamman added bug topic-backends topic-zarr Related to zarr storage library labels Nov 5, 2019
@jhamman
Copy link
Member

jhamman commented Nov 5, 2019

Thanks @jsadler2 - I think this is a bug in xarray. We should be able to round trip the site_coordinate variable.

@dnerini
Copy link

dnerini commented Apr 27, 2020

I'm experiencing the same issue, which seems to be also related to one of my coordinates having object as datatype. Luckily, the workaround proposed by @jsadler2 works in my case, too.

@borispf
Copy link

borispf commented May 19, 2020

I ran into the same issue. It seems like zarr is inserting VLenUTF8 as a filter, but the loaded data array already has that as a filter so it's trying to double encode . So another workaround is to delete the filters on site_code:

import xarray as xr

sm_from_zarr = xr.open_zarr('tmp/test_sm_zarr')
del sm_from_zarr.site_code.encoding["filters"]
sm_from_zarr.to_zarr('tmp/test_sm_zarr_from', mode='w')

@andersy005
Copy link
Member

I ran into the same issue. It seems like zarr is inserting VLenUTF8 as a filter, but the loaded data array already has that as a filter so it's trying to double encode

Indeed. And it appears that these lines are the culprits:

# new variable
encoding = extract_zarr_variable_encoding(
v, raise_on_invalid=check, name=vn
)

ipdb> v
<xarray.Variable (x: 3)>
array(['a', 'b', 'c'], dtype=object)
ipdb> check
False
ipdb> vn
'x'
ipdb> encoding = extract_zarr_variable_encoding(v, raise_on_invalid=check, name=vn)
ipdb> encoding
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': [VLenUTF8()]}

zarr_array = self.ds.create(
name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding
)

Zarr appears to be ignoring the filter information from xarray. Zarr proceeds to extracting its own filter. As a result, we end up with two filters:

ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding)
ipdb> self.ds['x']._meta
{'zarr_format': 2, 'shape': (3,), 'chunks': (3,), 'dtype': dtype('O'), 'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}, 'fill_value': None, 'order': 'C', 'filters': [{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]}
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]

As @borispf and @jsadler2 suggested, clearing the filters from encoding before initiating the zarr store creation works:

ipdb> enc # without filters
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': []}
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **enc)
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}]

@jhamman since you are more familiar with the internals of zarr + xarray, should we default to ignoring filter information from xarray and let zarr take care of the extraction of filter information?

@jhamman
Copy link
Member

jhamman commented Dec 16, 2020

Thanks @andersy005 for the write up and for digging into this. Doesn't this seem like it could be a bug in Zarr's create method?

@Hoeze
Copy link

Hoeze commented May 15, 2021

Hi, I also keep running into this issue all the time.
Right now, there is no way of round-tripping xr.open_zarr().to_zarr(), also because of #5219.

The only workaround that seems to help is the following:

to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()

@bluetyson
Copy link

Thanks for that, looks like I just came across this.

@RichardScottOZ
Copy link
Contributor

Think it worked with one variable - @Hoeze workaround was necessary for more than one.

@delgadom
Copy link
Contributor

delgadom commented May 2, 2022

This has been happening a lot to me lately when writing to zarr. Thanks to @bolliger32 for the tip - this usually works like a charm:

for v in list(ds.coords.keys()):
    if ds.coords[v].dtype == object:
        ds.coords[v] = ds.coords[v].astype("unicode")

for v in list(ds.variables.keys()):
    if ds[v].dtype == object:
        ds[v] = ds[v].astype("unicode")

For whatever reason, clearing the encoding and/or using .astype(str) doesn't seem to work as reliably. I don't have a good MRE for this but hope it helps others with the same issue.

note the flag raised by @FlorisCalkoen below - don't just throw this at all your writes! there are other object types (e.g. CFTime) which you probably don't want to convert to string. This is just a patch to get around this issue for dataarrays with string coords/variables.

@FlorisCalkoen
Copy link

@delgadom I just noticed that your proposed solution has the side effect of also converting cftime objects (e.g., below) to unicode strings.

xarray.DataArray 'time' (time: 1) array([cftime.DatetimeNoLeap(2007, 7, 2, 12, 0, 0, 0, has_year_zero=True)], dtype=object) 

I updated your lines using @Hoeze' clear function and that seems to work for now.

    for v in list(ds.coords.keys()):
        if ds.coords[v].dtype == object:
            ds[v].encoding.clear()

    for v in list(ds.variables.keys()):
        if ds[v].dtype == object:
            ds[v].encoding.clear()

@delgadom
Copy link
Contributor

delgadom commented Aug 8, 2022

ha - yeah that's a good flag. I definitely didn't intend for that to be a universally applied patch! so probably should have included a buyer beware. but we did find that clearing the encoding doesn't always do the trick for string arrays. So a comprehensive patch will probably need to be more nuanced.

@saschahofmann
Copy link
Contributor

My impression is that keeping the zarr encoding leads to a bunch of issues (see my issue above) or the current one. There also seems to be an issue with chunking being preserved because the array encodings arent overwritten, but I cant find that issue right now. Since all of these issues are resolved by popping the zarr encoding I am wondering what are the downsides of this and whether it'd be easier to not keep that encoding at all?

@max-sixty
Copy link
Collaborator

I find this annoying too!

I think we can fold it into #6323, though it's also common enough that maybe we leave it open for future travellers

@cisaacstern
Copy link
Contributor

I hit this issue this week, and the workaround in #3476 (comment) worked for me.

@DahnJ
Copy link

DahnJ commented Jun 11, 2024

I run into this issue fairly commonly (usually during playing copying of a dataset xr.open_zarr(..).to_zarr(..)) and the fix in #3476 (comment) works for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

No branches or pull requests