Saving and loading an array of strings changes datatype to object #7652
Comments
It seems that the "string" information is stored in the xarray/xarray/backends/h5netcdf_.py Lines 208 to 210 in f1ff956
but this encoding is not "applied" to the dtype of the dataset's variable. |
A similar problem where saving, loading, saving, loading changes the dtype:
import xarray as xr
da1 = xr.DataArray(data=[True], dims=["x"], coords={"x": [0]})
da1.to_netcdf("test.nc", mode="w")
da2 = xr.load_dataarray("test.nc")
da2.to_netcdf("test.nc", mode="w")
da3 = xr.load_dataarray("test.nc")
assert da1.dtype == da3.dtype, "Dtypes don't match" |
Another fun one where
import xarray as xr
da = xr.DataArray(data=[1], dims=["x"], coords={"x": [0]})
da.to_netcdf("test.nc", mode="w")
da2 = xr.load_dataarray("test.nc")
da.dtype, da2.dtype |
@basnijholt The string issue is essentially a netCDF/numpy issue with VLEN types. XRef: https://unidata.github.io/netcdf4-python/#dealing-with-strings
numpy creates a unicode array (here <U1) if no dtype is given, as in your case, and that array is written as a VLEN string. At least the netCDF4 and h5netcdf backends are consistent in their writing (creating similar HDF5 files) and reading back (object dtype):
plain netCDF4
import netCDF4 as nc
import numpy as np
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
auto = False
with nc.Dataset("test-plain-netcdf4.nc", mode="w") as ds:
    print("Write NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    ds.createDimension("x", size=2)
    ds.createDimension("y", size=2)
    var = ds.createVariable("da", data.dtype.str, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with nc.Dataset("test-plain-netcdf4.nc") as ds:
    print("\nRead NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
source dtype: <U1
Write NC-File
Variable
<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>
Contents
[['a' 'b']
['c' 'd']]
object
Read NC-File
Variable
<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>
Contents
[['a' 'b']
['c' 'd']]
object
netcdf test-plain-netcdf4 {
dimensions:
x = 2 ;
y = 2 ;
variables:
string da(x, y) ;
data:
da =
"a", "b",
"c", "d" ;
}
HDF5 "test-plain-netcdf4.nc" {
DATASET "da" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
DATA {
(0,0): "a", "b",
(1,0): "c", "d"
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): (), ()
}
}
ATTRIBUTE "_Netcdf4Coordinates" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): 0, 1
}
}
}
}
plain h5netcdf
import h5netcdf.legacyapi as h5nc
import numpy as np
import h5py
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
with h5nc.Dataset("test-plain-h5netcdf.nc", mode="w") as ds:
    print("Write NC-File")
    ds.createDimension("x", 2)
    ds.createDimension("y", 2)
    dtype = h5py.string_dtype()
    print("Source dtype:", dtype)
    var = ds.createVariable("da", dtype, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with h5nc.Dataset("test-plain-h5netcdf.nc") as ds:
    print("\nRead NC-File")
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
source dtype: <U1
Write NC-File
Source dtype: object
Variable
<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>
Contents
[['a' 'b']
['c' 'd']]
object
Read NC-File
Variable
<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>
Contents
[['a' 'b']
['c' 'd']]
object
netcdf test-plain-h5netcdf {
dimensions:
x = 2 ;
y = 2 ;
variables:
string da(x, y) ;
data:
da =
"a", "b",
"c", "d" ;
}
HDF5 "test-plain-h5netcdf.nc" {
DATASET "da" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
DATA {
(0,0): "a", "b",
(1,0): "c", "d"
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): (), ()
}
}
ATTRIBUTE "_Netcdf4Coordinates" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): 0, 1
}
}
ATTRIBUTE "_Netcdf4Dimid" {
DATATYPE H5T_STD_I32LE
DATASPACE SCALAR
DATA {
(0): 0
}
}
}
}
Both get written out as:
If you use fixed length strings (eg.
import numpy as np
import xarray as xr
# Make an xarray with an array of fixed-length strings
data = np.array([["a", "b"], ["c", "d"]], dtype="|S1")
da = xr.DataArray(
data=data,
dims=["x", "y"],
coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')
# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"
|
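For readers who need a fixed-width dtype back after loading a VLEN string variable, here is a minimal numpy-only sketch. It is not xarray's own mechanism. It assumes the loaded values arrive as an object array of Python str, as in the outputs above, and simulates that with astype(object) instead of reading a file. The variable names are illustrative.

```python
import numpy as np

# Fixed-width unicode array, as created by numpy when no dtype is given.
original = np.array([["a", "b"], ["c", "d"]])

# Stand-in for what the backends return for a VLEN string variable:
# an object array holding Python str instances (assumption, no file I/O here).
loaded = original.astype(object)

# Casting to str lets numpy pick the smallest fixed width that fits all items.
restored = loaded.astype(str)

print(original.dtype, loaded.dtype, restored.dtype)
```

The cast copies the data, so for large arrays it may be preferable to keep the object dtype and convert only where a fixed-width dtype is actually required.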
That's also an issue with the netCDF file format: it has no bool dtype.
data = np.array([True], dtype=bool)
with nc.Dataset("test-bool-netcdf4.nc", mode="w") as ds:
    ds.createDimension("x", size=1)
    var = ds.createVariable("da", data.dtype.str, dimensions=("x",))
    var[:] = data
Update: Reason: Lines 400 to 404 in f1ff956
|
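Since the file format has no bool type, one common way around it is to store booleans as 8-bit integers and keep a marker that says the values were boolean. The following is a hedged numpy-only sketch of that idea. The helper names encode_bool and decode_bool and the "dtype" marker key are hypothetical, chosen for illustration, and are not taken from the code referenced above.

```python
import numpy as np


def encode_bool(values, attrs):
    """Return int8 values plus attrs carrying a marker if the input was boolean."""
    values = np.asarray(values)
    if values.dtype == np.bool_:
        # Hypothetical marker key; records that the data was boolean before writing.
        attrs = {**attrs, "dtype": "bool"}
        values = values.astype("i1")
    return values, attrs


def decode_bool(values, attrs):
    """Cast back to bool when the marker is present, and drop the marker."""
    attrs = dict(attrs)
    if attrs.pop("dtype", None) == "bool":
        values = np.asarray(values).astype(bool)
    return values, attrs


stored, stored_attrs = encode_bool(np.array([True, False]), {})
restored, restored_attrs = decode_bool(stored, stored_attrs)
print(stored.dtype, restored.dtype)
```

Note that in this sketch decoding removes the marker, so a second write only restores bool if the encoder adds the marker again. A scheme of this kind has to keep the marker and the in-memory dtype in sync across repeated save and load cycles.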
Can't reproduce this one with my environment. See above for details. |
@basnijholt I'd appreciate if you could test #7654 for that particular case. Update: added another commit which handles the vlen string case. |
Thanks a lot @kmuehlbauer! I replied in your PR 😄
|
Great, much appreciated, thanks! Let's iterate over there then. |
OK, I've finally gotten to the bottom of this, so I'm writing my findings here:
This works with
My suggestion would be, just use Footnotes |
@kmuehlbauer this is amazing! It would be very valuable to add this list of limitations to the documentation: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf |
@kmuehlbauer, great! I can confirm that both problems are indeed fixed on my end when using |
I've added a bit to this over at #7654. |
What is your issue?
See the code below
Now da_loaded.dtype is dtype('O'). The same happens with engine="h5netcdf".