Original variable encodings are retained #471
Changes from all commits: 496036a, 0dae924, ad2a610, d9311aa
```diff
@@ -23,11 +23,17 @@ class XarraySchema(TypedDict):
 def dataset_to_schema(ds: xr.Dataset) -> XarraySchema:
-    """Convert the output of `dataset.to_dict(data=False)` to a schema
+    """Convert the output of `dataset.to_dict(data=False, encoding=True)` to a schema
     (Basically just adds chunks, which is not part of the Xarray output).
     """

-    d = ds.to_dict(data=False)
+    # Remove redundant encoding options
+    for v in ds.variables:
+        for option in ["_FillValue", "source"]:
+            # TODO: should be okay to remove _FillValue?
+            if option in ds[v].encoding:
+                del ds[v].encoding[option]
+    d = ds.to_dict(data=False, encoding=True)
     return XarraySchema(
         attrs=d.get("attrs"),
         coords=d.get("coords"),
```

Review comment (on `d = ds.to_dict(data=False, encoding=True)`): I think that when I first started working on this, this option didn't even exist yet! See pydata/xarray#6634. Nice when things come together. 😄
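Since the option stripping in the diff above operates on a live `xr.Dataset`, a standalone sketch may be easier to follow. The helper below mirrors the same loop using plain dicts so it runs without xarray installed; `strip_redundant_encoding` and `REDUNDANT_OPTIONS` are hypothetical names for illustration, not part of this PR:

```python
# Hypothetical helper mirroring the loop in `dataset_to_schema`: drop encoding
# options that vary per source file (like `source`) and would make otherwise
# identical schemas compare unequal.
REDUNDANT_OPTIONS = ("_FillValue", "source")

def strip_redundant_encoding(encodings: dict) -> dict:
    """Return a copy of a {variable: encoding} mapping without the redundant options."""
    return {
        var: {k: v for k, v in enc.items() if k not in REDUNDANT_OPTIONS}
        for var, enc in encodings.items()
    }

encodings = {
    "time": {"units": "days since 1970-01-01", "source": "file_0001.nc"},
    "temp": {"dtype": "float32", "_FillValue": -9999.0},
}
cleaned = strip_redundant_encoding(encodings)
print(cleaned["time"])  # {'units': 'days since 1970-01-01'}
print(cleaned["temp"])  # {'dtype': 'float32'}
```

Unlike the in-place deletion in the diff, this version returns a copy and leaves the input untouched.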
```diff
@@ -164,6 +170,8 @@ def _combine_vars(v1, v2, concat_dim, allow_both=False):
         raise DatasetCombineError(f"Can't merge datasets with the same variable {vname}")
     attrs = _combine_attrs(v1[vname]["attrs"], v2[vname]["attrs"])
     dtype = _combine_dtype(v1[vname]["dtype"], v2[vname]["dtype"])
+    # Can combine encoding using the same approach as attrs
+    encoding = _combine_attrs(v1[vname]["encoding"], v2[vname]["encoding"])
     (d1, s1), (d2, s2) = (
         (v1[vname]["dims"], v1[vname]["shape"]),
         (v2[vname]["dims"], v2[vname]["shape"]),
```

Review comment (on lines +173 to +174): Brilliant!
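The idea of reusing the attrs-combining logic for encodings can be sketched in isolation. The merge semantics below (union of keys, error on conflicting values) are an assumption for illustration; the actual `_combine_attrs` in pangeo-forge-recipes may differ in detail:

```python
class DatasetCombineError(ValueError):
    """Stand-in for the pangeo-forge-recipes exception of the same name."""

def combine_attrs(a1: dict, a2: dict) -> dict:
    # Assumed semantics: take the union of both mappings, raising if the two
    # sides disagree on a shared key. This sketch is not the library's code.
    merged = dict(a1)
    for key, value in a2.items():
        if key in merged and merged[key] != value:
            raise DatasetCombineError(
                f"Conflicting values for {key!r}: {merged[key]!r} != {value!r}"
            )
        merged[key] = value
    return merged

enc1 = {"dtype": "float32", "zlib": True}
enc2 = {"dtype": "float32", "complevel": 4}
print(combine_attrs(enc1, enc2))
# {'dtype': 'float32', 'zlib': True, 'complevel': 4}
```

Under these semantics, encodings merge cleanly when the shared keys agree, which is exactly why the attrs approach transfers to encodings.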
```diff
@@ -182,7 +190,14 @@ def _combine_vars(v1, v2, concat_dim, allow_both=False):
             )
         else:
             shape.append(l1)
-    new_vars[vname] = {"dims": dims, "attrs": attrs, "dtype": dtype, "shape": tuple(shape)}
+    new_vars[vname] = {
+        "dims": dims,
+        "attrs": attrs,
+        "dtype": dtype,
+        "shape": tuple(shape),
+        "encoding": encoding,
+    }

     return new_vars
```
```diff
@@ -195,13 +210,10 @@ def _to_variable(template, target_chunks):
     chunks = tuple(target_chunks[dim] for dim in dims)
     # we pick zeros as the safest value to initialize empty data with
     # will only be used for dimension coordinates
-    # WARNING: there are lots of edge cases around time!
-    # Xarray will pick a time encoding for the dataset (e.g. "days since 1970-01-01")
-    # and this may not be compatible with the actual values in the time coordinate
-    # (which we don't know yet)
     data = dsa.zeros(shape=shape, chunks=chunks, dtype=dtype)
-    # TODO: add more encoding
-    encoding = {"chunks": chunks}
+    encoding = template.get("encoding", {})
+    encoding["chunks"] = chunks
     return xr.Variable(dims=dims, data=data, attrs=template["attrs"], encoding=encoding)
```
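The encoding handling in `_to_variable` can be isolated as a small sketch (`build_encoding` is a hypothetical helper, not in the PR). One design note: the diff mutates the dict returned by `template.get("encoding", {})` in place, so a defensive copy, as below, avoids writing `chunks` back into a schema template that may be shared across calls:

```python
def build_encoding(template: dict, chunks: tuple) -> dict:
    # Start from the template's stored encoding (if any), then overlay the
    # target chunking. Copying first avoids mutating the shared template.
    encoding = dict(template.get("encoding", {}))
    encoding["chunks"] = chunks
    return encoding

template = {"attrs": {}, "encoding": {"units": "days since 1970-01-01"}}
enc = build_encoding(template, chunks=(100, 50))
print(enc)  # {'units': 'days since 1970-01-01', 'chunks': (100, 50)}
print(template["encoding"])  # unchanged: {'units': 'days since 1970-01-01'}
```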
Review comment (on the `_FillValue` / `source` special casing): Could you explain the rationale for special casing these two options?

Reply: I excluded these as they were causing certain test failures where expected schemas were compared with actual ones, e.g. any combiner tests using `has_correct_schema()`: https://github.com/pangeo-forge/pangeo-forge-recipes/blob/beam-refactor/tests/test_combiners.py#L98-L102 The `source` will be unique to each original source data product, and the `_FillValue` appeared to be added automatically (I can't recall the specific issue with the latter, though).

Reply: I checked the latter again, and when `_FillValue` is retained, it's being set to `nan` in `expected_schema` (as generated by the original `ds.to_dict(data=False, encoding=True)`), but only for the `lat` and `lon` coords. However, the `actual` schema doesn't contain `_FillValue` for `lat`/`lon`, and the assert fails.
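The failure mode described above can be reproduced with a toy comparison (hypothetical values, plain dicts standing in for real schemas). A `nan` `_FillValue` is doubly awkward, since `nan` also compares unequal to itself:

```python
import math

# Expected schema (as ds.to_dict(data=False, encoding=True) would produce it)
# keeps _FillValue=nan for lat/lon; the actual schema omits the key entirely.
expected = {
    "lat": {"_FillValue": math.nan, "dtype": "float64"},
    "lon": {"_FillValue": math.nan, "dtype": "float64"},
}
actual = {
    "lat": {"dtype": "float64"},
    "lon": {"dtype": "float64"},
}

# Key sets differ, so a strict schema-equality assert fails for these coords.
mismatched = sorted(
    var for var in expected if expected[var].keys() != actual[var].keys()
)
print(mismatched)  # ['lat', 'lon']

# Comparing nan values with == is itself fragile:
print(math.nan == math.nan)  # False
```

This illustrates why dropping `_FillValue` before generating the schema makes the expected and actual sides comparable.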