Original variable encodings are retained #471
Conversation
This looks great. A few thoughts below.
```python
d = ds.to_dict(data=False)
# Remove redundant encoding options
for v in ds.variables:
    for option in ["_FillValue", "source"]:
```
Could you explain the rationale for special casing these two options?
I excluded these as they were causing certain test failures where expected schemas were compared with actual, e.g. any combiner tests using `has_correct_schema()`:

```python
def has_correct_schema(expected_schema):
    def _check_results(actual):
        assert len(actual) == 1
        schema = actual[0]
        assert schema == expected_schema
```
The `source` will be unique to each original source data product, and the `_FillValue` appeared to be added automatically (I can't recall the specific issue with the latter though).
I checked the latter again, and when `_FillValue` is retained, it's being set to `nan` in `expected_schema` (as generated by the original `ds.to_dict(data=False, encoding=True)`), but only for the `lat` and `lon` coords. However, the `actual` schema doesn't contain `_FillValue` for `lat`/`lon`, and the assert fails.
```python
# TODO: should be okay to remove _FillValue?
if option in ds[v].encoding:
    del ds[v].encoding[option]
d = ds.to_dict(data=False, encoding=True)
```
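The intent of the hunk above can be sketched with plain dicts standing in for `ds[v].encoding` (the helper name below is illustrative, not code from the PR): options that vary across source files are pruned before schemas are compared.

```python
# Options excluded in the PR because they vary between source files
# ("source") or are added automatically ("_FillValue").
EXCLUDED_OPTIONS = ["_FillValue", "source"]

def prune_encoding(encoding):
    """Return a copy of an encoding dict without the excluded options."""
    return {k: v for k, v in encoding.items() if k not in EXCLUDED_OPTIONS}

enc = {
    "units": "days since 1981-01-01",
    "_FillValue": float("nan"),
    "source": "file_000.nc",
}
print(prune_encoding(enc))  # {'units': 'days since 1981-01-01'}
```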
I think that when I first started working on this, this option didn't even exist yet! See pydata/xarray#6634
Nice when things come together. 😄
```python
# Can combine encoding using the same approach as attrs
encoding = _combine_attrs(v1[vname]["encoding"], v2[vname]["encoding"])
```
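One plausible merge strategy for two per-variable encoding dicts can be sketched as below. This is illustrative only: the actual `_combine_attrs` in pangeo-forge-recipes may behave differently (e.g. it could raise on conflicting values rather than dropping them).

```python
def combine_encoding(e1, e2):
    """Hypothetical combiner: keep only keys on which both inputs agree.
    Not the real _combine_attrs implementation."""
    return {k: v for k, v in e1.items() if k in e2 and e2[k] == v}

a = {"units": "days since 1981-01-01", "source": "file_000.nc"}
b = {"units": "days since 1981-01-01", "source": "file_001.nc"}
print(combine_encoding(a, b))  # {'units': 'days since 1981-01-01'}
```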
Brilliant!
pangeo_forge_recipes/aggregation.py
Outdated
```python
# TODO: previous comment regarding encoding should no longer
# be relevant now that variable encoding will be used if available
```
Delete the irrelevant comment please. 🙏
Will do
tests/test_aggregation.py
Outdated
```python
# Confirm original time units have been preserved
assert ds.time.encoding["units"] == dst.time.encoding["units"]
```
Would this test fail without the changes in `aggregation.py`?
I checked, and it would fail for the wrong reason (`KeyError`) in that situation. I've fixed it and will commit (also in `test_writers`).
tests/test_writers.py
Outdated
```python
# Zarr retains the original "days since %Y:%m%d" and removes " %H:%M:%S"
assert " ".join(ds.time.encoding["units"].split(" ")[0:-1]) == ds_target.time.encoding["units"]
```
Zarr has nothing to do with it; xarray and cftime are managing this logic. I think it drops the time if the time is 0:0:0.
Wouldn't it be easier to fix this at the source (i.e. strip the time in line 39 of `data_generation.py`)? That way you can just do

```python
assert ds.time.encoding == ds_target.time.encoding
```
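The units-string normalization being discussed (a trailing midnight time-of-day disappearing on round-trip) can be mimicked with a small pure-Python helper. The function name is hypothetical; it just makes the behaviour of the test's workaround explicit.

```python
def strip_time_of_day(units):
    """Drop a trailing 'H:M:S' component from a CF time-units string,
    mimicking the normalization seen when a midnight reference time
    is encoded. Illustrative helper, not library code."""
    parts = units.split(" ")
    if ":" in parts[-1]:
        return " ".join(parts[:-1])
    return units

print(strip_time_of_day("days since 2010-01-01 0:0:0"))  # days since 2010-01-01
print(strip_time_of_day("days since 2010-01-01"))        # days since 2010-01-01
```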
I originally stripped it in `data_generation.py` as you suggest, but I was sure that this was causing failures only in `test_writer.py`. I've restored this and all tests are passing, so I'm not sure what went wrong previously. Commit is on the way.
I've been thinking through this more, and I'm worried that there is an important edge case that is not covered by this. Currently, if … (see `pangeo_forge_recipes/aggregation.py`, lines 136 to 145 in 241ca87).

So when pangeo forge goes to create the target dataset, it will not have any information about the time resolution and needed time encoding, and we will be back in the situation described by this comment (see `pangeo_forge_recipes/aggregation.py`, lines 196 to 202 in 241ca87).

This PR and the use case that motivated it assume that the input files will all have the same encoding. But what if they have different encoding? For example, we could have input files representing hourly data with time encoded like this: … (yes, I have seen this in real datasets). Since the time encoding is not uniform, … However, xarray would handle the data fine in an … If we stick to our current approach of not reading and combining the actual coordinates, we will have to recreate some of this logic within … That could be a heavy lift. So perhaps, for the interim, we would just want to raise an error if the time encoding is different for different datasets? More test cases would be useful at exposing edge cases.
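The interim safeguard proposed above (raise rather than silently combine when inputs disagree) could be sketched like this. The helper is hypothetical and not part of pangeo-forge-recipes:

```python
def check_uniform_time_units(encodings):
    """Fail loudly if input files disagree on the time 'units' encoding,
    instead of silently combining them. Hypothetical interim safeguard."""
    units = {e.get("units") for e in encodings}
    if len(units) > 1:
        raise ValueError(f"non-uniform time encoding across inputs: {sorted(units)}")
    return units.pop()

same = [{"units": "days since 2010-01-01"}, {"units": "days since 2010-01-01"}]
print(check_uniform_time_units(same))  # days since 2010-01-01
```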
Based on this, I should restore some of the original warning comment text.
Yep, I believe the CCMP files all have the same encoding (units).
When I originally created the Zarr store before the Pangeo Forge recipe, I used …
I think replicating or reusing the functionality in … That can definitely be done.
Actually, a nice middle ground would be to make it really easy to specify the desired time encoding. Like, if I know a priori that my data are daily resolution, I could just put the time encoding directly in the recipe. I'm wondering what would be the right way to specify this... 🤔
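One way this could look is sketched below. This API does not exist in pangeo-forge-recipes; it only illustrates the idea of declaring the target time encoding in the recipe and letting it override whatever was gleaned from the inputs.

```python
# Hypothetical recipe-level declaration of the desired target encoding.
target_encoding = {
    "time": {"units": "days since 1970-01-01", "calendar": "standard"},
}

def apply_target_encoding(schema_encodings, target):
    """Override encodings gleaned from the inputs with the recipe
    author's declared target encoding. Illustrative helper only."""
    merged = {var: dict(enc) for var, enc in schema_encodings.items()}
    for var, enc in target.items():
        merged.setdefault(var, {}).update(enc)
    return merged

from_inputs = {"time": {"units": "hours since 2010-01-01"}}
print(apply_target_encoding(from_inputs, target_encoding))
```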
Would this only work when all input files had the same time encoding? E.g. I guess we'd be back to the current issue in situations like your example above that has multiple units: …
No, because the data are decoded before they are written, and then re-encoded to match the target encoding (see `pangeo_forge_recipes/writers.py`, lines 27 to 36 in 81a9ce5).
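The decode-then-re-encode flow can be illustrated with a pure-Python analogue (these helpers are illustrative stand-ins, not the `writers.py` code): values stored as "hours since …" are decoded to absolute datetimes, then re-encoded against the target's "days since …" units.

```python
from datetime import datetime, timedelta

def decode(values, units):
    """Decode CF-style numeric time values to datetimes (toy version)."""
    kind, _, ref = units.partition(" since ")
    origin = datetime.fromisoformat(ref)
    step = {"days": timedelta(days=1), "hours": timedelta(hours=1)}[kind]
    return [origin + v * step for v in values]

def encode(dates, units):
    """Re-encode datetimes against a (possibly different) units string."""
    kind, _, ref = units.partition(" since ")
    origin = datetime.fromisoformat(ref)
    seconds_per = {"days": 86400, "hours": 3600}[kind]
    return [(d - origin).total_seconds() / seconds_per for d in dates]

dates = decode([0, 24, 48], "hours since 2010-01-01")
print(encode(dates, "days since 2010-01-01"))  # [0.0, 1.0, 2.0]
```

This is why variation in the input files' encodings stops mattering once a single target encoding is fixed.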
I understand it now, i.e. the specified target encoding would override any potential variation in input file encodings.
I opened #480 to track the idea of specifying encoding in the recipe. We can address that in a follow-up PR. For now this is a big step forward.
`time` encoding units not used in Zarr generated by `beam-refactor` branch #465
`source` and `_FillValue` are excluded