Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Original variable encodings are retained #471

Conversation

derekocallaghan
Copy link
Contributor

$ pytest tests
======================================================================================================================== test session starts =========================================================================================================================
platform linux -- Python 3.9.13, pytest-7.2.0, pluggy-1.0.0
rootdir: .../pangeo-forge-recipes, configfile: setup.cfg
plugins: anyio-3.6.2, lazy-fixture-0.6.3
collected 318 items / 1 skipped                                                                                                                                                                                                                                      

tests/test_aggregation.py .....                                                                                                                                                                                                                                [  1%]
tests/test_chunk_grid.py ........                                                                                                                                                                                                                              [  4%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [  4%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [  5%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 10%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [ 11%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [ 12%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 17%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [ 17%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [ 18%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 23%]
tests/test_combiners.py ......                                                                                                                                                                                                                                 [ 25%]
tests/test_locking.py ......                                                                                                                                                                                                                                   [ 27%]
tests/test_openers.py ........................                                                                                                                                                                                                                 [ 35%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 40%]
tests/test_openers.py ..........................................................................                                                                                                                                                               [ 63%]
tests/test_patterns.py ........ss.............ss..............                                                                                                                                                                                                 [ 75%]
tests/test_pipelines.py ........                                                                                                                                                                                                                               [ 78%]
tests/test_rechunking.py .....................................                                                                                                                                                                                                 [ 89%]
tests/test_storage.py .......                                                                                                                                                                                                                                  [ 92%]
tests/test_transforms.py ......................                                                                                                                                                                                                                [ 99%]
tests/test_utils.py ..                                                                                                                                                                                                                                         [ 99%]
tests/test_writers.py .                                                                                                                                                                                                                                        [100%]

========================================================================================================================== warnings summary ==========================================================================================================================

<EXCLUDED WARNINGS>

====================================================================================================== 314 passed, 5 skipped, 28 warnings in 190.78s (0:03:10) =======================================================================================================

@derekocallaghan derekocallaghan changed the title Original variable encodings are retained (#465) Original variable encodings are retained Jan 12, 2023
@rabernat rabernat self-requested a review January 16, 2023 19:51
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. A few thoughts below.

d = ds.to_dict(data=False)
# Remove redundant encoding options
for v in ds.variables:
for option in ["_FillValue", "source"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the rationale for special casing these two options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I excluded these as they were causing certain test failures where expected schemas were compared with actual. E.g. any combiner tests using has_correct_schema():

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/beam-refactor/tests/test_combiners.py#L98-L102

def has_correct_schema(expected_schema):
    def _check_results(actual):
        assert len(actual) == 1
        schema = actual[0]
        assert schema == expected_schema

The source will be unique to each original source data product, and the _FillValue appeared to be added automatically (I can't recall the specific issue with the latter though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the latter again, and when _FillValue is retained, it's being set to nan in expected_schema (as generated by the original ds.to_dict(data=False, encoding=True), but only for the lat and lon coords. However, the actual schema doesn't contain _FillValue for lat/lon, and the assert fails.

# TODO: should be okay to remove _FillValue?
if option in ds[v].encoding:
del ds[v].encoding[option]
d = ds.to_dict(data=False, encoding=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that when I first started working on this, this option didn't even exist yet! See pydata/xarray#6634

Nice when things come together. 😄

Comment on lines +173 to +174
# Can combine encoding using the same approach as attrs
encoding = _combine_attrs(v1[vname]["encoding"], v2[vname]["encoding"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant!

Comment on lines 217 to 218
# TODO: previous comment regarding encoding should no longer
# be relevant now that variable encoding will be used if available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the irrelevant comment please. 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Comment on lines 34 to 35
# Confirm original time units have been preserved
assert ds.time.encoding["units"] == dst.time.encoding["units"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this test fail without the changes in aggregation.py?

Copy link
Contributor Author

@derekocallaghan derekocallaghan Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and it would fail for the wrong reason (KeyError) in that situation, I've fixed it and will commit (also in test_writers)

Comment on lines 143 to 144
# Zarr retains the original "days since %Y:%m%d" and removes " %H:%M:%S"
assert " ".join(ds.time.encoding["units"].split(" ")[0:-1]) == ds_target.time.encoding["units"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zarr has nothing to do with it. Xarray and cftime are managing this logic. I think it drops of the time if the time is 0:0:0.

Wouldn't it be easier to fix this at the source (i.e. strip the time in line 39 of data_generation.py)? That way you can just do

assert ds.time.encoding == ds_target.time.encoding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally stripped it in data_generation.py as you suggest, but I was sure that this was causing failures only in test_writer.py. I've restored this and all tests are passing, so I'm not sure what went wrong previously. Commit is on the way.

@rabernat
Copy link
Contributor

rabernat commented Jan 17, 2023

I've been thinking through this more, and and I'm worried that there is an important edge case that is not covered by this. Currently, if encoding['units'] is not the same across all of the datasets, it will simply be dropped, following the logic in _combine_attrs:

def _combine_attrs(a1: dict, a2: dict) -> dict:
if not a1:
return a2
# for now, only keep attrs that are the same in both datasets
common_attrs = set(a1) & set(a2)
new_attrs = {}
for key in common_attrs:
if a1[key] == a2[key]:
new_attrs[key] = a1[key]
return new_attrs

So when pangeo forge goes to create the target dataset, it will not have any information about the time resolution and needed time encoding, and we will be back in the situation described by this comment:

# we pick zeros as the safest value to initialize empty data with
# will only be used for dimension coordinates
# WARNING: there are lots of edge cases aroudn time!
# Xarray will pick a time encoding for the dataset (e.g. "days since days since 1970-01-01")
# and this may not be compatible with the actual values in the time coordinate
# (which we don't know yet)
data = dsa.zeros(shape=shape, chunks=chunks, dtype=dtype)

This PR and the use case that motivated it assume that the input files will all have the same encoding. But what if they have different encoding?

For example, we could have input files representing hourly data with time encoded like this

0 hours since 2000-01-01 00:00:00
0 hours since 2000-01-01 01:00:00
0 hours since 2000-01-01 02:00:00

(yes I have seen this in real datasets)

since the time encoding is not uniform, units would just be dropped. And Xarray would probably pick days since ... as the encoding when initializing the target dataset. This would probably screw up the time coordinate.

However, xarray would handle the data fine in an open_mfdataset situation, because it would fist decode each time to an actual datetime type. Upon saving, it would determine the encoding using this function.

If we stick to our current approach of not reading and combining the actual coordinates, we will have to recreate some of this logic within aggregate.py. Specifically, we will want to determine the minimum frequency (days, hours, s, etc.) and reference date for the time encoding by intelligently combining the encoding of each dataset.

That could be a heavy lift. So perhaps, for the interim, we would just want to raise an error if the time encoding is different for different datasets?

More test cases would be useful at exposing edge cases.

@derekocallaghan
Copy link
Contributor Author

derekocallaghan commented Jan 17, 2023

So when pangeo forge goes to create the target dataset, it will not have any information about the time resolution and needed time encoding, and we will be back in the situation described by this comment:

# we pick zeros as the safest value to initialize empty data with
# will only be used for dimension coordinates
# WARNING: there are lots of edge cases aroudn time!
# Xarray will pick a time encoding for the dataset (e.g. "days since days since 1970-01-01")
# and this may not be compatible with the actual values in the time coordinate
# (which we don't know yet)
data = dsa.zeros(shape=shape, chunks=chunks, dtype=dtype)

Based on this, I should restore some of the original warning comment text.

This PR and the use case that motivated it assume that the input files will all have the same encoding. But what if they have different encoding?

Yep, I believe the CCMP files all have the same encoding (units).

However, xarray would handle the data fine in an open_mfdataset situation, because it would fist decode each time to an actual datetime type. Upon saving, it would determine the encoding using this function.

When I originally created the Zarr store before the Pangeo Forge recipe, I used open_mfdataset(), which worked and generated a suitable time encoding. (I was thinking of suggesting this just before I read your comment here).

If we stick to our current approach of not reading and combining the actual coordinates, we will have to recreate some of this logic within aggregate.py. Specifically, we will want to determine the minimum frequency (days, hours, s, etc.) and reference date for the time encoding by intelligently combining the encoding of each dataset.

I think replicating or reusing the functionality in xarray.coding.time.infer_datetime_units() may be something worth looking at, possibly as a custom transform that precedes DetermineSchema in StoreToZarr.expand(), or instead in OpenWithinXarray.expand(). I need to do something not too far removed for the ASCAT recipes, where I'm hoping to use a custom transform to reorder the collection based on a modified time, prior to StoreToZarr. I'm not sure yet whether this is possible, but I can look into it.

That could be a heavy lift. So perhaps, for the interim, we would just want to raise an error if the time encoding is different for different datasets?

That can definitely be done.

@rabernat
Copy link
Contributor

Actually, a nice middle ground would be to make it really easy to specify the desired time encoding. Like, if I know a priori that my data are daily resolution, I could just put the time encoding directly in the recipe. I'm wondering what would be the right way to specify this... 🤔

@derekocallaghan
Copy link
Contributor Author

Actually, a nice middle ground would be to make it really easy to specify the desired time encoding. Like, if I know a priori that my data are daily resolution, I could just put the time encoding directly in the recipe. I'm wondering what would be the right way to specify this... thinking

Would this only work when all input files had the same time encoding? E.g. I guess we'd be back to the current issue in situations like your example above that has multiple units:

0 hours since 2000-01-01 00:00:00
0 hours since 2000-01-01 01:00:00
0 hours since 2000-01-01 02:00:00

@rabernat
Copy link
Contributor

Would this only work when all input files had the same time encoding?

No, because the data are decoded before they are written, and then re-encoded to match the target encoding.

def _store_data(vname: str, var: xr.Variable, index: Index, zgroup: zarr.Group) -> None:
zarr_array = zgroup[vname]
# get encoding for variable from zarr attributes
var_coded = var.copy() # copy needed for test suit to avoid modifying inputs in-place
var_coded.encoding.update(zarr_array.attrs)
var_coded.attrs = {}
var = xr.backends.zarr.encode_zarr_variable(var_coded)
data = np.asarray(var.data)
region = _region_for(var, index)
zarr_array[region] = data

@derekocallaghan
Copy link
Contributor Author

Would this only work when all input files had the same time encoding?

No, because the data are decoded before they are written, and then re-encoded to match the target encoding.

I understand it now, i.e. the specified target encoding would override any potential variation in input file encodings.

@rabernat
Copy link
Contributor

I opened #480 to track the idea of specifying encoding in the recipe. We can address that in a follow-up PR. For now this is a big step forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

2 participants