-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read grid mapping and bounds as coords #2844
Read grid mapping and bounds as coords #2844
Conversation
This does not cooperate with slicing/pulling out individual variables. `grid_mapping` should only be associated with variables that have horizontal dimensions or coordinates. `bounds` should stay associated despite having more dimensions. An alternate approach to dealing with bounds is to use a `pandas.IntervalIndex` http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex and use where the coordinate is within its cell to determine on which side the intervals are closed (`x_dim == x_dim_bnds[:, 0]` corresponds to "left", `x_dim == x_dim_bnds[:, 1]` corresponds to "right", and anything else is "neither"). This would stay through slicing and might already be used for `.groupby_bins()`, but would not generalize to boundaries of multidimensional coordinates unless someone implements a multidimensional generalization of `pd.IntervalIndex`. I do not yet know where to put tests for this. This should probably also be mentioned in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sympathetic to this change. My main concerns are mentioned inline below: the way this is currently written encoding
/attrs
are used inconsistently with other CF metadata.
Alternatively, we might only put these fields in encoding
when reading from disk, and only use encoding
when choosing how to write data. The downside is that these attributes would not be as visible in xarray's data model.
I can shift this to use encoding only, but I'm having trouble figuring out where that code would go. Would the preferred path be to create VariableCoder classes for each and add them to encode_cf_variable, then add tests to xarray.tests.test_coding? |
The current VariableCoder interface only works for coders that work at the level of individual variables. But coordinates only make sense on a dataset, so we'll need a different abstraction for that, e.g., a DatasetCoder? For now, it's probably fine to keep this logic where you have it currently. |
Discussion on GH2844 indicates that encoding is the preferred location for how things are stored on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I shifted everything to use encoding
rather than encoding
and attrs
, and added enough new code that my new tests pass locally.
I'd like this one to be merged very much. Is there anything holding this back? Also, it might be nice to update the documentation with info on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a minor code suggestion, otherwise this looks good to me 👍 .
Please also add a brief note to the release notes in whats-new.rst
xarray/conventions.py
Outdated
@@ -235,6 +235,11 @@ def encode_cf_variable(var, needs_copy=True, name=None): | |||
var = maybe_default_fill_value(var) | |||
var = maybe_encode_bools(var) | |||
var = ensure_dtype_not_object(var, name=name) | |||
|
|||
if var.encoding.get('grid_mapping', None) is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to supply a default value of None
with .get()
, that is already the default:
if var.encoding.get('grid_mapping', None) is not None: | |
if var.encoding.get('grid_mapping') is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But actually: are there any cases where someone would explicitly have grid_mapping=None
in encoding
? If not, then let's just check:
if var.encoding.get('grid_mapping', None) is not None: | |
if 'grid_mapping' in var.encoding: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not. That detail should go in documentation somewhere. Where would you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code change is done. Docs change is not. What page should that go on? IO/NetCDF?
I'm a little confused on why these are in encoding and not in attrs. |
This is briefly mentioned above, in If you feel strongly to the contrary, there's an idea at the top of this thread for getting For |
Switched to use Re: If so, would it be sufficient to change their |
It isn't just MetPy though. I'm sure there's existing code relying on adding
I'm +1 on this but I wonder whether saving them in We could define Currently I see What do you think? |
On 5/31/2019 11:50 AM, dcherian wrote:
It isn't just MetPy though. I'm sure there's existing code relying on
adding |grid_mapping| and |bounds| to |attrs| in order to write
CF-compliant files. So there's a (potentially big) backward
compatibility issue. This becomes worse if in the future we keep
interpreting more CF attributes and moving them to |encoding| :/.
At present, the proper, CF-compliant way to do this is to have both
|grid_mapping| and |bounds| variables in |data_vars|, and maintain the
attributes yourself, including making sure the variables get copied into
the result after relevant |ds[var_name]| and |ds.sel(axis=bounds)|
operations.
If you decide to move these variables to |coords|, the |bounds| variables will still get dropped on any subsetting operation, including those where the relevant axis was retained, the |grid_mapping| variables will be included in the result of all subsetting operations (including pulling out, for example, a time coordinate), and both will be included in some |coordinates| attribute when written to disk, breaking CF compliance.
This PR only really addresses getting these variables in |coords| initially and
keeping them out of the global |coordinates| attribute when writing to disk.
Since I'm doing this primarily to get grid_mapping and bounds
variables out of ds.data_vars.
I'm +1 on this but I wonder whether saving them in |attrs| and using
that information when encoding coordinates would be the more pragmatic
choice.
You have a point about |grid_mapping|, but applying the MetPy approach
of saving the information in another, more directly useful format
(|cartopy.Projection| instances) immediately after loading the file
would be a way around that.
For |bounds|, I think |pd.PeriodIndex| would be the most natural
representation for time, and |pd.IntervalIndex| for most other 1-D cases,
but that still leaves |bounds| for two-or-more-dimensional coordinates.
That's a design choice I'll leave to the maintainers.
We could define |encoding| as containing a specified set of CF
attributes that control on-disk representation such as |units|,
|scale_factor|, |contiguous| etc. and leaving everything else in
|attrs|. A full list of attributes that belong in |encoding| could be in
the docs so that downstream packages can fully depend on this behaviour.
Currently I see |coordinates| is interpreted and moved to |encoding|. In
the above proposal, this would be left in |attrs| but its value would
still be interpreted if |decode_coords=True|.
What do you think?
At present, |set(ds[var_name].attrs["coordinates"].split())| and
|set(ds[var_name].coords) - set(ds[var_name].indexes[dim_name])|
would be identical, since the |coordinates| attribute is essentially
computed from the second expression on write.
Do you have a use case in mind where you need specifically the list of
CF auxiliary coordinates, or is that just an example of something that
would change under the new proposal? I assume |units| would be moved to
|encoding| only for |datetime64[ns]| and |timedelta64[ns]| variables.
|
I just noticed pandas.PeriodIndex would be an alternative to pandas.IntervalIndex for time data if which side the interval is closed on is largely irrelevant for such data. Is there an interest in using these for 1D coordinates with bounds? I think |
The test failures seem to all be due to recent changes in Is sticking the |
My preference is for attrs but someone else should weigh in. cc @pydata/xarray Maybe @dopplershift , as MetPy maintainer has thoughts on this matter? |
As a downstream user, I just want to be told what to do (assuming So to clarify: is this about whether they should be in one spot or the other? Or is it about having |
I think the choice is between If it helps lean your decision one way or the other, |
Thanks for the info. Based on that, I lean towards I think a better rationale, though, would be to formalize the role of |
We could probably make it a rule that encoding gets preserved in exactly
the same way as attrs. I think there are some old issues about formalizing
metadata conventions...
…On Wed, Mar 11, 2020 at 2:19 PM Ryan May ***@***.***> wrote:
Thanks for the info. Based on that, I lean towards attrs.
I think a better rationale, though, would be to formalize the role of
encoding in xarray.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2844 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVVUMBSZNECAM74HF4TRG757BANCNFSM4HAPJB7A>
.
|
Use two backticks for monospace font. Single backticks trop you into the default role. Co-authored-by: Deepak Cherian <[email protected]>
I'm not just testing ``grid_mapping`` and ``bounds`` here, so I should have the name reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @DWesl just pushed a small docs commit. I think this is ready to merge. Thank you for your patience.
Looks good to me. I was wondering where those docstrings were. |
I unfortunately have not been following along with the recent developments in this PR, so these may have already been previously covered. Sorry if that is the case! However, earlier on in the development of this PR, there were some substantial concerns about backwards compatibility (e.g., for libraries like MetPy that currently rely on |
:( yes this is currently backward incompatible (so the next release will bump major version). There's reluctance to add yet another One option is to change |
…_and_bounds_as_coords * upstream/master: (51 commits) Ensure maximum accuracy when encoding and decoding cftime.datetime values (pydata#4758) Fix `bounds_error=True` ignored with 1D interpolation (pydata#4855) add a drop_conflicts strategy for merging attrs (pydata#4827) update pre-commit hooks (mypy) (pydata#4883) ensure warnings cannot become errors in assert_ (pydata#4864) update pre-commit hooks (pydata#4874) small fixes for the docstrings of swap_dims and integrate (pydata#4867) Modify _encode_datetime_with_cftime for compatibility with cftime > 1.4.0 (pydata#4871) vélin (pydata#4872) don't skip the doctests CI (pydata#4869) fix da.pad example for numpy 1.20 (pydata#4865) temporarily pin dask (pydata#4873) Add units if "unit" is in the attrs. (pydata#4850) speed up the repr for big MultiIndex objects (pydata#4846) dim -> coord in DataArray.integrate (pydata#3993) WIP: backend interface, now it uses subclassing (pydata#4836) weighted: small improvements (pydata#4818) Update related-projects.rst (pydata#4844) iris update doc url (pydata#4845) Faster unstacking (pydata#4746) ...
Updated to implement this proposal after getting OK at the previous meeting. @DWesl can you take a look at the most recent changes please? |
I think this looks good. |
Great I'll merge before the next release if no one else has comments. Thanks @DWesl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great!
Does anyone know why the xr.open_dataset(....)
call is echoed in the warning message. Is this intentional? Cc @dcherian @DWesl
In [4]: ds = xr.open_dataset('/home/abanihi/Downloads/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc', decode_c
...: oords="all")
<ipython-input-4-ccf90ed4a433>:1: UserWarning: Variable(s) referenced in cell_measures not in variables: ['areacella']
ds = xr.open_dataset('/home/abanihi/Downloads/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc', decode_coords="all")
In [5]: ds
Out[5]:
<xarray.Dataset>
Dimensions: (bnds: 2, lat: 145, lon: 192, time: 60)
Coordinates:
* time (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00
time_bnds (time, bnds) datetime64[ns] ...
* lon (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
lon_bnds (lon, bnds) float64 ...
* lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
lat_bnds (lat, bnds) float64 ...
Dimensions without coordinates: bnds
Data variables:
pr (time, lat, lon) float32 ...
It seems you've already figured this out, but for anyone else with this question, the repeat of the call on that file is part of the warning that the file does not have all the variables the attributes refer to. You can fix this by recreating the file with the listed variables added ( |
This support was added in pydata/xarray#2844 and handles converting the grid_mapping variable to a coordinate in xarray itself, which was incompatible with some assumptions in parse_cf(). Add some handling for the case where the grid_mapping. Also add decode_coords='all' to our full test suite and adjust a few tests as necessary.
This support was added in pydata/xarray#2844 and handles converting the grid_mapping variable to a coordinate in xarray itself, which was incompatible with some assumptions in parse_cf(). Add some handling for the case where the grid_mapping. Also add decode_coords='all' to our full test suite and adjust a few tests as necessary.
I prefer having these as coordinates rather than data variables.
This does not cooperate with slicing/pulling out individual variables.
grid_mapping
should only be associated with variables that havehorizontal dimensions or coordinates.
bounds
should stay associated despite having more dimensions.I have not implemented similar functionality for the iris conversions.
An alternate approach to dealing with bounds (not used here) is to use a
pandas.IntervalIndex
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex
and use where the coordinate is within its cell to determine on which
side the intervals are closed (
x_dim == x_dim_bnds[:, 0]
correspondsto "left",
x_dim == x_dim_bnds[:, 1]
corresponds to "right", andanything else is "neither"). This would stay through slicing and
might already be used for
.groupby_bins()
, but would not generalizeto boundaries of multidimensional coordinates unless someone
implements a multidimensional generalization of
pd.IntervalIndex
whats-new.rst
for all changes andapi.rst
for new API