Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pint support for DataArray #3643

Merged
merged 33 commits into from
Apr 29, 2020
Merged

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Dec 18, 2019

This is part of the effort to add support for pint (see #3594) on DataArray. This depends on #3611, so we should merge that first.

  • Tests added
  • Passes black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

list of failing tests from #3594:

  • __init__: Needs unit support in IndexVariable, blocked by Explicit indexes in xarray's data-model (Future of MultiIndex) #1603
  • aggregation:
    • pint currently does not implement prod
    • calling np.median returns a numpy scalar instead of a DataArray because DataArray does not implement __array_function__
    • everything else works
  • bivariate ufuncs: fixed by ufunc calls with mixed args and upcast types hgrecco/pint#951 (but the unit won't be correct if the array is not passed first. See Make commutative operations commutative in both magnitude and units hgrecco/pint#1019)
  • numpy methods: rank only works with numpy.ndarray (uses bottleneck)
  • ffill, bfill: uses bottleneck, which does not support NEP-18?
  • where: works (but the choice of treating array(nan) and array(0) as special needs to be discussed)
  • interpolate_na: enforces the use of numpy.vectorize which does not support NEP18
  • combine_first: works, but does not test indexing (which should fail)
  • equals, identical: works (right now identical returns the same as equals)
  • drop_sel, sel, loc: indexes strip units (Explicit indexes in xarray's data-model (Future of MultiIndex) #1603)
  • interp: uses scipy.interpolate.interp1d which strips units
  • interp_like: same as interp
  • to_unstacked_dataset: blocked by IndexVariable
  • quantile: works (but needs a new pint version since it uses numpy.nanquantile
  • rolling: uses numpy.lib.stride_tricks.as_strided which is not supported by NEP18
  • rolling_exp: numbagg

@keewis keewis mentioned this pull request Dec 21, 2019
18 tasks
@keewis
Copy link
Collaborator Author

keewis commented Dec 24, 2019

While trying to get the tests to pass, I encountered a few issues:

  • the implementation of __array_ufunc__:
    for x in inputs + out:
    if not isinstance(x, self._HANDLED_TYPES + (SupportsArithmetic,)):
    return NotImplemented
    checks if it handles all types. However, the way it is implemented will cause it to return NotImplemented for duck arrays, even if it would be fine to treat them the same as ndarray. Adding and not hasattr(x, "__array_function__") will make my tests pass, but I thought I'd ask first.
  • I've been testing if the aggregation functions work, both as method and as function. This works fine for mean but not for median: calling np.median(da) returns a numpy scalar instead of the data array wrapping a 0d array returned by da.median(). I can't seem to find the reason why this happens, mostly because I don't know where to look.

@keewis
Copy link
Collaborator Author

keewis commented Dec 25, 2019

The aggregation tests fail because numpy.median needs __array_function__ to be implemented, but I think that's out of scope for this PR. I marked it as xfailing.

The bivariate ufunc test is fixed by hgrecco/pint#951.

shift could be fixed by using

        filler = np.full_like(np.resize(trimmed_data, shape), fill_value, dtype=dtype)

instead of

xarray/xarray/core/variable.py

Lines 1105 to 1112 in 651f27f

if isinstance(trimmed_data, dask_array_type):
chunks = list(trimmed_data.chunks)
chunks[axis] = (shape[axis],)
full = functools.partial(da.full, chunks=chunks)
else:
full = np.full
filler = full(shape, fill_value, dtype=dtype)
but I didn't check if that would be a bad for performance (and pint would have to implement np.resize). However, that would not work for dask arrays so I'm not sure what to do.

rolling uses bottleneck which does not play nice with pint, yet. I'd suggest using

       	if bottleneck_move_func	is not None and not isinstance(
	    self.obj.data, dask_array_type
	) and not hasattr(self.obj.data, "__array_function__"):

instead of

if bottleneck_move_func is not None and not isinstance(
self.obj.data, dask_array_type
):
If we go with that, we'd need pint to implement np.pad.

Edit: I added it, but this still requires np.pad. However, it makes a lot of other, unrelated tests fail, so I removed the commit again.

Unless I missed something, these should be the only issues left for pint support in DataArray (not counting the IndexVariable issue).

@keewis keewis force-pushed the pint-support-dataarray branch from 4863205 to 61c3eb9 Compare December 26, 2019 18:08
@keewis
Copy link
Collaborator Author

keewis commented Dec 26, 2019

@jthielen, it looks like this needs numpy.pad and numpy.resize in pint. I tried to implement it myself, but while numpy.resize is easy (add an entry to the implement_consistent_units_by_argument loop), that is not the case for numpy.pad. Regardless of whether my proposed fixes are actually the best way, numpy.pad is used in other places of the code base so we'd probably need both, anyway. What do you think?

@jthielen
Copy link
Contributor

@jthielen, it looks like this needs numpy.pad and numpy.resize in pint. I tried to implement it myself, but while numpy.resize is easy (add an entry to the implement_consistent_units_by_argument loop), that is not the case for numpy.pad. Regardless of whether my proposed fixes are actually the best way, numpy.pad is used in other places of the code base so we'd probably need both, anyway. What do you think?

No problem at all to add these! See hgrecco/pint#956.

bors bot added a commit to hgrecco/pint that referenced this pull request Dec 27, 2019
956: Add np.pad and np.resize implementations r=hgrecco a=jthielen

As requested in pydata/xarray#3643 (comment), this PR adds implementations for `np.pad` and `np.resize`, with accompanying tests.

- ~~Closes (insert issue number)~~ (#905 follow-up)
- [x] Executed ``black -t py36 . && isort -rc . && flake8`` with no errors
- [x] The change is fully covered by automated unit tests
- [x] Documented in docs/ as appropriate
- ~~Added an entry to the CHANGES file~~ (#905 follow-up)


Co-authored-by: Jon Thielen <[email protected]>
@keewis keewis mentioned this pull request Dec 28, 2019
2 tasks
@crusaderky
Copy link
Contributor

if you merge from master you should get all green on the upstream-dev test now

@keewis
Copy link
Collaborator Author

keewis commented Jan 30, 2020

most of these are failing tests related to this PR but it is on hold until #3706 and #3611 (in that order) have been merged.

xarray/core/variable.py Outdated Show resolved Hide resolved
@keewis keewis force-pushed the pint-support-dataarray branch from 54e6fde to 308df31 Compare March 11, 2020 22:13
@keewis
Copy link
Collaborator Author

keewis commented Mar 22, 2020

A few updates on the progress here:

The implementation of interpolate_na uses apply_ufunc with vectorize=True which in turn uses numpy.vectorize. numpy.vectorize is explicitly mentioned as deferred in NEP18 (i.e. not supported). I guess we can't support interpolate_na unless someone knows a way around that? Passing vectorize=False to apply_ufunc does make it work, though.

The failure of test_bivariate_ufunc is due to the way pint infers units, but whether or not that should matter was discussed in #3706 (but we didn't really reach a consensus there, I think?).

interp does work up until it calls scipy.interpolate.interp1d, which seems to strip the units?

reindex works for data, but the tests need a rewrite so they also test coords / dims with units (the same for the interp tests).

@max-sixty
Copy link
Collaborator

LGTM! Thanks as ever @keewis

@max-sixty
Copy link
Collaborator

not sure what the bot is checking, xarray/core/arithmetic.py doesn't have 1934 lines (the last line is 107) and black doesn't complain about indentation for any line

I put in an issue

@keewis
Copy link
Collaborator Author

keewis commented Apr 2, 2020

I put in an issue

thanks, @max-sixty.

@keewis
Copy link
Collaborator Author

keewis commented Apr 2, 2020

it seems we can't use __array_function__ to check if the type should be handled or not (using __array_function__ works, but that's by accident, I think, and stops working once/if we implement __array_function__on xarray objects).

I'd say it is pretty rare the other type knows what to do with a xarray object, so I think we should retry using the wrapped object. I don't know the reason for the current implementation, though, so I could be missing something. For reference, this only is a problem with functions that accept multiple values, like numpy.maximum. Any thoughts on this, @shoyer?

Edit: actually, that smells like a type hierarchy issue, as discussed in hgrecco/pint#845 and dask/dask#5329

@keewis
Copy link
Collaborator Author

keewis commented Apr 6, 2020

Ugh. Trying to add pint.Quantityto the list of handled types does not work since pint tries to add xarray to its known non-handled types – that means we have a circular import. I guess I'll xfail the bivariate_ufuncs tests, too.

To resolve this, the type registration mentioned in #1617 and

# TODO: allow extending this with some sort of registration system

might help?

@keewis keewis force-pushed the pint-support-dataarray branch from 2031b52 to 79430f0 Compare April 6, 2020 13:53
@dcherian
Copy link
Contributor

dcherian commented Apr 6, 2020

pint tries to add xarray to its known non-handled types

I don't know much about this stuff can't pint have just a list of handled types and return NotImplementedfor anything else?

@keewis
Copy link
Collaborator Author

keewis commented Apr 6, 2020

see dask/dask#5329 (comment). Basically, pint is high enough in the type hierarchy that there are too many types below it and it is easier for it to list those above it.

Actually, pint and xarray using different (conflicting) ways to check the support was the reason why I asked for advice in that issue.

@TomNicholas TomNicholas added the topic-arrays related to flexible array support label Apr 6, 2020
@jthielen
Copy link
Contributor

jthielen commented Apr 7, 2020

@keewis Is part of the specific problem that Pint has a guarded import of xarray when defining its upcast types? Would it help if it checked for the fully qualified class name instead?

@jthielen
Copy link
Contributor

jthielen commented Apr 7, 2020

Also, after glancing through all this, it seems like xarray is dealing inconsistently with the type casting hierarchy:

  • Construction/wrapping allowing __array_function__-defining duck arrays among other explictly listed additions (pandas.Index, dask arrays, etc., see
    def as_compatible_data(data, fastpath=False):
    )
  • Binary ops only denying based on xarray's internal hierarchy (Dataset, DataArray, Variable), otherwise deferring to underlying data
  • __array_ufunc__ allowing a list of supported types (
    _HANDLED_TYPES = (
    np.ndarray,
    np.generic,
    numbers.Number,
    bytes,
    str,
    ) + dask_array_type
    along with SupportsArithmetic)
  • (__array_function__ yet to be defined)

Should this be its own issue?

@keewis
Copy link
Collaborator Author

keewis commented Apr 7, 2020

Is part of the specific problem that Pint has a guarded import of xarray when defining its upcast types? Would it help if it checked for the fully qualified class name instead?

SupportsArithmetic is a base class of Variable, DataArray and Dataset, so if I try to import pint for type checking, pint can't use these classes, otherwise we'd have a circular import. Comparing fully qualified class names would work, but it might make sense to wait for a recommendation regarding my question.

Should this be its own issue?

Definitely, even if it's just to clarify why we accept different types in those functions: I think __init__ may convert its arguments so it supports more types than __array_ufunc__ (I don't know about the others, though). For the missing __array_function__ see #3917.

@dcherian
Copy link
Contributor

IMO @keewis you should merge this if you think its ready.

even if it's just to clarify why we accept different types in those functions: I think init may convert its arguments so it supports more types than array_ufunc (I don't know about the others, though).

open an issue for this?

@keewis
Copy link
Collaborator Author

keewis commented Apr 29, 2020

see #3950

Merging: sure, I just wanted to avoid merging my own PRs without a final review

@keewis keewis merged commit 3820fb7 into pydata:master Apr 29, 2020
@keewis keewis deleted the pint-support-dataarray branch April 29, 2020 16:12
dcherian added a commit to dcherian/xarray that referenced this pull request May 1, 2020
* upstream/master: (39 commits)
  Pint support for DataArray (pydata#3643)
  Apply blackdoc to the documentation (pydata#4012)
  ensure Variable._repr_html_ works (pydata#3973)
  Fix handling of abbreviated units like msec (pydata#3998)
  full_like: error on non-scalar fill_value (pydata#3979)
  Fix some code quality and bug-risk issues (pydata#3999)
  DOC: add pandas.DataFrame.to_xarray (pydata#3994)
  Better chunking error messages for zarr backend (pydata#3983)
  Silence sphinx warnings (pydata#3990)
  Fix distributed tests on upstream-dev (pydata#3989)
  Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956)
  keep attrs in interpolate_na (pydata#3970)
  actually use preformatted text in the details summary (pydata#3978)
  facetgrid: Ensure that colormap params are only determined once. (pydata#3915)
  RasterioDeprecationWarning (pydata#3964)
  Empty line missing for DataArray.assign_coords doc (pydata#3963)
  New coords to existing dim (doc) (pydata#3958)
  implement a more threadsafe call to colorbar (pydata#3944)
  Fix wrong order of coordinate converted from pd.series with MultiIndex (pydata#3953)
  Updated list of core developers (pydata#3943)
  ...
dcherian added a commit to kmuehlbauer/xarray that referenced this pull request May 9, 2020
…k-issues

* upstream/master: (22 commits)
  support darkmode (pydata#4036)
  Use literal syntax instead of function calls to create the data structure (pydata#4038)
  Add template xarray object kwarg to map_blocks (pydata#3816)
  Transpose coords by default (pydata#3824)
  Remove broken test for Panel with to_pandas() (pydata#4028)
  Allow warning with cartopy in docs plotting build (pydata#4032)
  Support overriding existing variables in to_zarr() without appending (pydata#4029)
  chore: Remove unnecessary comprehension (pydata#4026)
  fix to_netcdf docstring typo (pydata#4021)
  Pint support for DataArray (pydata#3643)
  Apply blackdoc to the documentation (pydata#4012)
  ensure Variable._repr_html_ works (pydata#3973)
  Fix handling of abbreviated units like msec (pydata#3998)
  full_like: error on non-scalar fill_value (pydata#3979)
  Fix some code quality and bug-risk issues (pydata#3999)
  DOC: add pandas.DataFrame.to_xarray (pydata#3994)
  Better chunking error messages for zarr backend (pydata#3983)
  Silence sphinx warnings (pydata#3990)
  Fix distributed tests on upstream-dev (pydata#3989)
  Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants