Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should sel with slice objects care about underlying coordinate order? #1613

Open
leeviannala opened this issue Oct 6, 2017 · 22 comments
Open

Comments

@leeviannala
Copy link
Contributor

If I have a xarray DataArray, that has x-coordinates in descending order, and I try to crop the DataArray using
da.sel(x=slice(10,50)), it gives me empty DataArray. If I use
da.sel(x=slice(50,10)), it finds everything it is supposed to find.

I think it would be useful to be able to use this feature without knowing the exact order of the DataArray. Is this intentional?

@shoyer
Copy link
Member

shoyer commented Oct 9, 2017

This was actually intentional, though you are certainly not the first person to complain about this.

Xarray's behavior mirrors pandas. The justification for this logic is that sel/loc would work like normal indexing [] on a Python list, and if you supply slice limits in the opposite order on a Python list, you get an empty result.

Probably the cleanest solution possible would be a separate selection object for xarray that doesn't care about order, e.g., da.sel(x=xarray.Between(10, 50)).

@rgommers
Copy link

rgommers commented Sep 10, 2018

The justification for this logic is that sel/loc would work like normal indexing [] on a Python list, and if you supply slice limits in the opposite order on a Python list, you get an empty result.

Given that sel is label/value-based, I don't see how the analogy with positional list-based indexing applies. This behavior looks like a functional bug. If you have decided to keep it as is, I'd suggest at the very least to add a warning if an empty dataset is returned because values in an attribute are ordered high...low.

EDIT: also then best to close this issue as wontfix

@shoyer
Copy link
Member

shoyer commented Sep 11, 2018

To clarify: xarray behavior doesn't just mirror pandas here -- it actually directly uses pandas to figure out the mapping from labeled slices to integer slices.

So I'm not opposed to changing this (I agree this trips people up often) but we should try to change it upstream if possible.

@rgommers
Copy link

Ah okay, that makes sense. I'm sure there's a related pandas issue (or many), will try to find that later.

@rgommers
Copy link

The only related issues I can find are:

They don't look identical though. Don't really have the time to dive into that further now.

@rgommers
Copy link

In case it helps anyone else, I ended up doing:

    # Note that xarray is fiddly with indexing - if x or y values are ordered
    # high to low, then the slice bounds need to be reversed.  So check that
    x_ordered_low2high = data.x.values[-1] - data.x.values[0] > 0
    y_ordered_low2high = data.y.values[-1] - data.y.values[0] > 0

    if x_ordered_low2high:
        x_index = slice(lower_lon, upper_lon)
    else:
        x_index = slice(upper_lon, lower_lon)

    if y_ordered_low2high:
        y_index = slice(lower_lat, upper_lat)
    else:
        y_index = slice(upper_lat, lower_lat)

    subset = data.sel(x=x_index, y=y_index)

@shoyer
Copy link
Member

shoyer commented Sep 17, 2018

Consider using index.is_monotonic_ascending and index.is_monotonic_descending instead of subtracting the first few values -- those are exactly the checks that pandas uses.

I also did a little more searching and found that this has come up in the xarray issue tracker before: #1465


Thinking about this a little more, I'm remembering the reasoning for pandas working this way:

  1. Integer slicing by position and by label with positive integers should work the same for an index with values given by range(N).
  2. Positional indexing in Python x[a:b] returns all values between a and b in order. Indexing like x[high:low] with high > low returns an empty list.
  3. Label based indexing in pandas has df.loc[df.index[i] : df.index[j]] equivalent to df.iloc[i : j+1] (arguably this is a mistake; e.g., Slicing inconsistent for coordinates and dimensions without coordinates #1492 and other discussion that I can't find right now :) ).
  4. Indexing a monotonic decreasing index therefore should work for decreasing labels.
  5. If slicing a monotonic increasing index only works for increasing but not decreasing labels, then it would be weird if indexing a monotonic decreasing index works for both increasing and decreasing labels.

@rgommers
Copy link

Consider using index.is_monotonic_ascending and index.is_monotonic_descending instead of subtracting the first few values -- those are exactly the checks that pandas uses.

Thanks, that's nicer, will do. And thanks for digging up the background/rationales.

(1)Integer slicing by position and by label with positive integers should work the same for an index with values given by range(N).

This I don't think I agree with. Slicing by position and by label are semantically very different operations.

(2) is correct, but irrelevant to label-based indexing.

(3) yes, agree that's a mistake

(4) indeed

(5) I'd say that it's in practice less important, because users normally won't do slice(high, low), but for consistency I agree that there should be symmetry in behaviour. Making slice(high, low) return the same as slice(low, high) on a monotonic increasing index seems reasonable.

Additionally: arguably monotonicity should not be required. When one uses labels, the semantics are clear without monotonicity. This doesn't have a position-based equivalent.

@shoyer
Copy link
Member

shoyer commented Sep 17, 2018

Additionally: arguably monotonicity should not be required. When one uses labels, the semantics are clear without monotonicity.

Requiring monotonicity lets us guarantee that the result is a NumPy view rather than possibly being a copy.

@rgommers
Copy link

Requiring monotonicity lets us guarantee that the result is a NumPy view rather than possibly being a copy.

Sure, but if a users happens to have non-monotonic data it just requires her to then make that copy first anyway. Still a good thing overall for performance, but there'll be cases where it's just an extra thing to understand for the user without any performance gain.

Anyway, the non-monotonic case is less relevant, because it's harder to run into in practice. The decreasing case however is easy - there is standard geo software (looking at you ArcGIS) that can write geoTiff's with monotonic decreasing indices. That's how I ran into this. Rewriting multi-GB source data that I didn't produce is not an option, so I'm left with the manual monotonicity checks and juggling label-based slices.

@fmaussion
Copy link
Member

there is standard geo software (looking at you ArcGIS) that can write geoTiff's with monotonic decreasing indices

This is very frequent for climate models also (lat arrays are frequently upside down). An xarray (or upstream) solution would be very useful indeed, but possibly there are other things to consider, like being "too magical".

xarray makes a very good job at handling decreasing coordinates (i.e. when plotting), but soon or late users might need to know how they data is ordered, and indexing might be the right time to warn them about it when they do something wrong: usually an empty array is unlikely to be left unnoticed, so that the current behavior might help to find bugs in user code.

@shoyer
Copy link
Member

shoyer commented Sep 18, 2018

Making slice(high, low) return the same as slice(low, high) on a monotonic increasing index seems reasonable.

Agreed, I don't think many pandas users would object to this.

@stale
Copy link

stale bot commented Aug 22, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Aug 22, 2020
@stale stale bot closed this as completed Sep 22, 2020
@dcherian dcherian reopened this Sep 22, 2020
@stale stale bot removed the stale label Sep 22, 2020
@dcherian dcherian changed the title Improvement suggestion for sel Should sel with slice objects care about underlying coordinate order? Mar 3, 2022
@shoyer
Copy link
Member

shoyer commented Mar 3, 2022

This is probably worth fixing if possible in a straightforward way. I don't think anyone is well served by matching the behavior of Python list indexing here -- it's a strange edge that case that indexing a list like x[5:0] returns an empty list.

@dcherian
Copy link
Contributor

dcherian commented Mar 3, 2022

What would be a good solution?

  1. Make sel implicitly ignore order for slice objects.
  2. Add a option to sel for explicitly ignoring slice order.
    1. add a new kwarg
    2. Add a new xr.Between object

@brianmapes
Copy link

brianmapes commented Mar 3, 2022

How about: add one code line at the top of slice that properly orders the exactly two arguments (in other words, reverses the order if necessary)? I haven't studied the code at all, this may be naive. This may the same as 1. above (make sel ignore order). Ideally the default would be silently improved, rather than creating something optional that users would have to make a new habit of, like an additional kwarg or new syntax.

@shoyer
Copy link
Member

shoyer commented Mar 3, 2022

One complication with using sel() with slice objects is that you can do selection over non-monotonic indexes, merely based on matching bounds:

>>> data = xarray.DataArray([1, 2, 3, 4, 5], dims=['x'], coords=[[5, 1, 4, 3, 2]])
>>> data
<xarray.DataArray (x: 5)>
array([1, 2, 3, 4, 5])
Coordinates:
  * x        (x) int64 5 1 4 3 2
>>> data.sel(x=slice(1, 3)))
<xarray.DataArray (x: 3)>
array([2, 3, 4])
Coordinates:
  * x        (x) int64 1 4 3

If we change the semantics of slice in sel() to do filtering rather than be concerned about order (which does seem much less useful), we should probably deprecate the handling of non-monotonic ascending or descending indexes.

Alternatively, we could either do the dedicated indexing object like xarray.Between(lower, upper) or have a dedicated method for selecting between values, e.g., perhaps data.sel_between(x=(1, 3)) or data.sel_bounds(x=(1, 3)).

@brianmapes
Copy link

@shoyer, selection over non-monotonic indexes, merely based on matching bounds is a case I never considered... probably a rare one, which I for one would happily sacrifice.

New syntax like your first suggestion, .Between instead of .slice if I understand correctly, could be appealing too: The word 'slice' has never seemed terribly intuitive to this user, and 'between' is high in an alphabetical list of tips/suggestions a good IDE might bring up with the tab key or whatever.

@brianmapes
Copy link

How about an informative warning when returning an empty set, as a minimal change? What's the path to an action item here?

@shoyer
Copy link
Member

shoyer commented Mar 8, 2022

Another challenge with changing the meaning of slice is handling partial slices, e.g., what does slice(500, None) mean? With a monotonic decreasing index, that would select values below 500, but ignoring underlying coordinate order it would presumably mean selecting values above 500.

I think the separate new API (e.g., xarray.Between or .sel_between()) is probably a better idea.

@brianmapes
Copy link

Agreed, new rather than redefine existing meanings.
Who has the chops and the cred to implement it? Not quite me...

@benbovy
Copy link
Member

benbovy commented Aug 30, 2023

Alternatively, we could either do the dedicated indexing object like xarray.Between(lower, upper) or have a dedicated method for selecting between values, e.g., perhaps data.sel_between(x=(1, 3)) or data.sel_bounds(x=(1, 3)).

xref #7099 about generic API for indexing objects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants