-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258
Conversation
Hmm. I think this has all the functionality we need, but it would also be good to reconcile the API design here with |
|
I agree with @jorisvandenbossche this should just replace asof entirely |
I think you should extract the indexer into another method maybe asof_indexer (and asof just calls that) |
I'm returning an array, based on how I'm currently modeling this function as a sort of combination of |
@shoyer yes this should be 2 methods my 2c, is that we already have |
OK, latest commits include the split to 2 methods and handle missing values properly (that was not being done right before). The differences are:
I was able to make The other use of asof is in the method |
Where do you find the time man! |
One more design question: How should we handle partial string matching for The current design casts the input to a Timestamp and finds the nearest label. This seems reasonable, but it's not consistent with Here's an concrete example: idx = pd.to_datetime(['2000-01-31', '2000-02-28'])
print idx.get_loc_nearest('2000-02', side='left')
print idx.get_loc_nearest('2000-02', side='nearest') With the current implementation, each of these would print 0. This is consistent with the current behavior of |
OK, it looks like I need to make my own API decision :). I've settled on using the result of This is also a small breaking API change for Any other comments? I think this is possibly ready to merge (once rebased). |
Would it be much trouble to not have this API change in |
|
||
return label | ||
See also: get_loc_nearest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a See also
section?
See also
--------
get_loc_nearest : small explanation what this is
then you get automatic linking in the html pages
A question: when you have a partial string, what is the 'nearest' value, so what will |
@jorisvandenbossche We definitely could include a work around to avoid the change for Note the Looking into the history a little more: this test appeared in PR #8246 by @TomAugspurger. There, it appears that the issue was something else entirely -- |
@jorisvandenbossche The "nearest" value from |
I haven't really looked at this in depth - but the string slicing should be analogous to partial string indexing with the same semantics at first glance it appears that they are the same? |
@jreback Yes, the nearest partial string indexing of a DatetimeIndex is exactly equivalent to partial partial string indexing if there is a match; otherwise, it casts the string label to a Timestamp and finds the nearest value. |
@@ -1256,6 +1256,19 @@ def get_loc(self, key): | |||
except (KeyError, ValueError): | |||
raise KeyError(key) | |||
|
|||
def get_loc_nearest(self, key, side='nearest'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should also get a docstring (there are normally some decorators for adding a docstring)
@shoyer on Some other things:
I know you modelled the functions to the existing
|
I think this impl is fine. The signal of -1 is typical in the way the indexers work. These are by definition By definition these are always indexers, by position, always, full-stop. These only have a single semantic for the output results. @shoyer I would think this API simpler if you do this;
and allow I think its just think its adding API noise otherwise. These conceptually do exactly the same thing. (you certainly could have an internal |
@jreback This is a good point about incorporating this into the |
It appears that the existing |
@jorisvandenbossche I agree, the current index API is a little strange -- some refactoring to provide a unified API and to untangle I'm implementing only the low level methods now, but eventually we should figure out a higher level API for nearest neighbor lookups in pandas. |
new_index, indexer = axis_values.reindex(labels, method, level, | ||
limit=limit) | ||
return self._reindex_with_indexers( | ||
{axis: [new_index, indexer]}, method=method, fill_value=fill_value, | ||
limit=limit, copy=copy) | ||
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right, method
is extraneous here (above too?)
minor comments, lgtm otherwise. |
pad / ffill: find the PREVIOUS index value if no exact match. | ||
backfill / bfill: use NEXT index value if no exact match | ||
nearest: use the NEAREST index value if no exact match. Tied | ||
distances are broken by preferring the larger index value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the html docs, you have to make bullet points of these (use -
or *
before each line) to have this render properly.
The same for the other docstrings.
docstring of reindex should still be updated |
Yep, the docstrings need a bit of work. I'll do that next week -- going camping and will be offline for the next three days. |
OK, I updated docstrings for reindex, reindex_like, get_indexer and get_loc, and they appear to generate well-formatted API docs. Merge ready, I think? |
s = Series(np.arange(10)) | ||
target = [0.1, 0.9, 1.5, 2.0] | ||
actual = s.reindex(target, method='nearest') | ||
expected = Series(np.around(target).astype(int), target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astype these to int64 otherwiset hey will fail on windows.
lgtm
|
@jreback Fixed the dtypes. @jorisvandenbossche @jreback I was planning on updating the basics.rst docs later (once "nearest" works for fillna), but it was simple enough to do it now. Let me know what you think. In particular, I removed some musing from Wes about the possibility of more sophisticated interpolation logic. |
lgtm |
@jorisvandenbossche do you want to take another look? otherwise I will merge... |
OK, I'm going to merge this. Doc issues can be dealt with in a followup PR if necessary. |
API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc
nice work @shoyer ! |
Is the intent for this to return the nearest valid observation? In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'x': range(5)})
In [4]: df.x[0] = np.nan
In [5]: df.reindex([0.2, 1.8, 3.5], method='nearest')
Out[5]:
x
0.2 NaN
1.8 2
3.5 4 I'd expect x=1 at 0.2, given the Docs. |
@mjroth the intent is for it to return values at the nearest index label, similarly to the existing behavior for forward and backward filling reindexing methods. If you don't want to include NA values, you'll need to do a dropna first. |
Fixes #8845
Currently only an index method; I'm open to ideas for how to expose it more
directly -- maybe
df.loc_nearest
?.In particular, I think this is usually a more useful way to do indexing for
floats than pandas's reindex based
.loc
, because floats are inherentlyimprecise.
CC @jreback @jorisvandenbossche @immerrr @hugadams