-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Index.get_indexer_non_unique misbehaves with multiple nan #35498
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, some comments
Thanks for the comments, @arw2019! I just committed the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
Might want to move the release note to 1.2 - I think 1.1.1 is intended for regressions from 1.0.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run the indexing benchmarks to check for performance regressions?
@WillAyd I added a check to disambiguate between 0 and np.nan. I believe the CI failure is unrelated ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge master and a few comments.
@jbrockmendel I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged? |
I was hoping you'd port whatever parts of my PR are useful and then I'll close mine. You did most of the work here. |
Thanks for the guidance, I will definitely port your work into this PR! |
@jbrockmendel I liked the simplicity of your implementation better, so I decided to use it instead of what I currently have. I added a test case for the matching-but-not-identical nans. also, I rebased this PR so there's a cleaner commit history. |
@jreback ran the index asv: some performance decreased:
|
LGTM @alexhlim can you merge master |
LGTM pending green cc @jreback |
@jbrockmendel noticed that CI / Checks is failing:
Are these errors something that my PR caused? If not, should I wait until these errors are resolved on master and re-merge? |
Looks like those are unrelated, #42716 |
838ceb4
to
f5e6b41
Compare
@jbrockmendel @jreback pinging b/c green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM cc @jreback
thanks @alexhlim very nice, will merge on green. |
thanks @alexhlim |
Thank you everyone, especially @jbrockmendel and @jreback for guiding me through this PR. I definitely learned a lot! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Looking at the implementation in index.pyx, I notice that when
[np.nan]
is passed toget_indexer_non_unique
, the code was not able to get passed the__contains__
method for stargets (line 314).Further testing (python=3.7.7, numpy=1.18.5):
Case 1 and 2 results differ because of the dtype of nan (U3 vs float64).
Upon further research, I figured out that np.nan != np.nan as per IEEE (when it is a float) and creating a set from a np.array could lead to some bizarre results (numpy/numpy#9358). Also, since a dictionary is the main data structure in this method to keep track of the targets indices, I don't think it is ideal to use nans as keys (https://stackoverflow.com/questions/6441857/nans-as-key-in-dictionaries).
I thought it would be appropriate to replace nans (with 0s) in the targets and values arrays in order to avoid the problems stated above. When considering where to replace the nans, I thought of two places where it could potentially happen:
get_indexer_non_unique
(/pandas/core/indexes/base.py)get_indexer_non_unique
(/pandas/_libs/index.pyx)Including the changes in 1. would mean overwriting the the Index object's properties, so I decided to include the changes in 2.
FYI -- I wasn't sure if the test I included was in the correct file. Please let me know if you would like this test to be in another file.