BUG: Index.get_indexer_non_unique misbehaves with multiple nan #35498

alexhlim · 2020-07-31T19:24:51Z

closes BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan #35392
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Looking at the implementation in index.pyx, I noticed that when [np.nan] is passed to get_indexer_non_unique, the code never gets past the __contains__ check against stargets (line 314).

Further testing (python=3.7.7, numpy=1.18.5):

import numpy as np

# values has dtype <U32, so its nans are stored as the string 'nan'
values = np.array([np.nan, 'var1', np.nan])

# Case 1: Does not work -> prints nothing
# nan dtype: np.float64, ndarray dtype: np.float64
case_1 = np.array([np.nan])

# Case 2: Works -> prints 0, 1, 2
# nan dtype: U3, ndarray dtype: <U32
case_2 = np.array([np.nan, 'var1'])

# Run both cases; previously the second assignment overwrote the first
for targets in (case_1, case_2):
    stargets = set(targets)
    for i, v in enumerate(values):
        if v in stargets:
            print(i)

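Cases 1 and 2 give different results because of the dtype nan takes in targets: it stays a float64 in Case 1, but is coerced to the string 'nan' (U3) in Case 2, where it compares equal to the 'nan' strings in values.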
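To make the dtype difference concrete, here is a small sketch (my own variable names, not code from the PR) that checks what NumPy does with nan in each kind of array and how set membership behaves afterwards. It assumes only that NumPy is installed.

```python
import numpy as np

float_targets = np.array([np.nan])          # stays float64
mixed_targets = np.array([np.nan, "var1"])  # coerced to a unicode string dtype

print(float_targets.dtype.kind)  # 'f'
print(mixed_targets.dtype.kind)  # 'U'

# In the string array, nan has become the ordinary string 'nan'.
first_mixed = str(mixed_targets[0])

# Iterating a float64 array creates a fresh scalar per element, and nan never
# compares equal, so membership in a set built from the array fails.
float_values = np.array([np.nan, np.nan])
float_hits = [v in set(float_targets) for v in float_values]

# Strings compare by value, so 'nan' is found like any other string.
string_hits = [v in set(mixed_targets) for v in mixed_targets]

print(float_hits)   # [False, False]
print(string_hits)  # [True, True]
```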

Upon further research, I figured out that np.nan != np.nan as per IEEE (when it is a float) and creating a set from a np.array could lead to some bizarre results (numpy/numpy#9358). Also, since a dictionary is the main data structure in this method to keep track of the targets indices, I don't think it is ideal to use nans as keys (https://stackoverflow.com/questions/6441857/nans-as-key-in-dictionaries).
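The set and dict behaviour can be reproduced without NumPy. The sketch below is illustrative only (the names are mine): Python containers check identity before equality, so a NaN is found only when the very same object is looked up again.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# IEEE 754: NaN compares unequal to everything, including itself.
self_equal = nan_a == nan_a

# Containers try identity first, so the same object is found...
found_same = nan_a in {nan_a}
# ...but a different NaN object is not, because equality fails.
found_other = nan_b in {nan_a}

# The same applies to dict keys, which is why nan makes a fragile key.
positions = {nan_a: [0]}
key_found = nan_b in positions

# An explicit check is the reliable way to recognise a NaN.
is_missing = math.isnan(nan_b)

print(self_equal, found_same, found_other, key_found, is_missing)
# False True False False True
```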

I thought it would be appropriate to replace nans (with 0s) in the targets and values arrays in order to avoid the problems stated above. When considering where to replace the nans, I thought of two places where it could potentially happen:

1. In get_indexer_non_unique (/pandas/core/indexes/base.py)
2. In get_indexer_non_unique (/pandas/_libs/index.pyx)

Making the change in the first location (base.py) would mean overwriting the Index object's properties, so I decided to make it in the second (index.pyx).
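A rough pure-Python sketch of the replacement idea follows. It is not the Cython code in index.pyx; positions_with_sentinel and its sentinel parameter are hypothetical names used only for illustration. It also shows the weak spot of the approach: if the sentinel is a value that can legitimately appear in the data, real entries get mixed up with the nans, which is what the later commits in this thread about differentiating 0 and np.nan (and then using -1) address.

```python
import math
from collections import defaultdict


def positions_with_sentinel(values, targets, sentinel):
    """Hypothetical helper: swap NaN for `sentinel`, then collect positions."""

    def clean(x):
        # Only floats can be NaN; strings and ints pass through unchanged.
        return sentinel if isinstance(x, float) and math.isnan(x) else x

    wanted = {clean(t) for t in targets}
    found = defaultdict(list)
    for i, v in enumerate(values):
        key = clean(v)
        if key in wanted:
            found[key].append(i)
    return dict(found)


nan = float("nan")

# Works: both nans are found under the sentinel key.
print(positions_with_sentinel([nan, "var1", nan], [nan], sentinel=0))
# {0: [0, 2]}

# Collides: a real 0 in the data is reported as if it were nan.
print(positions_with_sentinel([nan, 0, nan], [nan], sentinel=0))
# {0: [0, 1, 2]}
```

Any fixed sentinel has the same problem in principle, so it only pushes the collision to a less common value.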

FYI -- I wasn't sure if the test I included was in the correct file. Please let me know if you would like this test to be in another file.

arw2019

looks good, some comments

pandas/_libs/index.pyx

pandas/tests/base/test_misc.py

alexhlim · 2020-07-31T21:18:51Z

Thanks for the comments, @arw2019! I just committed the changes.

arw2019

looks good

Might want to move the release note to 1.2 - I think 1.1.1 is intended for regressions from 1.0.5

WillAyd

Can you run the indexing benchmarks to check for performance regressions?

pandas/_libs/index.pyx

pandas/tests/indexes/test_base.py

pandas/_libs/index.pyx

pandas/tests/indexes/test_base.py

pandas/_libs/index.pyx

pandas/tests/indexes/test_indexing.py

alexhlim · 2020-08-22T20:47:07Z

@WillAyd I added a check to disambiguate between 0 and np.nan. I believe the CI failure is unrelated (test_s3_roundtrip_for_dir). Let me know what you think!

jreback

can you merge master? also left a few comments.

pandas/_libs/index.pyx

pandas/tests/indexes/test_indexing.py

pandas/_libs/index_class_helper.pxi.in

alexhlim · 2021-06-29T03:26:50Z

@jbrockmendel I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

jbrockmendel · 2021-06-29T15:19:06Z

> I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

I was hoping you'd port whatever parts of my PR are useful and then I'll close mine. You did most of the work here.

alexhlim · 2021-06-30T04:55:42Z

> > I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

> I was hoping you'd port whatever parts of my PR are useful and then I'll close mine. You did most of the work here.

Thanks for the guidance, I will definitely port your work into this PR!

alexhlim · 2021-07-03T17:20:26Z

@jbrockmendel I liked the simplicity of your implementation better, so I decided to use it instead of what I currently have. I added a test case for the matching-but-not-identical nans. also, I rebased this PR so there's a cleaner commit history.
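For readers following along, here is a pure-Python sketch of the general idea of keeping nan positions in their own list instead of using nan as a dict key. It is only an illustration under my own naming (get_positions_non_unique and is_nan are hypothetical) and is not the Cython implementation that was merged. It returns an (indexer, missing) pair in the spirit of get_indexer_non_unique, and it treats matching-but-not-identical nans as equal.

```python
import math


def is_nan(x):
    # Only floats can be NaN; np.float64 is a float subclass and qualifies too.
    return isinstance(x, float) and math.isnan(x)


def get_positions_non_unique(values, targets):
    """Hypothetical sketch: positions of each target in values, -1 if absent."""
    positions = {}      # non-NaN value -> list of positions
    nan_positions = []  # all NaN positions, kept outside the dict
    for i, v in enumerate(values):
        if is_nan(v):
            nan_positions.append(i)
        else:
            positions.setdefault(v, []).append(i)

    indexer, missing = [], []
    for j, t in enumerate(targets):
        hits = nan_positions if is_nan(t) else positions.get(t, [])
        if hits:
            indexer.extend(hits)
        else:
            indexer.append(-1)  # target not present in values
            missing.append(j)
    return indexer, missing


# Three distinct NaN objects: none is identical to another.
values = [float("nan"), "var1", float("nan")]
print(get_positions_non_unique(values, [float("nan")]))  # ([0, 2], [])
print(get_positions_non_unique(values, ["var1", "x"]))   # ([1, -1], [1])
```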

alexhlim · 2021-07-03T18:47:36Z

@jreback I ran the indexing asv benchmarks; some of them got slower:

asv continuous -f 1.1 upstream/master HEAD -b ^indexing

       before           after         ratio
     [5675cd8a]       [85dd25ce]
     <master>         <nui-regression>
+        81.5±2ms          107±2ms     1.31  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        80.4±1ms          105±3ms     1.31  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+        81.8±2ms          106±1ms     1.29  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+      83.3±0.9ms        105±0.9ms     1.26  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        81.9±2ms          101±2ms     1.24  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+      8.14±0.3ms       10.0±0.2ms     1.23  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        88.4±2ms          106±2ms     1.19  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        88.5±1ms          106±1ms     1.19  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+         128±2μs          143±3μs     1.12  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

jbrockmendel · 2021-07-24T22:25:09Z

LGTM @alexhlim can you merge master

jbrockmendel · 2021-07-26T00:25:50Z

LGTM pending green cc @jreback

alexhlim · 2021-07-26T00:59:04Z

@jbrockmendel noticed that CI / Checks is failing:

[doctest] pandas.core.generic.NDFrame.to_xarray
Typing validation

Are these errors something that my PR caused? If not, should I wait until these errors are resolved on master and re-merge?

jbrockmendel · 2021-07-26T03:22:40Z

Looks like those are unrelated, #42716

alexhlim · 2021-08-03T01:18:29Z

@jbrockmendel @jreback pinging b/c green

jbrockmendel

LGTM cc @jreback

jreback · 2021-08-04T23:54:32Z

thanks @alexhlim very nice, will merge on green.

jreback · 2021-08-05T11:51:21Z

thanks @alexhlim

alexhlim · 2021-08-05T13:07:50Z

Thank you everyone, especially @jbrockmendel and @jreback for guiding me through this PR. I definitely learned a lot!

arw2019 suggested changes Jul 31, 2020

pandas/_libs/index.pyx Outdated

pandas/tests/base/test_misc.py Outdated

alexhlim added a commit to alexhlim/pandas that referenced this pull request Jul 31, 2020

Removing comment, moving test_get_indexer_non_unique_multiple_nans (pandas-dev#35498)

f002d60

arw2019 approved these changes Jul 31, 2020

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 1, 2020

Pulling master, moved whatsnew from v1.1.0 to v1.2.0 (pandas-dev#35498)

494f507

WillAyd requested changes Aug 4, 2020

pandas/_libs/index.pyx Outdated

WillAyd added the Indexing label (related to indexing on series/frames, not to indexes themselves) Aug 4, 2020

jreback requested changes Aug 5, 2020

pandas/tests/indexes/test_base.py Outdated

pandas/_libs/index.pyx Outdated

pandas/tests/indexes/test_base.py Outdated

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

Using isnaobj instead of list comprehension (pandas-dev#35498)

e6a00aa

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

TST: Added object array and NaT tests, moved to test_indexing (pandas-dev#35498)

2bea731

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

CLN: Masking array for nan check, changing array dtype (pandas-dev#35498)

e083962

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

Merging master (pandas-dev#35498)

f14a9e5

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

DOC: adding e to get_indexer_for (pandas-dev#35498)

d902013

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020

CLN: np.object to object (pandas-dev#35498)

3367203

alexhlim requested review from jreback and WillAyd August 16, 2020 14:10

WillAyd requested changes Aug 18, 2020

pandas/_libs/index.pyx Outdated

pandas/_libs/index.pyx Outdated

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 18, 2020

CLN: Using masks instead of list comprehensions (pandas-dev#35498)

ca82d69

alexhlim requested a review from WillAyd August 18, 2020 15:53

WillAyd requested changes Aug 19, 2020

pandas/tests/indexes/test_indexing.py Outdated

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 21, 2020

CLN: differentiate 0 and np.nan (pandas-dev#35498)

4f9d9df

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 21, 2020

TST: differentiate 0 and np.nan (pandas-dev#35498)

9338d3b

alexhlim requested a review from WillAyd August 22, 2020 20:47

simonjayhawkins added the Needs Review label Sep 7, 2020

jreback requested changes Sep 13, 2020

pandas/_libs/index.pyx Outdated

pandas/_libs/index.pyx Outdated

pandas/tests/indexes/test_indexing.py Outdated

alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020

CLN: making copies of targets and values (pandas-dev#35498)

7fcb14f

alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020

CLN: using -1 instead of 0 for nan replacements (pandas-dev#35498)

834a715

alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020

TST: using -1 instead of 0 for nan replacements (pandas-dev#35498)

89137ee

alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020

Merging master (pandas-dev#35498)

7b14cf6

jbrockmendel reviewed Jun 28, 2021

pandas/_libs/index_class_helper.pxi.in Outdated

This was referenced Jun 28, 2021

BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan #35392

Closed

BUG: get_indexer_non_unique with np.nan #42289

Closed

BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan (pandas-dev#35392)

85dd25c

alexhlim force-pushed the nui-regression branch from a831aa4 to 85dd25c Compare July 2, 2021 22:26

merging master

11a3790

merging master

f5e6b41

alexhlim force-pushed the nui-regression branch from 838ceb4 to f5e6b41 Compare August 2, 2021 22:14

jbrockmendel approved these changes Aug 3, 2021

jreback added this to the 1.4 milestone Aug 4, 2021

jreback added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) Aug 4, 2021

jreback approved these changes Aug 4, 2021

Merge branch 'master' into nui-regression

b2d3dbb

jreback merged commit bf267b4 into pandas-dev:master Aug 5, 2021

alexhlim deleted the nui-regression branch August 5, 2021 13:07

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan (pandas-dev#35392) (pandas-dev#35498)

88c999c

alexhlim mentioned this pull request Oct 4, 2021

BUG: get_indexer_non_unique does not handle np.datetime64("NaT") and np.timedelta64("NaT") #43869

Closed

3 tasks
