BUG: "cannot reindex from duplicate axis" thrown using unique indexes, duplicated column names and a specific numpy array values #31954

igorluppi · 2020-02-13T16:33:54Z

Code Sample

import pandas 
import numpy as np

a = np.array([[1,2],[3,4]]) 

# Does NOT work
b = np.array([[0.5,6],[7,8]])  
# b = np.array([[.5,6],[7,8]])  # The same problem

# This one works fine:
# b = np.array([[5,6],[7,8]]) 

dfA = pandas.DataFrame(a)
# This works fine EVEN using .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis = 1)

print(df_new[df_new > 5])

Problem description

There is a bug that combines specific numpy values and duplicated DataFrame column names when a selection operation such as df[df > 5] is used. An exception is thrown saying "cannot reindex from a duplicate axis", but it should not be, because:

The DataFrame has no duplicated indexes (df.index.is_unique is True)
The DataFrame has duplicated column names, but that should not be a problem when we apply a selection operation such as df_new[df_new > 5]
The DataFrame holds float or int numpy values, so the dtype should not change the behavior of the code

However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
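The axis claims in this description can be checked directly on the concatenated frame. The following is a minimal sketch that reuses the arrays from the code sample; it only inspects the axes and does not run the failing selection, so it should behave the same on affected and fixed pandas versions.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[0.5, 6], [7, 8]])

# Both frames get the default column labels 0 and 1, so concatenating
# them side by side yields the column labels 0, 1, 0, 1.
df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)

print(df_new.index.is_unique)    # row labels are unique
print(df_new.columns.is_unique)  # column labels repeat
print(list(df_new.columns))
```

So the "duplicate axis" in the error message can only refer to the columns, not to the index.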

Expected Output

    0   1    0  1
0 NaN NaN  NaN  6
1 NaN NaN  7.0  8

Current Output

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


igorluppi · 2020-02-13T16:50:35Z

Moreover, doing this:

In [177]:  new_df = df.reset_index(drop=True) 
In [178]:  new_df[new_df > 10]

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

So we can be 100% sure that there are no duplicated indexes here. What is going on?

MarcoGorelli · 2020-02-13T17:20:08Z

Thanks @igorluppi

I just tried

df = pd.DataFrame(np.random.randn(150001, 792))
df[df>10]

and got no error - could you give us some more details about your dataframe? Do you still get the error if you only consider its head, or if you only use (say) its first 5 columns?

igorluppi · 2020-02-13T17:28:24Z

I have many dataframes, and I put all of them into a single one:

Let df_items be a list of dataframes.
I got this error using:
df_final = pandas.concat(df_items, axis = 1)

However I verified that
df_final = reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)

This works fine: I get the same resulting DataFrame, and df_final[df_final>10] works on it. But this approach takes a long time to run; concat is at least 10 times faster.

Thanks to https://stackoverflow.com/questions/45885043/pandas-concat-cannot-reindex-from-a-duplicate-axis?rq=1 for this possible solution. But why does the error happen?

MarcoGorelli · 2020-02-13T17:36:11Z

> Applying df[df>10] I got "cannot reindex from duplicate axis",
>
> I got this error using:
> df_final = pandas.concat(df_items, axis = 1)

Sorry, I'm a bit confused, which command gave you the error - pd.concat or df[df>10]?

igorluppi · 2020-02-13T17:42:18Z

pd.concat gave me df_final, and df_final raises that error when I use df_final[df_final>10].
The interesting part is that when I use the reduce method and get df_final2, that one works; in other words, df_final2[df_final2>10] works fine.

Moreover,

In [14]: df_final.equals(df_final2)
Out[14]: False

But I didn't find where the difference is.

igorluppi · 2020-02-13T18:01:44Z

@MarcoGorelli I found why the problem is happening, but it points to another problem with the exception I got. Give me a second.

igorluppi · 2020-02-13T18:17:47Z

@MarcoGorelli "cannot reindex from duplicate axis" should be split into two messages:
"cannot reindex from duplicate index" and "cannot reindex from duplicate columns". I will explain why.

Why? Because all the messages and solutions I found told me to look at the index, but in my case the duplicates were in the columns.

But why did the second approach work? reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)
In this case, when merge finds a duplicated column, it automatically appends a suffix such as "_x" to the name, so it becomes "duplicated_column_x". concat does not do this; it keeps the duplicated column name "duplicated_column".
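The naming difference between the two approaches can be seen on two tiny frames. This is a minimal sketch; the frames `left` and `right` and their values are made up for illustration, and the suffixes shown are the pandas defaults for `merge`.

```python
import pandas as pd

left = pd.DataFrame({"duplicated_column": [1, 2]})
right = pd.DataFrame({"duplicated_column": [3, 4]})

# merge disambiguates clashing column names with the default
# suffixes "_x" and "_y".
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")

# concat keeps the labels exactly as they are, duplicates included.
concatenated = pd.concat([left, right], axis=1)

print(list(merged.columns))
print(list(concatenated.columns))
```

The merged frame therefore has unique column labels, while the concatenated one does not.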

My suggestion

Please change the exception so that it states whether the problem is in the columns or in the index. Just saying "duplicate axis" made it confusing to find the solution.
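Until the message is more specific, a user-side check can name the offending axis before the selection is attempted. The sketch below is only an illustration: `check_unique_axes` is a hypothetical helper, not part of pandas, and the message wording is borrowed from the suggestion in this comment.

```python
import pandas as pd


def check_unique_axes(df: pd.DataFrame) -> None:
    """Hypothetical helper (not a pandas API): say which axis has duplicates."""
    if not df.index.is_unique:
        raise ValueError("cannot reindex from duplicate index")
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"cannot reindex from duplicate columns: {dupes}")


frame = pd.concat([pd.DataFrame([[1, 2]]), pd.DataFrame([[0.5, 6]])], axis=1)

message = ""
try:
    check_unique_axes(frame)
except ValueError as err:
    message = str(err)

print(message)
```

Calling the helper right after `pandas.concat(..., axis=1)` would point straight at the columns instead of leaving the reader to inspect the index first.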

MarcoGorelli · 2020-02-13T18:31:33Z

Thanks @igorluppi

tbh I still can't reproduce the error:

df = pd.DataFrame([0, 1], columns=['a'])
new_df = pd.concat([df, df], axis=1)
new_df[new_df>0]  # works

could you try coming up with a minimal reproducible example?

igorluppi · 2020-02-13T18:41:35Z

Ok, I will create a simple example

igorluppi · 2020-02-13T19:49:27Z

@MarcoGorelli

import pandas 
import numpy as np

a = np.array([[1,2],[3,4]]) 

# Does NOT work
b = np.array([[0.5,6],[7,8]]) 
# OR
# b = np.array([[.5,6],[7,8]])

# This one works fine:
# b = np.array([[5,6],[7,8]]) 

dfA = pandas.DataFrame(a)
# This works fine EVEN using .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis = 1)

df_new[df_new>3]

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Basically, using .5 or 0.5 in the numpy array breaks the DataFrame operation. This might be a problem in how pandas and numpy interact.

The interesting part is that numpy float values only break the code when the column names are duplicated.
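One observable difference between the working and failing inputs is the mix of column dtypes in the concatenated frame. The sketch below only measures that difference; the thread does not establish it as the root cause, and `distinct_dtypes` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])


def distinct_dtypes(b: np.ndarray) -> int:
    """Count distinct column dtypes after concatenating a and b side by side."""
    df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)
    return len(set(df_new.dtypes.tolist()))


# Integer-only input: reported in the thread as working.
all_int = distinct_dtypes(np.array([[5, 6], [7, 8]]))

# Input containing 0.5: reported in the thread as failing.
with_float = distinct_dtypes(np.array([[0.5, 6], [7, 8]]))

print(all_int, with_float)
```

With integer input every column shares one dtype; with 0.5 in the array the frame holds both integer and float columns under the same duplicated labels.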

MarcoGorelli · 2020-02-13T23:03:30Z

@igorluppi great, thanks! Could you edit this example into the original post?

igorluppi · 2020-02-14T02:35:15Z

@MarcoGorelli

For sure, it's done my friend!

igorluppi · 2020-02-17T03:08:06Z

@MarcoGorelli is it a bug? Any news?

MarcoGorelli · 2020-02-17T06:01:47Z

cc @jorisvandenbossche

igorluppi · 2020-03-04T21:04:35Z

should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

MarcoGorelli · 2020-03-06T16:30:34Z

> should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

I don't think so - I presume the core team is prioritising what'll be in the v1.0.2 release. I'm working on another issue at the moment but I plan to get back to this

igorluppi · 2020-05-11T17:44:29Z

Any news ? @MarcoGorelli @jorisvandenbossche

MarcoGorelli · 2020-05-11T17:50:02Z

I've not (yet) looked into this more, but you're welcome to submit a pull request if you like https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

Dr-Irv · 2020-09-05T13:48:51Z

This works fine in pandas 1.1.1

MarcoGorelli · 2020-09-05T14:00:19Z

> This works fine in pandas 1.1.1

Any idea when it was fixed? It's probably good to make sure this was intentional and that there's a test for it...I'll do a git bisect

MarcoGorelli · 2020-09-05T14:41:01Z

If I've done git bisect correctly (which I'm not sure I have, see below) it looks like this was fixed in #33616

Could do with a test, so am reopening.

Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ git checkout 1c0cc62e30a3077476e97f8e7e6ba17b4ac754b6
Previous HEAD position was ad8ce0be9 CLN: Clean missing.py (#33631)
HEAD is now at 1c0cc62e3 REF: get .items out of BlockManager.apply (#33616)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ python setup.py build_ext -i -j 8
running build_ext
building 'pandas._libs.tslibs.nattype' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/nattype.c -o build/temp.linux-x86_64-3.8/pandas/_libs/tslibs/nattype.o -Werror
building 'pandas._libs.interval' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs -Ipandas/_libs/src/klib -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/interval.c -o build/temp.linux-x86_64-3.8/pandas/_libs/interval.o -Werror
pandas/_libs/tslibs/nattype.c:5108:18: error: ‘__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__’ defined but not used [-Werror=unused-function]
 5108 | static PyObject *__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_other) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas/_libs/interval.c:8278:18: error: ‘__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__’ defined but not used [-Werror=unused-function]
 8278 | static PyObject *__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_y) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
cc1: all warnings being treated as errors
error: command 'gcc' failed with exit status 1

EDIT

@dsaxton I saw you've brought something similar up in the Gitter chat, were you able to resolve it?

jorisvandenbossche · 2020-09-05T14:45:07Z

@MarcoGorelli thanks for the analysis!

dsaxton · 2020-09-06T12:24:54Z

@MarcoGorelli I found that building instead with the command CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i generally fixes things, although I'm not sure if it'll work in this case. There's a thread about these problems here: #33315

simonjayhawkins · 2020-09-06T18:45:38Z

> Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

I've set up a workflow for bisecting. I didn't see that error, but I added || exit 125 to the runner script to skip failed builds.

https://github.com/simonjayhawkins/pandas/runs/1078479989?check_suite_focus=true

It agrees that #33616 fixed this.
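The `|| exit 125` trick relies on the exit-code convention of `git bisect run`: 0 marks the commit as good, 125 means the commit cannot be tested and should be skipped, and other codes from 1 to 127 mark it as bad. A minimal sketch of a runner built on that convention follows; it is not the workflow linked above, and `repro.py` is a hypothetical script that exits 0 when the commit should be marked good.

```python
import subprocess
import sys

SKIP = 125  # git bisect run: "this commit cannot be tested, skip it"


def bisect_exit_code(build_ok: bool, check_ok: bool) -> int:
    """Map build and check results onto git bisect run exit codes."""
    if not build_ok:
        return SKIP  # e.g. the gcc failure seen during the manual bisect
    return 0 if check_ok else 1


def run(cmd) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def main() -> int:
    build_ok = run([sys.executable, "setup.py", "build_ext", "-i"])
    # repro.py is hypothetical; it is only run when the build succeeded.
    check_ok = build_ok and run([sys.executable, "repro.py"])
    return bisect_exit_code(build_ok, check_ok)
```

Saved as a script that ends with `sys.exit(main())`, this could be passed to `git bisect run`. Because the goal here was to find the commit that fixed the bug rather than the one that introduced it, the check would need to exit 0 while the bug is still present.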

MarcoGorelli · 2020-09-13T16:08:24Z

> I've set up a workflow for bisecting.

wow, nice!!

GabrielSimonetto · 2020-10-08T00:09:23Z

take

GabrielSimonetto · 2020-10-12T16:27:02Z

Ok, I'm stuck.

After investigating PR 33616, I see that two files were changed:
pandas/core/generic.py and pandas/core/internals/managers.py. They seem tightly coupled, and generic.py does the reindexing directly, but after setting some breakpoints with the following code, I noticed that the generic.py portion is never called.

df = pd.DataFrame([[1,2,5,6],
                    [3,4,7,8]])
df.columns=[0,1,0,1]
df[df>5]

Besides that, by grepping I found that the exception mentioned in this issue is only raised in the function _can_reindex(), and this function is only used in reindex_indexer(), which should make it easy to debug how the error happens

(venv) [bigode@coala pandas]$ grep -r _can_reindex
core/indexes/base.py:    def _can_reindex(self, indexer):
core/internals/managers.py:            self.axes[axis]._can_reindex(indexer)

The problem is that, after breakpointing both functions, they are never called during this operation! That means the fix in pandas/core/internals/managers.py made the code avoid a section it should never have entered, which is supported by the comments @jreblack inserted:

# The caller is responsible for ensuring that
#  obj.axes[-1].equals(self.items)

I was already a bit stuck on which should be the specific test before...

(I was rehearsing something with pandas.core.internals.managers.BlockManager.{reindex_indexer, reindex_axis}, but I could not confirm they are being used since the only entrypoint I could confirm was the aforementioned internals.managers.apply(), and actually, inserting a breakpoint on reindex_indexer and reindex_axis didn't work on the test code. Which makes me think they are not being called, as absurd as that sounds),

...but now I'm completely lost. If someone could shed some light on the issue that would be awesome. Besides that, if I have some spare time I will try to use a pandas version prior to PR 33616 to see if I can pinpoint what exact interaction fixed this issue.

Dr-Irv · 2020-10-12T16:38:17Z

@GabrielSimonetto To address this issue, you only need to add a test that demonstrates that the bug was fixed. Don't worry about the internals. What happened here is that I saw the issue was fixed, and closed it, then @MarcoGorelli wanted to figure out where it was fixed, and we reopened it deciding we just needed a test to make sure that the issue is truly addressed.

MarcoGorelli · 2020-10-12T16:41:00Z

Yup 😄 @GabrielSimonetto if you wanted to submit a test to make sure this doesn't break again in the future, that would be welcome!

GabrielSimonetto · 2020-10-12T16:44:07Z

@Dr-Irv do you know which module would be the right place for this test? If I understood correctly, just a high-level check will be enough?

MarcoGorelli · 2020-10-12T16:52:33Z

@GabrielSimonetto You can use the example provided in #31954 (comment) as a test

If you open a pull request you can put it in where you think a sensible location is and if necessary we'll ask you to put it somewhere else
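A high-level regression test built from the reported example might look like the sketch below. It is not the test that was eventually merged; the function name is made up, and it assumes a pandas version in which the bug is already fixed (1.1.1 or later, according to this thread).

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_masking_duplicate_columns_sketch():
    # GH 31954: boolean masking with duplicated column labels and mixed dtypes
    dfA = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    dfB = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df_new = pd.concat([dfA, dfB], axis=1)

    # On affected versions this line raised
    # "ValueError: cannot reindex from a duplicate axis".
    result = df_new[df_new > 5]

    # Values that fail the mask become NaN, so every column ends up float.
    expected = pd.DataFrame(
        [[np.nan, np.nan, np.nan, 6.0], [np.nan, np.nan, 7.0, 8.0]],
        columns=[0, 1, 0, 1],
    )
    tm.assert_frame_equal(result, expected)
```

The expected frame mirrors the "Expected Output" from the original report, with the integer columns promoted to float by the NaN fill.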

GabrielSimonetto · 2020-10-12T16:53:43Z

Great @MarcoGorelli! I'm on it, thanks!

igorluppi changed the title Feb 14, 2020
Old title: "cannot reindex from duplicate axis" when I apply some operation like df[df > 10] using unique indexes
New title: BUG: "cannot reindex from duplicate axis" thrown using unique indexes, duplicated column names and a specific numpy array values

Dr-Irv closed this as completed Sep 5, 2020

MarcoGorelli reopened this Sep 5, 2020

MarcoGorelli added the Needs Tests (Unit test(s) needed to prevent regressions) label Sep 5, 2020

MarcoGorelli added this to the Contributions Welcome milestone Sep 5, 2020

jorisvandenbossche added the good first issue label Sep 5, 2020

github-actions bot assigned GabrielSimonetto Oct 8, 2020

GabrielSimonetto mentioned this issue Oct 14, 2020

Add test_masking_duplicate_columns #37125

Closed

5 tasks

jreback modified the milestones: Contributions Welcome, 1.2 Oct 16, 2020

jreback added the Error Reporting (Incorrect or improved errors from pandas) and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Oct 16, 2020

jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020

arw2019 mentioned this issue Dec 8, 2020

TST: a weird bug with numpy + DataFrame with duplicate columns #38354

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Dec 8, 2020

jreback closed this as completed in #38354 Dec 8, 2020
