BUG: "cannot reindex from duplicate axis" thrown using unique indexes, duplicated column names and a specific numpy array values #31954
Comments
Moreover, doing this:
So, it's 100% sure that we have no duplicates here, so what is going on? |
Thanks @igorluppi I just tried
and got no error - could you give us some more details about your dataframe? Do you still get the error if you only consider its head, or if you only use (say) its first 5 columns? |
I have many dataframes, and I put all of them in a single one: let df_items be a list of dataframes. However I verified that Works fine, I got the same result DF and when I apply Thanks for https://stackoverflow.com/questions/45885043/pandas-concat-cannot-reindex-from-a-duplicate-axis?rq=1 about this possible solution. But why does the error happen? |
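For readers following along, here is a minimal sketch of combining a list of DataFrames and checking the resulting index. The frames in `df_items` below are made up for illustration and are not the reporter's data, and the commands elided from the comment above are not reconstructed here.

```python
import pandas as pd

# Hypothetical stand-ins for the reporter's list of DataFrames.
df_items = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"a": [5, 6], "b": [7, 8]}),
]

# Plain concatenation keeps each frame's own row labels, so labels repeat.
combined = pd.concat(df_items)
print(combined.index.is_unique)  # False: labels are 0, 1, 0, 1

# ignore_index=True discards the old labels and builds a fresh RangeIndex.
combined_fresh = pd.concat(df_items, ignore_index=True)
print(combined_fresh.index.is_unique)  # True: labels are 0, 1, 2, 3
```

This only covers duplicated row labels; as the thread goes on to show, duplicated column names are a separate thing to check.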
Sorry, I'm a bit confused, which command gave you the error - |
Moreover,
But I didnt find where the difference is |
@MarcoGorelli I found why the problem is happening, but this points to another problem regarding the exception I got. Give me a second |
@MarcoGorelli "cannot reindex from duplicate axis" should be broken in two messages: Why is that? Because all the messages and solutions I was looking at told me to look at the indexes, but in my case I found duplicated columns. But why did the second case work? My suggestion: please change the exception to be specific about whether the problem belongs to the columns (or the index). Just saying "duplicate axis" made it a little confusing to find the solution. |
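Until the message itself says which axis is affected, a small check can narrow it down. The helper below is a hypothetical sketch (the name `describe_duplicate_axes` is not part of pandas); it relies only on `Index.is_unique` and `Index.duplicated`.

```python
import pandas as pd


def describe_duplicate_axes(df: pd.DataFrame) -> str:
    """Hypothetical helper: report which axis of df carries duplicate labels."""
    problems = []
    if not df.index.is_unique:
        dupes = df.index[df.index.duplicated()].unique().tolist()
        problems.append(f"index has duplicate labels: {dupes}")
    if not df.columns.is_unique:
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        problems.append(f"columns have duplicate labels: {dupes}")
    return "; ".join(problems) if problems else "no duplicate labels on either axis"


# Unique index, duplicated column name "a".
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])
print(describe_duplicate_axes(df))  # columns have duplicate labels: ['a']
```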
Thanks @igorluppi tbh I still can't reproduce the error:
could you try coming up with a minimal reproducible example? |
Ok, I will create a simple example |
Basically, using .5 or 0.5 in the numpy array there breaks the dataframe operation. This might be a problem with pandas + numpy. The interesting part is: numpy float values only break the code if we have duplicated column names. |
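The original example did not survive in this copy of the thread, so the following is only a sketch of the situation described: an integer array and a float array (containing 0.5) combined so that column labels repeat, followed by a boolean selection. The specific values are assumptions.

```python
import numpy as np
import pandas as pd

# Assumed data: one integer array and one float array containing 0.5.
ints = pd.DataFrame(np.array([[1, 2], [3, 4]]))
floats = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))

# Concatenating side by side yields column labels 0, 1, 0, 1 (duplicated).
df = pd.concat([ints, floats], axis=1)
print(df.index.is_unique)    # True
print(df.columns.is_unique)  # False

try:
    selected = df[df > 5]
    outcome = "selection succeeded"
except ValueError as exc:
    # pandas 1.0.1 was reported to raise here; later versions reportedly do not.
    outcome = f"selection failed: {exc}"
print(outcome)
```

Which branch runs depends on the installed pandas version, so no output is claimed for the last line.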
@igorluppi great, thanks! Could you edit this example into the original post? |
For sure, it's done my friend! |
@MarcoGorelli is it a bug ? Anything new ? |
should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli |
I don't think so - I presume the core team is prioritising what'll be in the v1.0.2 release. I'm working on another issue at the moment but I plan to get back to this |
Any news ? @MarcoGorelli @jorisvandenbossche |
I've not (yet) looked into this more, but you're welcome to submit a pull request if you like https://pandas.pydata.org/pandas-docs/stable/development/contributing.html |
This works fine in pandas 1.1.1 |
Any idea when it was fixed? It's probably good to make sure this was intentional and that there's a test for it...I'll do a git bisect |
If I've done Could do with a test, so am reopening. Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ git checkout 1c0cc62e30a3077476e97f8e7e6ba17b4ac754b6
Previous HEAD position was ad8ce0be9 CLN: Clean missing.py (#33631)
HEAD is now at 1c0cc62e3 REF: get .items out of BlockManager.apply (#33616)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ python setup.py build_ext -i -j 8
running build_ext
building 'pandas._libs.tslibs.nattype' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/nattype.c -o build/temp.linux-x86_64-3.8/pandas/_libs/tslibs/nattype.o -Werror
building 'pandas._libs.interval' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs -Ipandas/_libs/src/klib -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/interval.c -o build/temp.linux-x86_64-3.8/pandas/_libs/interval.o -Werror
pandas/_libs/tslibs/nattype.c:5108:18: error: ‘__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__’ defined but not used [-Werror=unused-function]
5108 | static PyObject *__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_other) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas/_libs/interval.c:8278:18: error: ‘__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__’ defined but not used [-Werror=unused-function]
8278 | static PyObject *__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_y) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
cc1: all warnings being treated as errors
error: command 'gcc' failed with exit status 1
EDIT: @dsaxton I saw you've brought something similar up in the Gitter chat, were you able to resolve it? |
@MarcoGorelli thanks for the analysis! |
@MarcoGorelli I found that building instead with the command |
I've set up a workflow for bisecting. didn't see that error but added https://github.com/simonjayhawkins/pandas/runs/1078479989?check_suite_focus=true agrees that #33616 fixed. |
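For anyone wanting to repeat the bisect locally, a check script along these lines could be handed to `git bisect run`. This is a sketch, not the workflow linked above: the data is assumed, and because the search is for the commit that fixed the bug, exit code 0 marks a commit where the error still occurs and 1 marks a commit where it is gone. Exit code 125 asks git to skip a commit that cannot be tested.

```python
def check() -> int:
    """Return 0 while the bug is still present, 1 once it is fixed, 125 to skip."""
    try:
        import numpy as np
        import pandas as pd
    except Exception:
        # The checkout could not be imported (e.g. a failed build): skip it.
        return 125

    ints = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    floats = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df = pd.concat([ints, floats], axis=1)  # duplicated column labels
    try:
        df[df > 5]
    except ValueError:
        return 0  # still raises: this commit is on the "old" (broken) side
    return 1  # no error: this commit is on the "new" (fixed) side


# A real bisect script would pass this value to sys.exit() instead of printing it.
exit_code = check()
print(exit_code)
```

With `git bisect start --term-old=broken --term-new=fixed`, the 0/1 convention above maps onto the custom terms. The C extensions still have to be rebuilt at every step, which is where the gcc error reported earlier comes in.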
wow, nice!! |
take |
Ok, I'm stuck. After investigating PR 33616, we can see that 2 files were changed:
Besides that, by grepping I've found out that the exception mentioned in this issue is only raised in the function
The problem is that, after setting breakpoints in both functions, they are never called during this operation! Which means that the fix on
I was already a bit stuck on what the specific test should be before... (I was rehearsing something with ...but now I'm completely lost. If someone could shed some light on the issue that would be awesome. Besides that, if I have some spare time I will try to use a pandas version prior to PR 33616 to see if I can pinpoint what exact interaction fixed this issue. |
@GabrielSimonetto To address this issue, you only need to add a test that demonstrates that the bug was fixed. Don't worry about the internals. What happened here is that I saw the issue was fixed, and closed it, then @MarcoGorelli wanted to figure out where it was fixed, and we reopened it deciding we just needed a test to make sure that the issue is truly addressed. |
Yup 😄 @GabrielSimonetto if you wanted to submit a test to make sure this doesn't break again in the future, that would be welcome! |
@Dr-Irv would you know which module would be the right place for this test? If I understood correctly, just a high-level check will be enough? |
@GabrielSimonetto You can use the example provided in #31954 (comment) as a test. If you open a pull request you can put it wherever you think a sensible location is, and if necessary we'll ask you to put it somewhere else. |
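A high-level regression test of the kind described could look roughly like this. It is a sketch under assumptions: the input values and the test name are invented for illustration, and it expects a pandas version in which the selection no longer raises (the thread reports 1.1.1 as working).

```python
import numpy as np
import pandas as pd


def test_getitem_bool_mask_duplicate_columns():
    # Assumed inputs: an integer block and a float block sharing column labels.
    ints = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    floats = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df = pd.concat([ints, floats], axis=1)  # columns are 0, 1, 0, 1

    result = df[df > 2]

    # Values that fail the mask become NaN; column labels are preserved.
    expected = pd.DataFrame(
        [[np.nan, np.nan, np.nan, 6.0], [3.0, 4.0, 7.0, 8.0]],
        columns=[0, 1, 0, 1],
    )
    pd.testing.assert_frame_equal(result, expected)


test_getitem_bool_mask_duplicate_columns()
```

The test only pins down the observable behaviour (no exception, masked values replaced by NaN), which matches the advice above not to worry about the internals.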
Great @MarcoGorelli! I'm on it, thanks! |
Code Sample
Problem description
There is a bug that combines specific numpy values and duplicated DataFrame column names when a select operation is used, such as df[df > 5]. An exception is thrown saying "cannot reindex from duplicate axis". However, it should not be, because:
df.index.is_unique is True
df_new[df_new > 5]
the values are only float or int numpy values, so they should not change the behavior of the code
However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
Expected Output
Current Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None