BUG: df.sort_index not sorting #55379

Closed
2 of 3 tasks
caneff opened this issue Oct 3, 2023 · 5 comments
Labels
Bug, MultiIndex, Regression (functionality that used to work in a prior pandas version)

Comments

@caneff
Contributor

caneff commented Oct 3, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
ix = pd.MultiIndex.from_tuples(
    [
        ('a', '10'), ('a', '18'), ('a', '25'), ('b', '16'), ('b', '26'),
        ('a', '45'), ('b', '28'), ('a', '5'), ('a', '50'), ('a', '51'),
        ('b', '4'), ('b', '49'), ('a', '78'), ('a', '81'), ('a', '85'),
        ('b', '67'), ('b', '74'), ('b', '77'), ('b', '83'), ('b', '97'),
    ],
    names=['group', 'str'],
)

df = pd.DataFrame({'x':range(len(ix))},index=ix)
df.sort_index() # Works

df2 = df.iloc[0:6]
df2.sort_index() # Doesn't sort right
df2.index.sort_values() # Sorts correctly

df3 = df.iloc[0:7]
df3.sort_index() # Works!

df4 = df.iloc[1:7]
df4.sort_index() # Doesn't sort right

Issue Description

df2.sort_index() should sort the index, but it doesn't. It doesn't leave the rows in their original order either; instead it produces a partial sort.

I get:

In [30]: df2
Out[30]: 
           x
group str   
a     10   0
      18   1
      25   2
b     16   3
      26   4
a     45   5

In [31]: df2.sort_index()
Out[31]: 
           x
group str   
a     10   0
      18   1
      25   2
b     16   3
a     45   5
b     26   4

This was discovered while trying to debug why Beam tests weren't working right under pandas 2.1. I can't reproduce the bug when I build df from only the first 6 elements of ix; for some reason I have to start from a longer index and then slice it.
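A possible explanation for why a longer starting index is needed, consistent with the diagnosis later in this thread: slicing a MultiIndex keeps the original levels, so the codes of the slice can be larger than they would be in an index built directly from the same six tuples. The sketch below only inspects `levels` and `codes`; the variable names `tuples`, `sliced`, and `rebuilt` are illustrative and not part of the report.

```python
import pandas as pd

tuples = [
    ('a', '10'), ('a', '18'), ('a', '25'), ('b', '16'), ('b', '26'),
    ('a', '45'), ('b', '28'), ('a', '5'), ('a', '50'), ('a', '51'),
    ('b', '4'), ('b', '49'), ('a', '78'), ('a', '81'), ('a', '85'),
    ('b', '67'), ('b', '74'), ('b', '77'), ('b', '83'), ('b', '97'),
]
ix = pd.MultiIndex.from_tuples(tuples, names=['group', 'str'])

# Slicing keeps all 20 values of the 'str' level, used or not.
sliced = ix[:6]
# Building from the same six tuples yields a level with only 6 values.
rebuilt = pd.MultiIndex.from_tuples(tuples[:6], names=['group', 'str'])

print(len(sliced.levels[1]), len(rebuilt.levels[1]))            # 20 6
print(int(sliced.codes[1].max()), int(rebuilt.codes[1].max()))  # 7 5
```

In the sliced index the largest 'str' code (7) is not smaller than the number of rows (6), which is not the case for the rebuilt index.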

Note that this works in 2.0.3.

Expected Behavior

The index should be sorted the same way it is for df2.index.sort_values().
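The expectation above can be written as a small check. This is a minimal sketch; `sort_index_matches_sort_values` is a hypothetical helper name, not a pandas function, and the result for the sliced frame depends on the installed pandas version, so no output is shown for it.

```python
import pandas as pd


def sort_index_matches_sort_values(frame: pd.DataFrame) -> bool:
    """Hypothetical helper: True if sort_index() orders rows like index.sort_values()."""
    return frame.sort_index().index.equals(frame.index.sort_values())


ix = pd.MultiIndex.from_tuples(
    [
        ('a', '10'), ('a', '18'), ('a', '25'), ('b', '16'), ('b', '26'),
        ('a', '45'), ('b', '28'), ('a', '5'), ('a', '50'), ('a', '51'),
        ('b', '4'), ('b', '49'), ('a', '78'), ('a', '81'), ('a', '85'),
        ('b', '67'), ('b', '74'), ('b', '77'), ('b', '83'), ('b', '97'),
    ],
    names=['group', 'str'],
)
df = pd.DataFrame({'x': range(len(ix))}, index=ix)

# Full frame: reported as sorting correctly.
print(sort_index_matches_sort_values(df))
# Six-row slice: reported as sorting incorrectly on pandas 2.1.1.
print(sort_index_matches_sort_values(df.iloc[0:6]))
```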

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.3.11-1rodete2-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.3.11-1rodete2 (2023-08-24)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : 6.84.3
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : None
IPython : 8.15.0
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.2
sqlalchemy : 1.4.49
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@caneff added the Bug and Needs Triage labels Oct 3, 2023
@rhshadrach
Member

Thanks for the report. I can reproduce on 2.1.x but not on main. We should run a git bisect to figure out where this was fixed and consider backporting the patch.

@rhshadrach added the Regression and MultiIndex labels Oct 4, 2023
@rhshadrach added this to the 2.1.2 milestone Oct 4, 2023
@rhshadrach
Member

git bisect gives

Date:   Wed Aug 30 13:02:18 2023 -0400

    PERF: lexsort_indexer (MultiIndex / multi-column sorting) (#54835)

cc @lukemanley. I haven't looked into whether this can be backported directly, nor whether we should add tests.

@lukemanley
Member

lukemanley commented Oct 4, 2023

It looks like the bug was introduced in #51672 in the 2.1 cycle (cc @phofl) and has since been fixed via #54835 on main.

I think we should backport a fix, but I'm not sure if we want to try and backport #54835 or backport a more targeted fix.

For a more targeted fix, I think this line needs to change:

n = len(codes)

to something like:

n = codes.max() + 1 if len(codes) else 0

The issue here is that codes can be a non-compressed set of codes (e.g. from a slice of a MultiIndex), so its values can be greater than or equal to its length and we cannot rely on len(codes).
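To illustrate why the length is not a safe bound, here is a simplified pure-Python model. It assumes the sort packs each row's per-level codes into a single integer key using n as the radix; that is an assumption made for illustration, not pandas' actual implementation, and `packed_sort_order` is a hypothetical name. The codes are those of the six-row slice from the report, taken against the levels of the full 20-row index.

```python
def packed_sort_order(codes0, codes1, n1):
    """Hypothetical model: sort rows by the packed key code0 * n1 + code1."""
    keys = [c0 * n1 + c1 for c0, c1 in zip(codes0, codes1)]
    # sorted() is stable, so rows with equal keys keep their original order.
    return sorted(range(len(keys)), key=keys.__getitem__)


# Rows: (a,10), (a,18), (a,25), (b,16), (b,26), (a,45)
group_codes = [0, 0, 0, 1, 1, 0]
str_codes = [0, 2, 3, 1, 4, 7]  # non-compressed: 7 >= len(str_codes)

# Radix taken from the length, mirroring n = len(codes).
by_length = packed_sort_order(group_codes, str_codes, n1=len(str_codes))
# Radix taken from the largest code, mirroring the proposed change.
n1 = max(str_codes) + 1 if str_codes else 0
by_max = packed_sort_order(group_codes, str_codes, n1=n1)

print(by_length)  # [0, 1, 2, 3, 5, 4]
print(by_max)     # [0, 1, 2, 5, 3, 4]
```

Under this model, the length-based radix gives (b,16) and (a,45) the same key, and the resulting row order 0, 1, 2, 3, 5, 4 matches the x column in the reported output. With the max-based radix every key is distinct and the rows come out in the expected order.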

@lukemanley removed the Needs Triage label Oct 4, 2023
@phofl
Member

phofl commented Oct 4, 2023

A targeted backport sounds good.

@lukemanley
Member

Closed via #55474


5 participants