Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index #13657

Merged
merged 8 commits into from
Jul 5, 2023

Conversation

galipremsagar
Copy link
Contributor

Description

This PR:

  • Adds ordering in MultiIndex.loc when pandas compatibility mode is enabled.
  • Fixes a bug in Groupby when as_index is passed and is not being respected when a getitem call is performed on the groupby.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer non-breaking Non-breaking change labels Jul 3, 2023
@galipremsagar galipremsagar requested a review from a team as a code owner July 3, 2023 21:21
@galipremsagar galipremsagar self-assigned this Jul 3, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. Once this is merged, please get the permalink for this TODO and copy it into some issue about deterministic join/merge. There is at least one existing issue on this topic but I can’t search for it at this moment.

@galipremsagar
Copy link
Contributor Author

Thanks @bdice ! Made a note of it in #12236

@galipremsagar
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 9da347e into rapidsai:branch-23.08 Jul 5, 2023
Comment on lines +904 to +910
if cudf.get_option("mode.pandas_compatible"):
# Sort indices in pandas compatible mode
# because we want the indices to be fetched
# in a deterministic order.
# TODO: Remove this after merge/join
# obtain deterministic ordering.
indices = indices.sort_values()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this can't do the right thing, because the indices you get back from the merge that lives inside _get_valid_indices_by_tuple don't induce the order you want on the result.

For example (with this change applied):

import pandas as pd
import cudf
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [1, 0, 1, 1], "c": [1, 2, 3, 4]}).set_index(["a", "b"])
cdf = cudf.from_pandas()
# pandas behaviour
In [24]: df.loc[([1, 1], [0, 1]), :]
Out[24]: 
     c
a b   
1 0  2
  1  1

In [25]: df.loc[([1, 1], [1, 0]), :]
Out[25]: 
     c
a b   
1 1  1
  0  2

# (new) cudf behaviour
n [26]: cdf.loc[([1, 1], [0, 1]), :]
Out[26]: 
     c
a b   
1 1  1
  0  2

In [27]: cdf.loc[([1, 1], [1, 0]), :]
Out[27]: 
     c
a b   
1 1  1
  0  2

Note how in the cudf case both return values are now ordered the same.

Do you instead want something like this in _compute_validity_mask ?

diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py
index d8a08647f3..0ee741e066 100644
--- a/python/cudf/cudf/core/multiindex.py
+++ b/python/cudf/cudf/core/multiindex.py
@@ -759,6 +759,8 @@ class MultiIndex(Frame, BaseIndex, NotIterable):
             if isinstance(row, slice) and row == slice(None):
                 continue
             lookup[i] = cudf.Series(row)
+        cantor_name = "_" + "_".join(map(str, lookup.columns))
+        lookup[cantor_name] = column.arange(len(lookup))
         frame = cudf.DataFrame(dict(enumerate(index._data.columns)))
         data_table = cudf.concat(
             [
@@ -769,7 +771,9 @@ class MultiIndex(Frame, BaseIndex, NotIterable):
             ],
             axis=1,
         )
-        result = lookup.merge(data_table)["idx"]
+        result = lookup.merge(data_table).sort_values(by=[cantor_name, "idx"])[
+            "idx"
+        ]
         # Avoid computing levels unless the result of the merge is empty,
         # which suggests that a KeyError should be raised.
         if len(result) == 0:

I note in passing for pandas-2 compat, the merge in _compute_validity_mask should be a left join and a KeyError should be raised for any missing keys (if the null_count of the indices is greater than 0).

But I can't really follow all the back-and-forth here, so let's just leave that for now.

@wence-
Copy link
Contributor

wence- commented Jul 5, 2023

Sorry, I had a review pending but didn't hit go in time. I think this change was wrong unfortunately :(

@galipremsagar
Copy link
Contributor Author

Sorry, I had a review pending but didn't hit go in time. I think this change was wrong unfortunately :(

Apologies for the merge @wence-, I think the solution you pointed out here: #13657 (comment) seems good to me. Could you open a PR with same changes that I can review & approve instead? I'm not on my workstation, or else I would have done it.

@wence-
Copy link
Contributor

wence- commented Jul 5, 2023

#13660

rapids-bot bot pushed a commit that referenced this pull request Jul 6, 2023
#13657 attempted to provide pandas-compatible ordering to the return value from multiindex-based `loc`-indexing. However, one must sort by keys corresponding to both the left and right gather maps of the join, rather than just the right gather map, to ensure that  correct ordering is maintained.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13660
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants