-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Introduce parity with pandas for MultiIndex.loc
ordering & fix a bug in Groupby
with as_index
#13657
[REVIEW] Introduce parity with pandas for MultiIndex.loc
ordering & fix a bug in Groupby
with as_index
#13657
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine. Once this is merged, please get the permalink for this TODO and copy it into some issue about deterministic join/merge. There is at least one existing issue on this topic but I can’t search for it at this moment.
/merge |
if cudf.get_option("mode.pandas_compatible"): | ||
# Sort indices in pandas compatible mode | ||
# because we want the indices to be fetched | ||
# in a deterministic order. | ||
# TODO: Remove this after merge/join | ||
# obtain deterministic ordering. | ||
indices = indices.sort_values() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this can't do the right thing, because the indices you get back from the merge that lives inside _get_valid_indices_by_tuple
don't induce the order you want on the result.
For example (with this change applied):
import pandas as pd
import cudf
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [1, 0, 1, 1], "c": [1, 2, 3, 4]}).set_index(["a", "b"])
cdf = cudf.from_pandas()
# pandas behaviour
In [24]: df.loc[([1, 1], [0, 1]), :]
Out[24]:
c
a b
1 0 2
1 1
In [25]: df.loc[([1, 1], [1, 0]), :]
Out[25]:
c
a b
1 1 1
0 2
# (new) cudf behaviour
n [26]: cdf.loc[([1, 1], [0, 1]), :]
Out[26]:
c
a b
1 1 1
0 2
In [27]: cdf.loc[([1, 1], [1, 0]), :]
Out[27]:
c
a b
1 1 1
0 2
Note how in the cudf case both return values are now ordered the same.
Do you instead want something like this in _compute_validity_mask
?
diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py
index d8a08647f3..0ee741e066 100644
--- a/python/cudf/cudf/core/multiindex.py
+++ b/python/cudf/cudf/core/multiindex.py
@@ -759,6 +759,8 @@ class MultiIndex(Frame, BaseIndex, NotIterable):
if isinstance(row, slice) and row == slice(None):
continue
lookup[i] = cudf.Series(row)
+ cantor_name = "_" + "_".join(map(str, lookup.columns))
+ lookup[cantor_name] = column.arange(len(lookup))
frame = cudf.DataFrame(dict(enumerate(index._data.columns)))
data_table = cudf.concat(
[
@@ -769,7 +771,9 @@ class MultiIndex(Frame, BaseIndex, NotIterable):
],
axis=1,
)
- result = lookup.merge(data_table)["idx"]
+ result = lookup.merge(data_table).sort_values(by=[cantor_name, "idx"])[
+ "idx"
+ ]
# Avoid computing levels unless the result of the merge is empty,
# which suggests that a KeyError should be raised.
if len(result) == 0:
I note in passing for pandas-2 compat, the merge in _compute_validity_mask
should be a left join and a KeyError
should be raised for any missing keys (if the null_count
of the indices is greater than 0).
But I can't really follow all the back-and-forth here, so let's just leave that for now.
Sorry, I had a review pending but didn't hit go in time. I think this change was wrong unfortunately :( |
Apologies for the merge @wence-, I think the solution you pointed out here: #13657 (comment) seems good to me. Could you open a PR with same changes that I can review & approve instead? I'm not on my workstation, or else I would have done it. |
#13657 attempted to provide pandas-compatible ordering to the return value from multiindex-based `loc`-indexing. However, one must sort by keys corresponding to both the left and right gather maps of the join, rather than just the right gather map, to ensure that correct ordering is maintained. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - https://github.com/brandon-b-miller - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13660
Description
This PR:
MultiIndex.loc
when pandas compatibility mode is enabled.Groupby
whenas_index
is passed and is not being respected when a getitem call is performed on the groupby.Checklist