[BUG] GroupBy Transform return incorrect results #11067

Closed
cdeotte opened this issue Jun 7, 2022 · 3 comments · Fixed by #11068
Labels: bug (Something isn't working), Python (Affects Python cuDF API)


cdeotte commented Jun 7, 2022

Describe the bug
cuDF's groupby transform returns incorrect results

Steps/Code to reproduce bug
When the index contains non-consecutive numbers, cuDF's groupby transform returns incorrect results.

Expected behavior
cuDF groupby transform should return correct results

Environment overview (please complete the following information)
cuDF version: 22.04.00

Additional context
Here is a Jupyter notebook that reproduces the error:
https://github.com/cdeotte/RAPIDS-development/blob/master/bugs/bug005.ipynb

cdeotte added the Needs Triage and bug labels Jun 7, 2022
shwina self-assigned this Jun 7, 2022
shwina added the Python label and removed the Needs Triage label Jun 7, 2022
shwina (Contributor) commented Jun 7, 2022

The issue may simply be that we discard the input's index during a groupby transform, while pandas preserves it:

In [13]: df
Out[13]:
   a  b
7  2  1
6  4  2
5  1  3
4  1  4
3  2  5
2  3  6
1  1  7

In [14]: df.groupby('a').transform('max')
Out[14]:
   b
0  5
1  2
2  7
3  7
4  5
5  6
6  7

In [15]: df.to_pandas().groupby('a').transform('max')
Out[15]:
   b
7  5
6  2
5  7
4  7
3  5
2  6
1  7
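
One plausible way the discarded index turns into visibly wrong values is label alignment on assignment. The sketch below uses pandas only (no GPU needed) with the sample frame from the session above; `reset_index(drop=True)` is a hypothetical stand-in for the cuDF 22.04 behavior, and the column names `b_max_expected` and `b_max_dropped` are made up for illustration. It is not taken from the linked notebook.

```python
import pandas as pd

# Sample frame from the session above, with a descending, non-default index.
df = pd.DataFrame(
    {"a": [2, 4, 1, 1, 2, 3, 1], "b": [1, 2, 3, 4, 5, 6, 7]},
    index=[7, 6, 5, 4, 3, 2, 1],
)

# pandas keeps the input index on the transform result.
expected = df.groupby("a").transform("max")

# Assumption: emulate the reported bug by keeping the values
# but resetting the index to 0..6.
dropped = expected.reset_index(drop=True)

# Column assignment aligns on index labels, so the reset index
# pairs each value with the wrong row (and leaves label 7 unmatched).
out = df.copy()
out["b_max_expected"] = expected["b"]
out["b_max_dropped"] = dropped["b"]
print(out)
```

With the index preserved, each row gets its own group maximum. With the index reset, the values are still correct positionally but land on the wrong rows once they are aligned back to `df`.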

bdice (Contributor) commented Jun 7, 2022

@shwina I think we may be susceptible to this class of error (dropping an input’s index or name when creating the output) in other places too. I caught a similar problem in #10715. If you can determine whether this is (1) not being tested or (2) not being caught by tests, that would be helpful.

(edit: this problem is slightly different than the class of error I had in mind -- nothing special is needed here. We're probably fine.)
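
For the question of whether tests would catch a dropped index, a minimal sketch of an index-aware check might look like the following. It is written against pandas so it runs without a GPU; the helper name `assert_transform_keeps_index` is hypothetical and this is not the test added in #11068. For cuDF, the assumption is that the result would first be converted with `.to_pandas()`, as in the session above.

```python
import pandas as pd


def assert_transform_keeps_index(source, result):
    """Fail if a groupby-transform result does not carry the source index."""
    pd.testing.assert_index_equal(result.index, source.index)


df = pd.DataFrame(
    {"a": [2, 4, 1, 1, 2, 3, 1], "b": [1, 2, 3, 4, 5, 6, 7]},
    index=[7, 6, 5, 4, 3, 2, 1],
)

# Reference behavior: the result keeps the input index, so the check passes.
good = df.groupby("a").transform("max")
assert_transform_keeps_index(df, good)

# Emulated bug: same values, default 0..6 index. The check should fail.
bad = good.reset_index(drop=True)
try:
    assert_transform_keeps_index(df, bad)
    caught = False
except AssertionError:
    caught = True
print(caught)
```

A comparison that looks only at values would pass for both `good` and `bad`, which is why the check compares the index explicitly and uses a non-default index in the input.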

shwina (Contributor) commented Jun 7, 2022

I verified that #11068 fixes this issue.

rapids-bot bot pushed a commit that referenced this issue Jun 8, 2022
I believe this should close #11067, but I'm unable to reproduce the original bug locally. Will report back here once I'm able to do that.

Edit: it does.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #11068