Implement groupby.head and groupby.tail #12939

Merged
merged 9 commits into rapidsai:branch-23.04 from wence/fea/groupby-head-tail
Mar 16, 2023

Conversation

wence-
Contributor

@wence- wence- commented Mar 14, 2023

Description

These methods can be implemented by grouping the dataframe and
then selecting appropriate slices from each group. This is less
memory-efficient than it could be (since the entire grouping must
be constructed before discarding most of it).
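
As a rough illustration of the approach described above (group first, then slice each group), here is a minimal pure-Python sketch. It is not the cuDF implementation: `group_offsets` and `groupby_head_tail` are hypothetical helpers, and the sketch assumes the pandas-style semantics where a negative `n` excludes rows from the opposite end of each group and the selected rows are returned in their original order.

```python
def group_offsets(keys):
    """Stable sort-based grouping (hypothetical helper).

    Returns the row order after sorting by key and the offsets that
    delimit each group within that order.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)  # stable sort
    offsets = []
    for pos, row in enumerate(order):
        # A new group starts wherever the key changes.
        if pos == 0 or keys[row] != keys[order[pos - 1]]:
            offsets.append(pos)
    offsets.append(len(order))
    return order, offsets


def groupby_head_tail(keys, n, take_head=True):
    """Return row indices of the first (or last) n rows of every group."""
    # The whole grouping is built here, even though most of it is discarded.
    order, offsets = group_offsets(keys)
    selected = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        size = end - start
        # Non-negative n keeps n rows; negative n drops |n| rows instead.
        count = min(n, size) if n >= 0 else max(size + n, 0)
        if take_head:
            selected.extend(order[start:start + count])
        else:
            selected.extend(order[end - count:end])
    return sorted(selected)  # restore original row order


keys = ["a", "b", "a", "b", "a"]
head_rows = groupby_head_tail(keys, 2)                   # first 2 rows per group
tail_rows = groupby_head_tail(keys, 1, take_head=False)  # last row per group
```

The sketch makes the memory trade-off visible: `order` covers every row of the input, even when only a handful of rows per group survive the slicing.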

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner March 14, 2023 15:47
@wence- wence- self-assigned this Mar 14, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 14, 2023
@wence- wence- added pandas improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 14, 2023
# into the grouping, but that probably requires a new
# aggregation scheme in libcudf. This is probably "fast
# enough" for most reasonable input sizes.
_, offsets, _, group_values = self._grouped()
Contributor Author

Note that this OOMs in setups where I really think it shouldn't. On an A6000 (48GiB device memory), I tried creating a dataframe with two integer columns, 10^9 rows, and 10^6 groups [8GiB of data]. df.groupby("a")._grouped() consistently OOMs for me. AIUI this is a sort-based implementation of reordering, which I would have thought runs in-place on a copy of the table, so I was expecting a peak of around 16GiB (and even if it is out-of-place, I would expect a peak footprint of not much more than 24GiB). That said, I don't know enough about the libcudf groupby implementation to know whether my expectation is reasonable.
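
The back-of-the-envelope figures in this comment can be reproduced with a short calculation. This is only a sketch of the stated expectation, not a measurement of libcudf: the 4-byte value width is an assumption inferred from the quoted 8GiB for two columns of 10^9 rows, and the 2x and 3x multipliers simply encode the "in-place on a copy" and "out-of-place" scenarios described above.

```python
GIB = 2**30

rows = 10**9
columns = 2
bytes_per_value = 4  # assumption: 32-bit integers, inferred from the 8GiB figure

data_bytes = rows * columns * bytes_per_value

# Expected peaks, following the reasoning in the comment (not measured):
in_place_peak = 2 * data_bytes      # original table + one copy sorted in place
out_of_place_peak = 3 * data_bytes  # original + copy + separate sorted output

device_bytes = 48 * GIB             # A6000 device memory quoted in the comment

for label, nbytes in [
    ("data", data_bytes),
    ("in-place peak", in_place_peak),
    ("out-of-place peak", out_of_place_peak),
]:
    print(f"{label}: {nbytes / GIB:.1f} GiB")

fits_on_device = out_of_place_peak < device_bytes
```

Under these assumptions even the out-of-place estimate is about half of the device memory, which is why the observed OOM is surprising.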

Contributor

@shwina shwina left a comment

This looks good to me (pending docstring changes). Thanks, @wence-!

@wence-
Contributor Author

wence- commented Mar 16, 2023

/merge

@rapids-bot rapids-bot bot merged commit 9ceecb1 into rapidsai:branch-23.04 Mar 16, 2023
@wence- wence- deleted the wence/fea/groupby-head-tail branch March 16, 2023 19:00
Development

Successfully merging this pull request may close these issues.

[FEA] groupby.head and tail [FEA] top N rows by group
6 participants