Implement groupby.head and groupby.tail #12939
- Closes rapidsai#2592
- Closes rapidsai#12245
# into the grouping, but that probably requires a new
# aggregation scheme in libcudf. This is probably "fast
# enough" for most reasonable input sizes.
_, offsets, _, group_values = self._grouped()
Note that this OOMs for setups where I really think it shouldn't. I tried (on an A6000, 48GiB device memory) creating a dataframe with two int columns of 10^9 rows (and 10^6 groups) [8GiB data]. df.groupby("a")._grouped() consistently OOMs for me. AIUI this is a sort-based implementation of reordering, which I would have thought runs in-place on a copy of the table, so I was expecting to peak at around 16GiB of data (even if it is out-of-place I would expect not much more than 24GiB of peak footprint). That said, I don't know enough about the libcudf implementation of groupbys to understand if my expectation is reasonable.
This looks good to me (pending docstring changes). Thanks, @wence-!
/merge
Description
These methods can be implemented by grouping the dataframe and
then selecting appropriate slices from each group. This is less
memory-efficient than it could be (since the entire grouping must
be constructed before discarding most of it).
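
As a rough illustration of the approach described above, here is a minimal pure-Python sketch. It is not the cudf implementation: `grouped` and `head_tail_indices` are hypothetical helpers, and plain lists stand in for device columns. It builds the full sort-based grouping (row order plus group offsets), then keeps only a slice of each group, which is why the whole grouping has to exist before most of it is thrown away. Negative `n` and the return of rows in their original order are assumed to follow the pandas `head`/`tail` semantics.

```python
def grouped(keys):
    """Sort-based grouping (hypothetical helper).

    Returns (order, offsets): `order` lists row indices sorted by key
    (stable), and offsets[i]:offsets[i + 1] delimits group i in `order`.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)
    offsets = []
    for pos, idx in enumerate(order):
        # A new group starts wherever the key changes in sorted order.
        if pos == 0 or keys[idx] != keys[order[pos - 1]]:
            offsets.append(pos)
    offsets.append(len(keys))
    return order, offsets


def head_tail_indices(keys, n, tail=False):
    """Row indices selected by a groupby head (or tail) of size n."""
    order, offsets = grouped(keys)
    picked = []
    for start, end in zip(offsets, offsets[1:]):
        size = end - start
        # n >= 0: keep up to n rows; n < 0: keep all but |n| rows.
        k = min(n, size) if n >= 0 else max(size + n, 0)
        if tail:
            picked.extend(order[end - k:end])
        else:
            picked.extend(order[start:start + k])
    # Assumption: rows come back in their original order, as in pandas.
    return sorted(picked)


keys = ["b", "a", "b", "a", "b"]
first_rows = head_tail_indices(keys, 1)
last_rows = head_tail_indices(keys, 1, tail=True)
```

The full `order` list is as long as the input even when only one row per group survives, which is the memory cost the description refers to.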