Implement `groupby.head` and `groupby.tail` #12939

wence- · 2023-03-14T15:47:24Z

Description

These methods can be implemented by grouping the dataframe and
then selecting appropriate slices from each group. This is less
memory-efficient than it could be (since the entire grouping must
be constructed before discarding most of it).
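The slice-selection step described above can be sketched as follows. This is a minimal illustration that uses NumPy as a stand-in for device arrays, not the code in this PR: the helper names `group_offsets`, `head_indices`, and `tail_indices` are hypothetical, and only non-negative `n` is handled.

```python
import numpy as np


def group_offsets(sorted_keys):
    """Boundaries of each run of equal keys in an already-sorted key array."""
    sorted_keys = np.asarray(sorted_keys)
    breaks = np.flatnonzero(np.diff(sorted_keys)) + 1
    return np.concatenate(([0], breaks, [len(sorted_keys)]))


def _within_group_positions(take):
    """0, 1, ..., take[g]-1 for each group g, concatenated."""
    first_output_slot = np.cumsum(take) - take
    return np.arange(int(take.sum())) - np.repeat(first_output_slot, take)


def head_indices(offsets, n):
    """Positions (in grouped order) of the first n rows of each group."""
    starts, ends = offsets[:-1], offsets[1:]
    take = np.minimum(ends - starts, n)
    return np.repeat(starts, take) + _within_group_positions(take)


def tail_indices(offsets, n):
    """Positions (in grouped order) of the last n rows of each group."""
    starts, ends = offsets[:-1], offsets[1:]
    take = np.minimum(ends - starts, n)
    return np.repeat(ends - take, take) + _within_group_positions(take)


keys = np.array([1, 0, 1, 0, 1])
values = np.array([10, 20, 30, 40, 50])

# Build the full grouping first (stable sort by key), then slice each group.
order = np.argsort(keys, kind="stable")
grouped_values = values[order]
offsets = group_offsets(keys[order])

head_values = grouped_values[head_indices(offsets, 1)]
tail_values = grouped_values[tail_indices(offsets, 2)]
```

Note that the whole table is reordered before any rows are discarded, which is the memory cost the description refers to.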

Closes [FEA] top N rows by group #2592
Closes [FEA] groupby.head and tail #12245

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cudf/cudf/core/groupby/groupby.py

python/cudf/cudf/tests/test_groupby.py

wence- · 2023-03-14T18:37:58Z

python/cudf/cudf/core/groupby/groupby.py

+        # into the grouping, but that probably requires a new
+        # aggregation scheme in libcudf. This is probably "fast
+        # enough" for most reasonable input sizes.
+        _, offsets, _, group_values = self._grouped()


Note that this runs out of memory (OOMs) in setups where I really think it shouldn't. On an A6000 (48 GiB device memory), I tried creating a dataframe with two integer columns, 10^9 rows, and 10^6 groups (8 GiB of data). `df.groupby("a")._grouped()` consistently OOMs for me.

AIUI this is a sort-based implementation of reordering, which I would have thought runs in place on a copy of the table, so I was expecting a peak of around 16 GiB (and even if it is out of place, not much more than 24 GiB of peak footprint). That said, I don't know enough about the libcudf implementation of groupbys to judge whether my expectation is reasonable.
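The footprint estimate in this comment can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 4-byte integers (the width implied by the quoted 8 GiB figure, but not stated in the comment) and counts only whole-table copies, ignoring any temporary buffers the sort itself may need.

```python
GIB = 2**30

rows = 10**9
columns = 2
bytes_per_value = 4  # assumption: 32-bit integer columns

table_bytes = rows * columns * bytes_per_value
table_gib = table_bytes / GIB

# Original table plus one reordered copy (sort done in place on the copy).
in_place_peak_gib = 2 * table_gib
# Original table, a copy, and a separate output buffer (out-of-place sort).
out_of_place_peak_gib = 3 * table_gib

print(f"table: {table_gib:.2f} GiB")
print(f"expected peak (in-place): {in_place_peak_gib:.2f} GiB")
print(f"expected peak (out-of-place): {out_of_place_peak_gib:.2f} GiB")
```

Under these assumptions both estimates sit well below the 48 GiB of device memory, which is why the observed OOM is surprising.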

shwina

This looks good to me (pending docstring changes). Thanks, @wence-!

wence- · 2023-03-16T15:59:43Z

/merge

Implement groupby.head and groupby.tail

caa2970

- Closes rapidsai#2592 - Closes rapidsai#12245

wence- requested a review from a team as a code owner March 14, 2023 15:47

wence- requested review from galipremsagar and charlesbluca March 14, 2023 15:47

wence- self-assigned this Mar 14, 2023

github-actions bot added the Python (Affects Python cuDF API.) label Mar 14, 2023

wence- added the pandas, improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Mar 14, 2023

wence- added this to the Pandas API Alignment and Coverage milestone Mar 14, 2023

wence- commented Mar 14, 2023

python/cudf/cudf/core/groupby/groupby.py (outdated)

shwina reviewed Mar 14, 2023

python/cudf/cudf/core/groupby/groupby.py

bdice reviewed Mar 14, 2023

python/cudf/cudf/core/groupby/groupby.py

wence- added 3 commits March 14, 2023 17:31

Add examples to docstrings

5d06021

Add nvtx annotations to groupby.head/tail

ceb5681

Use _empty_like not iloc[:0]

f52e030

wence- commented Mar 14, 2023

python/cudf/cudf/tests/test_groupby.py (outdated)

wence- commented Mar 14, 2023

beckernick reviewed Mar 14, 2023

python/cudf/cudf/core/groupby/groupby.py (outdated)

wence- added 3 commits March 15, 2023 10:19

Merge branch 'branch-23.04' into wence/fea/groupby-head-tail

fe0085b

Add optional order-preservation to groupby.head/tail

7273484

Make preserve_order keyword-only

8738313

shwina reviewed Mar 15, 2023

python/cudf/cudf/tests/test_groupby.py

wence- commented Mar 15, 2023

python/cudf/cudf/tests/test_groupby.py

vyasr mentioned this pull request Mar 15, 2023

[FEA] Add a new cuDF option stable_sort that provides ordering guarantees for otherwise nondeterministic APIs #12236

Closed

groupby.head/tail match pandas order by default

ab9c585

shwina reviewed Mar 16, 2023

python/cudf/cudf/core/groupby/groupby.py

shwina approved these changes Mar 16, 2023

Update examples in docstrings

c47652a

rapids-bot bot merged commit 9ceecb1 into rapidsai:branch-23.04 Mar 16, 2023

wence- deleted the wence/fea/groupby-head-tail branch March 16, 2023 19:00
