[FEA] Support "COLLECT" aggregation on struct columns in cuDF #11907

GregoryKimball · 2022-10-12T03:07:34Z

Is your feature request related to a problem? Please describe.
I would like to perform COLLECT aggregations on Struct columns. Based on initial triage by @shwina, adding COLLECT to _STRUCT_AGGS in groupby.pyx would be the first step.

Describe the solution you'd like
I would like groupby.agg(list) to be supported for Struct columns.

Describe alternatives you've considered
I couldn't find a way to construct a nested List<Struct> without doing a device-host roundtrip and assembling a new cudf.DataFrame. If there are other tools we could use to pack a Struct column into a List column, I would love to know.

Additional context
Here is an example of trying to pack a Struct column into a List column with groupby.agg(list), the error it currently raises, and the expected result.

>>> df = cudf.DataFrame(index=[0,0,1], data={'a':[
...    {'k':'v1'}, {'k':'v2'}, {'k':'v3'}
... ]})
>>> df
             a
0  {'k': 'v1'}
0  {'k': 'v2'}
1  {'k': 'v3'}
>>> df.groupby(level=0).agg(list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 458, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 295, in cudf._lib.groupby.GroupBy.aggregate
  File "groupby.pyx", line 185, in cudf._lib.groupby.GroupBy.aggregate_internal
pandas.core.base.DataError: All requested aggregations are unsupported.
>>> expected = cudf.DataFrame({'a':[
...    [{'k':'v1'}, {'k':'v2'}],[{'k':'v3'}]
... ]})
>>> expected
                            a
0  [{'k': 'v1'}, {'k': 'v2'}]
1               [{'k': 'v3'}]

GregoryKimball · 2022-10-12T03:12:25Z

Note that simply adding COLLECT to _STRUCT_AGGS is enough to produce a List column, but the full dtype is not preserved: the struct field name 'k' comes back as '0'.

>>> df = cudf.DataFrame(index=[0,0,1], data={'a':[
...    {'k':'v1'}, {'k':'v2'}, {'k':'v3'}
... ]})
>>> df.groupby(level=0).agg(list)
                            a
0  [{'0': 'v1'}, {'0': 'v2'}]
1               [{'0': 'v3'}]

See #11687, #11671.

ttnghia · 2022-10-12T03:46:38Z

Sorry, is this FEA for Python or Java?
We already have the collect_list aggregation, and I'm still not clear on how COLLECT differs from it here.

wence- · 2022-10-12T08:24:33Z

Pretty sure this is for Python (I have updated the labels). As @GregoryKimball notes, if you add "COLLECT" to the set of supported aggregations here

cudf/python/cudf/cudf/_lib/groupby.pyx

Line 61 in 387192c

_STRUCT_AGGS = {"CORRELATION", "COVARIANCE"}

then libcudf will generate columns with the correct data. However, since struct dtype field names are managed in Python, the field names are lost. Similar to #11687, I think this can be fixed by transferring the relevant dtype information over post-hoc.

As usual when returning from libcudf, we need to reconstruct a struct dtype with appropriate labels. For groupby.agg(list) this can be done by matching on the element_type of the result column and reconstructing with a new list dtype with a leaf from the original column.

Closes #11765
Closes #11907

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #12290
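
The dtype reconstruction described above can be illustrated with a small, self-contained sketch. This is not cuDF's implementation: ScalarType, StructType, ListType, relabel, and restore_list_agg_dtype are all hypothetical stand-ins written in plain Python, used only to show the idea of taking the list-of-struct dtype that comes back with placeholder field names and rebuilding it with the field names of the original struct column as the leaf.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for dtypes; these are NOT cuDF classes.
@dataclass(frozen=True)
class ScalarType:
    name: str  # e.g. "str" or "int64"


@dataclass(frozen=True)
class StructType:
    fields: tuple  # tuple of (field_name, dtype) pairs, in column order


@dataclass(frozen=True)
class ListType:
    element_type: "DType"


DType = Union[ScalarType, StructType, ListType]


def relabel(result: DType, original: DType) -> DType:
    """Copy field names from `original` onto a structurally matching `result`."""
    if isinstance(result, StructType) and isinstance(original, StructType):
        if len(result.fields) != len(original.fields):
            raise ValueError("struct dtypes have different numbers of fields")
        # Keep the child dtypes of the result, take the names from the original.
        return StructType(tuple(
            (name, relabel(child, original_child))
            for (_, child), (name, original_child)
            in zip(result.fields, original.fields)
        ))
    if isinstance(result, ListType) and isinstance(original, ListType):
        return ListType(relabel(result.element_type, original.element_type))
    return result


def restore_list_agg_dtype(result: DType, original: DType) -> DType:
    """For agg(list): wrap the original struct dtype's labels in a new list dtype."""
    if isinstance(result, ListType) and isinstance(original, StructType):
        return ListType(relabel(result.element_type, original))
    return result


# The input column was Struct<k: str>; the aggregation result came back as
# List<Struct<0: str>> because the field names were lost on the way.
original = StructType((("k", ScalarType("str")),))
from_aggregation = ListType(StructType((("0", ScalarType("str")),)))
restored = restore_list_agg_dtype(from_aggregation, original)
```

Here restored compares equal to ListType(StructType((("k", ScalarType("str")),))), which mirrors the expected [{'k': 'v1'}, {'k': 'v2'}] output from the issue rather than the [{'0': 'v1'}, {'0': 'v2'}] output seen when the names are dropped.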

GregoryKimball added the feature request and Java labels Oct 12, 2022

GregoryKimball added this to the List and Struct data types and operations milestone Oct 12, 2022

wence- added the Python label and removed the Java label Oct 12, 2022

wence- self-assigned this Dec 1, 2022

This was referenced Dec 1, 2022

[QST] list-aggregation for struct columns #11765

Closed

Reconstruct dtypes correctly for list aggs of struct columns #12290

Merged

rapids-bot bot closed this as completed in #12290 Jan 23, 2023
