[FEA] Support "COLLECT" aggregation on struct columns in cuDF #11907

Closed
GregoryKimball opened this issue Oct 12, 2022 · 3 comments · Fixed by #12290
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@GregoryKimball

Is your feature request related to a problem? Please describe.
I would like to perform COLLECT aggregations on Struct columns. Based on initial triage by @shwina, adding COLLECT to _STRUCT_AGGS in groupby.pyx would be the first step.

Describe the solution you'd like
I would like groupby.agg(list) to be supported for Struct columns.

Describe alternatives you've considered
I couldn't find a way to construct a nested List<Struct> without doing a device-host roundtrip and assembling a new cudf.DataFrame. If there are other tools we could use to pack a Struct column into a List column I would love to know.

Additional context
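Here is an example of trying to pack a Struct column into a List column with groupby.agg(list). The call currently fails; the expected result is shown after the traceback.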

>>> df = cudf.DataFrame(index=[0,0,1], data={'a':[
...    {'k':'v1'}, {'k':'v2'}, {'k':'v3'}
... ]})
>>> df
             a
0  {'k': 'v1'}
0  {'k': 'v2'}
1  {'k': 'v3'}
>>> df.groupby(level=0).agg(list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 458, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 295, in cudf._lib.groupby.GroupBy.aggregate
  File "groupby.pyx", line 185, in cudf._lib.groupby.GroupBy.aggregate_internal
pandas.core.base.DataError: All requested aggregations are unsupported.
>>> expected = cudf.DataFrame({'a':[
...    [{'k':'v1'}, {'k':'v2'}],[{'k':'v3'}]
... ]})
>>> expected
                            a
0  [{'k': 'v1'}, {'k': 'v2'}]
1               [{'k': 'v3'}]
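
For reference, the semantics being requested can be sketched in plain Python, without cuDF or a GPU. This is only an illustrative host-side model of what a COLLECT aggregation should return for the example above; `collect_by_key` is a hypothetical helper, not a cuDF function.

```python
from collections import defaultdict


def collect_by_key(keys, values):
    """Group values by key and collect each group into a list.

    Hypothetical helper that models the requested COLLECT semantics
    on plain Python objects; it is not part of cuDF.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Sort by key so the result mirrors the sorted groupby index.
    return dict(sorted(groups.items()))


keys = [0, 0, 1]
structs = [{"k": "v1"}, {"k": "v2"}, {"k": "v3"}]
collected = collect_by_key(keys, structs)
```

The important detail is that each collected struct keeps its field name `k`, which matches the `expected` frame above.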
@GregoryKimball GregoryKimball added feature request New feature or request Java Affects Java cuDF API. labels Oct 12, 2022
@GregoryKimball

Note that simply adding COLLECT to the supported struct aggregations does produce a List column, but the full dtype is not preserved: the struct field name 'k' comes back as the positional label '0'.

>>> df = cudf.DataFrame(index=[0,0,1], data={'a':[
...    {'k':'v1'}, {'k':'v2'}, {'k':'v3'}
... ]})
>>> df.groupby(level=0).agg(list)
                            a
0  [{'0': 'v1'}, {'0': 'v2'}]
1               [{'0': 'v3'}]

See #11687, #11671.

@ttnghia

ttnghia commented Oct 12, 2022

Sorry, is this FEA for Python or Java?
We already have the collect_list aggregation, and I'm still not clear on how COLLECT differs from it here.

@wence- wence- added Python Affects Python cuDF API. and removed Java Affects Java cuDF API. labels Oct 12, 2022
@wence-

wence- commented Oct 12, 2022

Pretty sure this is for Python (I have updated the labels). As @GregoryKimball notes, if you add "COLLECT" to the set of supported aggregations here

_STRUCT_AGGS = {"CORRELATION", "COVARIANCE"}

then libcudf will generate columns with the correct data. However, since struct dtype field names are managed in Python, the field names are lost. Similar to #11687, I think this can be fixed by transferring the relevant dtype information over post-hoc.
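
To make the "post-hoc" idea concrete, here is a toy sketch on plain Python data rather than cuDF columns. It assumes the result arrives with positional field labels ('0', '1', ...) as in the output shown earlier in the thread, and that the original field names are still known on the Python side. `restore_field_names` is a hypothetical name, not a cuDF API.

```python
def restore_field_names(collected_rows, field_names):
    """Relabel positional struct keys with the original field names.

    collected_rows: list of lists of dicts keyed by '0', '1', ...
    field_names: the original struct field names, in order.
    Hypothetical helper for illustration only.
    """
    restored = []
    for row in collected_rows:
        restored.append(
            [
                {field_names[int(pos)]: val for pos, val in struct.items()}
                for struct in row
            ]
        )
    return restored


# Shape of the data seen when the field names are lost.
raw = [[{"0": "v1"}, {"0": "v2"}], [{"0": "v3"}]]
fixed = restore_field_names(raw, ["k"])
```

In a real fix the relabeling would presumably happen on the dtype rather than on the values, so no data would need to be touched.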

@wence- wence- self-assigned this Dec 1, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 23, 2023
As usual when returning from libcudf, we need to reconstruct a struct
dtype with appropriate labels. For groupby.agg(list) this can be done
by matching on the element_type of the result column and
reconstructing with a new list dtype with a leaf from the original
column.

Closes #11765
Closes #11907

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12290
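
The dtype reconstruction described in the commit message can be modeled roughly as follows. This is a sketch under stated assumptions, not the code from #12290: `StructType`, `ListType`, and `replace_leaf` are hypothetical stand-ins defined here so the example runs without cuDF, and only the `element_type` attribute name is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class StructType:
    """Toy stand-in for a struct dtype: only tracks field names."""
    field_names: Tuple[str, ...]


@dataclass(frozen=True)
class ListType:
    """Toy stand-in for a list dtype wrapping an element type."""
    element_type: Union["ListType", StructType]


def replace_leaf(result_dtype, original_dtype):
    """Walk down element_type and swap the leaf for the original dtype.

    Hypothetical helper: every list level of the result is rebuilt,
    and the innermost (leaf) type is taken from the original column.
    """
    if isinstance(result_dtype, ListType):
        return ListType(replace_leaf(result_dtype.element_type, original_dtype))
    return original_dtype


original = StructType(("k",))                # dtype of the input column
from_libcudf = ListType(StructType(("0",)))  # labels lost in the result
rebuilt = replace_leaf(from_libcudf, original)
```

Because the walk only follows list levels, the same helper would also handle a result that is nested more than one list deep.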