Is your feature request related to a problem? Please describe.
I would like to perform COLLECT aggregations on Struct columns. Based on initial triage by @shwina, adding COLLECT to _STRUCT_AGGS in groupby.pyx would be the first step.
Describe the solution you'd like
I would like groupby.agg(list) to be supported for Struct columns.
Describe alternatives you've considered
I couldn't find a way to construct a nested List<Struct> without doing a device-host roundtrip and assembling a new cudf.DataFrame. If there are other tools we could use to pack a Struct column into a List column I would love to know.
Additional context
Here is an example of packing a Struct column into a List column.
>>> df = cudf.DataFrame(index=[0,0,1], data={'a':[
... {'k':'v1'}, {'k':'v2'}, {'k':'v3'}
... ]})
>>> df
a
0 {'k': 'v1'}
0 {'k': 'v2'}
1 {'k': 'v3'}
>>> df.groupby(level=0).agg(list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 458, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 295, in cudf._lib.groupby.GroupBy.aggregate
  File "groupby.pyx", line 185, in cudf._lib.groupby.GroupBy.aggregate_internal
pandas.core.base.DataError: All requested aggregations are unsupported.
>>> expected = cudf.DataFrame({'a':[
... [{'k':'v1'}, {'k':'v2'}],[{'k':'v3'}]
... ]})
>>> expected
a
0 [{'k': 'v1'}, {'k': 'v2'}]
1 [{'k': 'v3'}]
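For comparison, the grouping that the request asks for can be sketched on the host in plain Python. This is only an illustration of the expected semantics (the device-host roundtrip the issue wants to avoid), not cudf code; `collect_by_key` is a hypothetical helper name, and the structs are modelled as ordinary dicts.

```python
from collections import defaultdict


def collect_by_key(keys, values):
    """Hypothetical helper: gather values into one list per key, keys sorted."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        # Values keep their original relative order within each group.
        groups[key].append(value)
    return dict(sorted(groups.items()))


# Same data as the example above: index [0, 0, 1] and one struct per row.
index = [0, 0, 1]
structs = [{"k": "v1"}, {"k": "v2"}, {"k": "v3"}]

collected = collect_by_key(index, structs)
print(collected)
# {0: [{'k': 'v1'}, {'k': 'v2'}], 1: [{'k': 'v3'}]}
```

The result matches the shape of `expected` above: one list of structs per group key.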
then libcudf will generate columns with the correct data. However, since struct dtype field names are managed in Python, the field names are lost. Similar to #11687, I think this can be fixed by transferring the relevant dtype information over post-hoc.
As usual when returning from libcudf, we need to reconstruct a struct
dtype with appropriate labels. For groupby.agg(list) this can be done
by matching on the element_type of the result column and
building a new list dtype whose leaf is the struct dtype of the
original column.
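The idea can be sketched with a small stand-alone model. The `StructType` and `ListType` classes and the `relabel` function below are hypothetical stand-ins written for illustration, not the cudf dtypes or the code in the PR; they only show how matching on `element_type` lets the labelled struct dtype be put back at the leaf of the list dtype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructType:
    """Hypothetical stand-in for a struct dtype: (field name, field type) pairs."""
    fields: tuple


@dataclass(frozen=True)
class ListType:
    """Hypothetical stand-in for a list dtype wrapping an element type."""
    element_type: object


def relabel(result_type, original_type):
    """Rebuild result_type, substituting original_type at the struct leaf."""
    if isinstance(result_type, ListType):
        # Recurse through list nesting, rebuilding each level on the way out.
        return ListType(relabel(result_type.element_type, original_type))
    if isinstance(result_type, StructType):
        # The leaf struct lost its labels; reuse the original column's dtype.
        return original_type
    return result_type


# The original column's dtype carries the real field name "k".
original = StructType(fields=(("k", "str"),))
# Assumed shape of the result after aggregation: a list of structs
# whose field names have been replaced by placeholders.
unlabeled = ListType(StructType(fields=(("0", "str"),)))

restored = relabel(unlabeled, original)
print(restored.element_type.fields)
# (('k', 'str'),)
```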
Closes #11765
Closes #11907
Authors:
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- Ashwin Srinath (https://github.com/shwina)
URL: #12290