[QST] list-aggregation for struct columns #11765

Closed
epifanio opened this issue Sep 26, 2022 · 4 comments · Fixed by #12290
Labels: improvement (Improvement / enhancement to an existing function), Python (Affects Python cuDF API.), question (Further information is requested)

Comments

@epifanio

Running the following code in cudf:

import cudf
import numpy as np
import pandas as pd

data = {
    "a": [1, 10, 10],
    "b": [2, 11, 33],
    "c": [3, 12, 45],
    "d": [4, 13, 34],
    "e": [1, 10, 67],
    "f": [2, 11, 56],
    "g": [3, 12, 56],
    "h": [4, 13, 67],
}
df = pd.DataFrame.from_dict(data)

full_gdf = cudf.from_pandas(df)
gdf = full_gdf[["a", "b", "c", "d"]]
gdf["key"] = np.arange(len(gdf))

melted = gdf.melt(id_vars=["key"], value_name="struct_key_name")  # wide to long format
gdf["new"] = melted.groupby("key").collect()[["struct_key_name"]].to_struct()

records_by_col_a = gdf.groupby("a").agg(
    {"b": list, "c": list, "d": list, "new": list}
)

The gdf dataframe will look like:

	a	b	c	d	key	new
0	1	2	3	4	0	{'struct_key_name': [2, 4, 3, 1]}
1	10	11	12	13	1	{'struct_key_name': [11, 13, 12, 10]}
2	10	33	45	34	2	{'struct_key_name': [33, 34, 45, 10]}

At this point I need to run a groupby.agg() operation, like:

records_by_col_a = gdf.groupby("a").agg(
    { "c": list, "d": list, "new": list}
)

But the field new gets silently ignored, I guess because of the following DataError:

---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
Cell In [79], line 1
----> 1 gdf.groupby("a").agg(
      2     {"new": list}
      3 )

File /opt/conda/envs/rapids-22.10/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /opt/conda/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:458, in GroupBy.agg(self, func)
    449 column_names, columns, normalized_aggs = self._normalize_aggs(func)
    451 # Note: When there are no key columns, the below produces
    452 # a Float64Index, while Pandas returns an Int64Index
    453 # (GH: 6945)
    454 (
    455     result_columns,
    456     grouped_key_cols,
    457     included_aggregations,
--> 458 ) = self._groupby.aggregate(columns, normalized_aggs)
    460 result_index = self.grouping.keys._from_columns_like_self(
    461     grouped_key_cols,
    462 )
    464 multilevel = _is_multi_agg(func)

File groupby.pyx:295, in cudf._lib.groupby.GroupBy.aggregate()

File groupby.pyx:185, in cudf._lib.groupby.GroupBy.aggregate_internal()

DataError: All requested aggregations are unsupported.

which can be triggered by:

gdf.groupby("a").agg(
    {"new": list}
)

Is there an alternative way to get around this problem and create a dataframe that looks like:

c	d	new
3	4	{'struct_key_name': [2, 4, 3, 1]}
[12, 45]	[13, 34]	[{'struct_key_name': [11, 13, 12, 10]}, {'struct_key_name': [33, 34, 45, 10]}]
epifanio added the Needs Triage (Need team to review and classify) and question (Further information is requested) labels Sep 26, 2022
wence- changed the title from "[QST]" to "[QST] list-aggregation for struct columns" Sep 29, 2022
@wence-
Contributor

wence- commented Sep 29, 2022

I guess there are two issues here:

  1. silently ignoring requested aggregations that are unsupported if there are other supported aggregations;
  2. supporting the list aggregation method for struct-based columns.

From cursory examination, it appears that list (which internally turns into "collect") does almost work for struct columns (the struct key names are lost, but I think that is an easy fix). So enabling that would fix both problems.
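To make the first issue concrete, here is a minimal plain-Python model of the two behaviours. It is not cuDF's implementation: the `SUPPORTED` table, the `DataError` stand-in, and both function names are hypothetical, chosen only to mirror what the traceback above suggests (unsupported requests are dropped, and an error is raised only when nothing is left).

```python
class DataError(Exception):
    """Stand-in for the error type seen in the traceback (hypothetical)."""


# Hypothetical support table: which aggregations each dtype accepts.
SUPPORTED = {"int64": {"collect", "sum"}, "struct": set()}


def lenient_filter(requested, dtypes, supported):
    """Drop unsupported requests; fail only if nothing survives."""
    kept = {
        col: agg
        for col, agg in requested.items()
        if agg in supported[dtypes[col]]
    }
    if not kept:
        raise DataError("All requested aggregations are unsupported.")
    return kept


def strict_filter(requested, dtypes, supported):
    """Fail as soon as any single request is unsupported."""
    bad = sorted(
        col for col, agg in requested.items()
        if agg not in supported[dtypes[col]]
    )
    if bad:
        raise DataError(f"Unsupported aggregations for columns: {bad}")
    return dict(requested)


dtypes = {"c": "int64", "d": "int64", "new": "struct"}
requested = {"c": "collect", "d": "collect", "new": "collect"}

# The lenient variant quietly loses the "new" column.
print(sorted(lenient_filter(requested, dtypes, SUPPORTED)))
```

Under this model, `strict_filter(requested, dtypes, SUPPORTED)` raises and names the offending column, which is the kind of feedback the silent drop hides.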

Aside: your example requested output is not quite correct (the row corresponding to the grouping of a == 1 is also list-like).
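The semantics of a list aggregation can be sketched in plain Python, with dicts standing in for struct values. This only illustrates the expected result shape and is not cuDF code; the helper name `collect_by_key` is made up for the example.

```python
from collections import defaultdict


def collect_by_key(rows, key, columns):
    """Group rows by `key` and collect each requested column into a list."""
    groups = defaultdict(lambda: {col: [] for col in columns})
    for row in rows:
        for col in columns:
            groups[row[key]][col].append(row[col])
    return dict(groups)


# The three rows of gdf from the report, as plain dicts.
rows = [
    {"a": 1, "c": 3, "d": 4, "new": {"struct_key_name": [2, 4, 3, 1]}},
    {"a": 10, "c": 12, "d": 13, "new": {"struct_key_name": [11, 13, 12, 10]}},
    {"a": 10, "c": 45, "d": 34, "new": {"struct_key_name": [33, 34, 45, 10]}},
]

result = collect_by_key(rows, "a", ["c", "d", "new"])

# Every group yields lists, including the single-row group a == 1.
print(result[1]["c"])   # [3]
print(result[10]["c"])  # [12, 45]
```

Here the a == 1 group holds one-element lists for every column, so its `new` entry is a list containing a single struct rather than a bare struct.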

wence- added the improvement (Improvement / enhancement to an existing function) label Sep 29, 2022
@GregoryKimball
Contributor

the struct key names are lost, but I think that is an easy fix

@wence- was this solved by 11687, 11671?

GregoryKimball added the Python (Affects Python cuDF API.) label and removed the Needs Triage (Need team to review and classify) label Oct 21, 2022
@wence-
Contributor

wence- commented Oct 21, 2022

@wence- was this solved by 11687, 11671?

I don't think so; the struct name dtype issue needs to be solved sui generis for every case right now :(

wence- self-assigned this Dec 1, 2022
@wence-
Contributor

wence- commented Dec 1, 2022

See also #11907

wence- added a commit to wence-/cudf that referenced this issue Dec 2, 2022
As usual when returning from libcudf, we need to reconstruct a struct
dtype with appropriate labels. For groupby.agg(list) this can be done
by matching on the element_type of the result column and
reconstructing with a new list dtype with a leaf from the original
column.

Closes rapidsai#11765.
wence- added a commit to wence-/cudf that referenced this issue Jan 17, 2023
rapids-bot bot pushed a commit that referenced this issue Jan 23, 2023
As usual when returning from libcudf, we need to reconstruct a struct
dtype with appropriate labels. For groupby.agg(list) this can be done
by matching on the element_type of the result column and
reconstructing with a new list dtype with a leaf from the original
column.

Closes #11765
Closes #11907

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12290
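
The reconstruction described in the commit message can be sketched with toy dtype classes. This is a rough model under stated assumptions, not the code from the pull request: `Int64`, `ListDtype`, `StructDtype`, and `restore_labels` are hypothetical stand-ins, and the placeholder field name "0" is only a guess at what a label-less result might carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int64:
    """Toy leaf dtype (hypothetical)."""


@dataclass(frozen=True)
class ListDtype:
    element_type: object


@dataclass(frozen=True)
class StructDtype:
    fields: tuple  # tuple of (name, dtype) pairs


def restore_labels(result_dtype, original_dtype):
    """Rebuild a list-of-struct dtype using the original struct as the leaf."""
    if (
        isinstance(result_dtype, ListDtype)
        and isinstance(result_dtype.element_type, StructDtype)
        and isinstance(original_dtype, StructDtype)
    ):
        # Match on the element type, then wrap the labelled original dtype.
        return ListDtype(original_dtype)
    return result_dtype


# dtype of the "new" column before aggregation.
original = StructDtype((("struct_key_name", ListDtype(Int64())),))
# Assumed shape of the aggregated result: same layout, field names lost.
unlabelled = ListDtype(StructDtype((("0", ListDtype(Int64())),)))

fixed = restore_labels(unlabelled, original)
print(fixed.element_type.fields[0][0])  # struct_key_name
```

Non-struct results pass through unchanged, so the same helper could be applied to every aggregated column in this toy setting.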