Transfer correct dtype to exploded column #11687

wence- · 2022-09-12T15:35:17Z

Description

Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so attach that type metadata before handing back the return value.

Exploding a list series should be equivalent to unwrapping one level of list from the dtype, so that

x = cudf.Series([[{'a': 'b'}]])
x.explode().dtype == x.dtype.element_type

Previously this was not the case, since we would lose the names resulting in

x.explode().dtype == StructDtype({'0': dtype('O')})

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so use attach that type metadata before handing back the return value.

wence- · 2022-09-12T15:35:40Z

cc @GregoryKimball @galipremsagar

galipremsagar · 2022-09-12T15:36:41Z

python/cudf/cudf/core/indexed_frame.py

+        ).element_type
+        exploded[explode_column_num] = exploded[
+            explode_column_num
+        ]._with_type_metadata(element_type)


Shouldn't this copying of type metadata happen after self._from_columns_like_self call? i.e., on the result of self._from_columns_like_self?

No, but you would be forgiven for thinking so. This is exploiting (although I didn't notice that this was what was going on until responding here) fallthrough behaviour in _from_columns_like_self in the case where the dtype shape of the target and source frames does not match.

_from_columns_like_self attempts to copy type metadata from self onto other. The interesting part for this argument is how the columns are collected

new_columns = [col._with_type_metadata(self_col.dtype) for col, self_col in zip(other._columns, self._columns)]

OK, so for the non-exploded columns this does the right thing (it's a no-op because the dtypes match). For the exploded column the dtypes do not match. The self version is ListDtype(element_type) the other version is (structurally at least) is element_type. So eventually we end up calling struct_column._with_type_metadata(list_dtype):

def _with_type_metadata(self: StructColumn, dtype: Dtype) -> StructColumn: from cudf.core.column import IntervalColumn from cudf.core.dtypes import IntervalDtype # Check IntervalDtype first because it's a subclass of StructDtype if isinstance(dtype, IntervalDtype): return IntervalColumn.from_struct_column(self, closed=dtype.closed) elif isinstance(dtype, StructDtype): return build_struct_column( names=dtype.fields.keys(), children=tuple( self.base_children[i]._with_type_metadata(dtype.fields[f]) for i, f in enumerate(dtype.fields.keys()) ), mask=self.base_mask, size=self.size, offset=self.offset, null_count=self.null_count, ) return self

We don't tick one of the specific cases, so just return ourselves, and come back up through the call tree.

Given the use case, I'm not sure why we don't just update the dtype of the exploded column and then call _from_columns.

I think we could do either one of these:

Introduce a new parameter cols_to_ignore_copy_metadata to _from_columns_like_self which will ignore copying metadata of the columns while constructing dataframe and will just copy the dtypes of columns that are not in cols_to_ignore_copy_metadata.

Avoid calling self._from_columns_like_self but then hand construct the frame by carefully taking into account which dtypes to be copied to new columns. The reason being, we are constructing a dataframe that is a mix of self-like columns and non-self-like columns.

I feel approach 2 might be the desired solution as this may be a one-off. But as list & struct operations expand more we might need to have some kind of capability as mentioned in approach 1.

cc: @shwina if you have any insights on this.

I've gone for option (2) for now.

bdice

A few comments on test strategies.

python/cudf/cudf/tests/test_list.py

Partially replicates the implementation of _from_columns_like_self, but makes it explicit where the dtype is cominng from for each column.

codecov · 2022-09-13T11:37:52Z

Codecov Report

Base: 86.39% // Head: 86.42% // Increases project coverage by +0.03% 🎉

Coverage data is based on head (ed64e6a) compared to base (9f8db66).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head ed64e6a differs from pull request most recent head 2e06a63. Consider uploading reports for the commit 2e06a63 to get more accurate results

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.10   #11687      +/-   ##
================================================
+ Coverage         86.39%   86.42%   +0.03%     
================================================
  Files               145      145              
  Lines             23005    23017      +12     
================================================
+ Hits              19875    19893      +18     
+ Misses             3130     3124       -6

Impacted Files	Coverage Δ
python/cudf/cudf/utils/ioutils.py	`79.47% <ø> (ø)`
python/cudf/cudf/core/_base_index.py	`85.54% <100.00%> (ø)`
python/cudf/cudf/core/column/column.py	`88.62% <100.00%> (ø)`
python/cudf/cudf/core/column/lists.py	`93.75% <100.00%> (+1.10%)`	⬆️
python/cudf/cudf/core/dataframe.py	`93.77% <100.00%> (+0.04%)`	⬆️
python/cudf/cudf/core/frame.py	`93.69% <100.00%> (+0.01%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.22% <100.00%> (+0.31%)`	⬆️
python/cudf/cudf/core/index.py	`92.63% <100.00%> (ø)`
python/cudf/cudf/core/indexed_frame.py	`92.19% <100.00%> (+0.01%)`	⬆️
python/cudf/cudf/core/multiindex.py	`92.46% <100.00%> (ø)`
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

python/cudf/cudf/core/indexed_frame.py

Instead allow caller of _from_columns_like_self to provide an iterable of dtypes that should override the dtypes picked up from the columns of self.

python/cudf/cudf/core/indexed_frame.py

python/cudf/cudf/core/frame.py

python/cudf/cudf/core/indexed_frame.py

bdice

A couple minor refactoring proposals but I'm fine with any resolution on those (changed or not). Approving!

wence- · 2022-09-14T13:33:20Z

Failing against dask nightly due to dask/dask#9490

galipremsagar · 2022-09-19T17:51:48Z

rerun tests

galipremsagar · 2022-09-19T18:29:35Z

@gpucibot merge

… geocolumns specifically. (#684) This PR follows up rapidsai/cudf#11687 to adapt to the change to `_copy_type_metadata` interface. This PR should unblock CI. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Mark Harris (https://github.com/harrism) - H. Thomson Comer (https://github.com/thomcom) URL: #684

wence- added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Sep 12, 2022

wence- requested a review from a team as a code owner September 12, 2022 15:35

wence- requested review from bdice and mroeschke September 12, 2022 15:35

galipremsagar reviewed Sep 12, 2022

View reviewed changes

bdice reviewed Sep 12, 2022

View reviewed changes

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved

mroeschke reviewed Sep 12, 2022

View reviewed changes

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved

wence- added 3 commits September 13, 2022 10:19

No hypothesis this time

a15731d

Factor out index metadata transfer into separate method

08a2aa7

Sui generis copy of type metadata in _explode

b86fdaf

Partially replicates the implementation of _from_columns_like_self, but makes it explicit where the dtype is cominng from for each column.

wence- commented Sep 13, 2022

View reviewed changes

python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved

Back out by-hand type construction in _explode

c1bb83e

Instead allow caller of _from_columns_like_self to provide an iterable of dtypes that should override the dtypes picked up from the columns of self.

wence- commented Sep 13, 2022

View reviewed changes

python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved

wence- commented Sep 13, 2022

View reviewed changes

python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved

More new keyword arguments

9d72177

bdice reviewed Sep 13, 2022

View reviewed changes

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

bdice reviewed Sep 13, 2022

View reviewed changes

python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved

bdice approved these changes Sep 13, 2022

View reviewed changes

wence- added 2 commits September 14, 2022 09:31

No need for enumerate

3e409e2

Replace incomprehensible Haskell with imperative code

2e06a63

mroeschke approved these changes Sep 14, 2022

View reviewed changes

galipremsagar approved these changes Sep 19, 2022

View reviewed changes

galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 19, 2022

rapids-bot bot merged commit b2ffea7 into rapidsai:branch-22.10 Sep 19, 2022

wence- deleted the wence/fix/explode-nested-dtype-copy branch September 20, 2022 08:51

isVoid mentioned this pull request Sep 21, 2022

Rewrite copy_type_metadata to reuse the inherited method and handle geocolumns specifically. rapidsai/cuspatial#684

Merged

3 tasks

GregoryKimball mentioned this pull request Oct 12, 2022

[FEA] Support "COLLECT" aggregation on struct columns in cuDF #11907

Closed

GregoryKimball mentioned this pull request Oct 21, 2022

[QST] list-aggregation for struct columns #11765

Closed

GregoryKimball mentioned this pull request Nov 28, 2022

[BUG] Transfer correct dtype to columns returned by list.get #12248

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transfer correct dtype to exploded column #11687

Transfer correct dtype to exploded column #11687

wence- commented Sep 12, 2022 •

edited by vyasr

Loading

wence- commented Sep 12, 2022

galipremsagar Sep 12, 2022

wence- Sep 12, 2022

galipremsagar Sep 12, 2022 •

edited

Loading

wence- Sep 13, 2022

bdice left a comment

codecov bot commented Sep 13, 2022 •

edited

Loading

bdice left a comment

wence- commented Sep 14, 2022

galipremsagar commented Sep 19, 2022

galipremsagar commented Sep 19, 2022

Transfer correct dtype to exploded column #11687

Transfer correct dtype to exploded column #11687

Conversation

wence- commented Sep 12, 2022 • edited by vyasr Loading

Description

Checklist

wence- commented Sep 12, 2022

galipremsagar Sep 12, 2022

Choose a reason for hiding this comment

wence- Sep 12, 2022

Choose a reason for hiding this comment

galipremsagar Sep 12, 2022 • edited Loading

Choose a reason for hiding this comment

wence- Sep 13, 2022

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

codecov bot commented Sep 13, 2022 • edited Loading

Codecov Report

bdice left a comment

Choose a reason for hiding this comment

wence- commented Sep 14, 2022

galipremsagar commented Sep 19, 2022

galipremsagar commented Sep 19, 2022

wence- commented Sep 12, 2022 •

edited by vyasr

Loading

galipremsagar Sep 12, 2022 •

edited

Loading

codecov bot commented Sep 13, 2022 •

edited

Loading