Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer correct dtype to exploded column #11687

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Sep 12, 2022

Description

Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so attach that type metadata before handing back the return value.

Exploding a list series should be equivalent to unwrapping one level of list from the dtype, so that

x = cudf.Series([[{'a': 'b'}]])
x.explode().dtype == x.dtype.element_type

Previously this was not the case, since we would lose the names resulting in

x.explode().dtype == StructDtype({'0': dtype('O')})

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Since libcudf doesn't keep track of StructDtype key names,
round-tripping through outer_explode loses information. We know the
correct dtype, since it is the element_type of the exploded list
column, so use attach that type metadata before handing back the
return value.
@wence- wence- added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Sep 12, 2022
@wence- wence- requested a review from a team as a code owner September 12, 2022 15:35
@wence- wence- requested review from bdice and mroeschke September 12, 2022 15:35
@wence-
Copy link
Contributor Author

wence- commented Sep 12, 2022

cc @GregoryKimball @galipremsagar

).element_type
exploded[explode_column_num] = exploded[
explode_column_num
]._with_type_metadata(element_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this copying of type metadata happen after self._from_columns_like_self call? i.e., on the result of self._from_columns_like_self?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but you would be forgiven for thinking so. This is exploiting (although I didn't notice that this was what was going on until responding here) fallthrough behaviour in _from_columns_like_self in the case where the dtype shape of the target and source frames does not match.

_from_columns_like_self attempts to copy type metadata from self onto other. The interesting part for this argument is how the columns are collected

new_columns = [col._with_type_metadata(self_col.dtype) for col, self_col in zip(other._columns, self._columns)]

OK, so for the non-exploded columns this does the right thing (it's a no-op because the dtypes match). For the exploded column the dtypes do not match. The self version is ListDtype(element_type) the other version is (structurally at least) is element_type. So eventually we end up calling struct_column._with_type_metadata(list_dtype):

    def _with_type_metadata(self: StructColumn, dtype: Dtype) -> StructColumn:
        from cudf.core.column import IntervalColumn
        from cudf.core.dtypes import IntervalDtype

        # Check IntervalDtype first because it's a subclass of StructDtype
        if isinstance(dtype, IntervalDtype):
            return IntervalColumn.from_struct_column(self, closed=dtype.closed)
        elif isinstance(dtype, StructDtype):
            return build_struct_column(
                names=dtype.fields.keys(),
                children=tuple(
                    self.base_children[i]._with_type_metadata(dtype.fields[f])
                    for i, f in enumerate(dtype.fields.keys())
                ),
                mask=self.base_mask,
                size=self.size,
                offset=self.offset,
                null_count=self.null_count,
            )

        return self

We don't tick one of the specific cases, so just return ourselves, and come back up through the call tree.

Given the use case, I'm not sure why we don't just update the dtype of the exploded column and then call _from_columns.

Copy link
Contributor

@galipremsagar galipremsagar Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do either one of these:

  1. Introduce a new parameter cols_to_ignore_copy_metadata to _from_columns_like_self which will ignore copying metadata of the columns while constructing dataframe and will just copy the dtypes of columns that are not in cols_to_ignore_copy_metadata.
  2. Avoid calling self._from_columns_like_self but then hand construct the frame by carefully taking into account which dtypes to be copied to new columns. The reason being, we are constructing a dataframe that is a mix of self-like columns and non-self-like columns.

I feel approach 2 might be the desired solution as this may be a one-off. But as list & struct operations expand more we might need to have some kind of capability as mentioned in approach 1.

cc: @shwina if you have any insights on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone for option (2) for now.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on test strategies.

Partially replicates the implementation of _from_columns_like_self,
but makes it explicit where the dtype is cominng from for each column.
@codecov
Copy link

codecov bot commented Sep 13, 2022

Codecov Report

Base: 86.39% // Head: 86.42% // Increases project coverage by +0.03% 🎉

Coverage data is based on head (ed64e6a) compared to base (9f8db66).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head ed64e6a differs from pull request most recent head 2e06a63. Consider uploading reports for the commit 2e06a63 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.10   #11687      +/-   ##
================================================
+ Coverage         86.39%   86.42%   +0.03%     
================================================
  Files               145      145              
  Lines             23005    23017      +12     
================================================
+ Hits              19875    19893      +18     
+ Misses             3130     3124       -6     
Impacted Files Coverage Δ
python/cudf/cudf/utils/ioutils.py 79.47% <ø> (ø)
python/cudf/cudf/core/_base_index.py 85.54% <100.00%> (ø)
python/cudf/cudf/core/column/column.py 88.62% <100.00%> (ø)
python/cudf/cudf/core/column/lists.py 93.75% <100.00%> (+1.10%) ⬆️
python/cudf/cudf/core/dataframe.py 93.77% <100.00%> (+0.04%) ⬆️
python/cudf/cudf/core/frame.py 93.69% <100.00%> (+0.01%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.22% <100.00%> (+0.31%) ⬆️
python/cudf/cudf/core/index.py 92.63% <100.00%> (ø)
python/cudf/cudf/core/indexed_frame.py 92.19% <100.00%> (+0.01%) ⬆️
python/cudf/cudf/core/multiindex.py 92.46% <100.00%> (ø)
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Instead allow caller of _from_columns_like_self to provide an iterable
of dtypes that should override the dtypes picked up from the columns
of self.
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor refactoring proposals but I'm fine with any resolution on those (changed or not). Approving!

@wence-
Copy link
Contributor Author

wence- commented Sep 14, 2022

Failing against dask nightly due to dask/dask#9490

@galipremsagar
Copy link
Contributor

rerun tests

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 19, 2022
@galipremsagar
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b2ffea7 into rapidsai:branch-22.10 Sep 19, 2022
@wence- wence- deleted the wence/fix/explode-nested-dtype-copy branch September 20, 2022 08:51
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Sep 21, 2022
… geocolumns specifically. (#684)

This PR follows up rapidsai/cudf#11687 to adapt to the change to `_copy_type_metadata` interface. This PR should unblock CI.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - H. Thomson Comer (https://github.com/thomcom)

URL: #684
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants