-
Notifications
You must be signed in to change notification settings - Fork 931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transfer correct dtype to exploded column #11687
Transfer correct dtype to exploded column #11687
Conversation
Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so use attach that type metadata before handing back the return value.
).element_type | ||
exploded[explode_column_num] = exploded[ | ||
explode_column_num | ||
]._with_type_metadata(element_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this copying of type metadata happen after self._from_columns_like_self
call? i.e., on the result of self._from_columns_like_self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but you would be forgiven for thinking so. This is exploiting (although I didn't notice that this was what was going on until responding here) fallthrough behaviour in _from_columns_like_self
in the case where the dtype shape of the target and source frames does not match.
_from_columns_like_self
attempts to copy type metadata from self
onto other
. The interesting part for this argument is how the columns are collected
new_columns = [col._with_type_metadata(self_col.dtype) for col, self_col in zip(other._columns, self._columns)]
OK, so for the non-exploded columns this does the right thing (it's a no-op because the dtypes match). For the exploded column the dtypes do not match. The self
version is ListDtype(element_type)
the other
version is (structurally at least) is element_type
. So eventually we end up calling struct_column._with_type_metadata(list_dtype)
:
def _with_type_metadata(self: StructColumn, dtype: Dtype) -> StructColumn:
from cudf.core.column import IntervalColumn
from cudf.core.dtypes import IntervalDtype
# Check IntervalDtype first because it's a subclass of StructDtype
if isinstance(dtype, IntervalDtype):
return IntervalColumn.from_struct_column(self, closed=dtype.closed)
elif isinstance(dtype, StructDtype):
return build_struct_column(
names=dtype.fields.keys(),
children=tuple(
self.base_children[i]._with_type_metadata(dtype.fields[f])
for i, f in enumerate(dtype.fields.keys())
),
mask=self.base_mask,
size=self.size,
offset=self.offset,
null_count=self.null_count,
)
return self
We don't tick one of the specific cases, so just return ourselves, and come back up through the call tree.
Given the use case, I'm not sure why we don't just update the dtype of the exploded column and then call _from_columns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do either one of these:
- Introduce a new parameter
cols_to_ignore_copy_metadata
to_from_columns_like_self
which will ignore copying metadata of the columns while constructing dataframe and will just copy the dtypes of columns that are not incols_to_ignore_copy_metadata
. - Avoid calling
self._from_columns_like_self
but then hand construct the frame by carefully taking into account which dtypes to be copied to new columns. The reason being, we are constructing a dataframe that is a mix of self-like columns and non-self-like columns.
I feel approach 2 might be the desired solution as this may be a one-off. But as list & struct operations expand more we might need to have some kind of capability as mentioned in approach 1.
cc: @shwina if you have any insights on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone for option (2) for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on test strategies.
Partially replicates the implementation of _from_columns_like_self, but makes it explicit where the dtype is cominng from for each column.
Codecov ReportBase: 86.39% // Head: 86.42% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11687 +/- ##
================================================
+ Coverage 86.39% 86.42% +0.03%
================================================
Files 145 145
Lines 23005 23017 +12
================================================
+ Hits 19875 19893 +18
+ Misses 3130 3124 -6
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Instead allow caller of _from_columns_like_self to provide an iterable of dtypes that should override the dtypes picked up from the columns of self.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor refactoring proposals but I'm fine with any resolution on those (changed or not). Approving!
Failing against dask nightly due to dask/dask#9490 |
rerun tests |
@gpucibot merge |
… geocolumns specifically. (#684) This PR follows up rapidsai/cudf#11687 to adapt to the change to `_copy_type_metadata` interface. This PR should unblock CI. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Mark Harris (https://github.com/harrism) - H. Thomson Comer (https://github.com/thomcom) URL: #684
Description
Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so attach that type metadata before handing back the return value.
Exploding a list series should be equivalent to unwrapping one level of list from the dtype, so that
Previously this was not the case, since we would lose the names resulting in
Checklist