Transfer correct dtype to exploded column #11687

Merged
11 changes: 10 additions & 1 deletion python/cudf/cudf/core/indexed_frame.py
@@ -45,6 +45,7 @@
from cudf.core._base_index import BaseIndex
from cudf.core.column import ColumnBase, as_column, full
from cudf.core.column_accessor import ColumnAccessor
from cudf.core.dtypes import ListDtype
from cudf.core.frame import Frame
from cudf.core.groupby.groupby import GroupBy
from cudf.core.index import Index, RangeIndex, _index_from_columns
@@ -3487,7 +3488,15 @@ def _explode(self, explode_column: Any, ignore_index: bool):
            ],
            explode_column_num,
        )

        # The dtype of the exploded column is the element dtype of the list
        # column that was exploded; this needs to be copied over so that
        # nested struct dtypes maintain their key names.
        element_type = cast(
            ListDtype, self._data[explode_column].dtype
        ).element_type
        exploded[explode_column_num] = exploded[
            explode_column_num
        ]._with_type_metadata(element_type)
Contributor
Shouldn't this copying of type metadata happen after the self._from_columns_like_self call, i.e., on the result of self._from_columns_like_self?

Contributor Author

No, but you would be forgiven for thinking so. This exploits fallthrough behaviour in _from_columns_like_self (although I didn't notice that this was what was going on until responding here) in the case where the dtype shapes of the target and source frames do not match.

_from_columns_like_self attempts to copy type metadata from self onto other. The interesting part for this argument is how the columns are collected:

new_columns = [
    col._with_type_metadata(self_col.dtype)
    for col, self_col in zip(other._columns, self._columns)
]

OK, so for the non-exploded columns this does the right thing (it's a no-op because the dtypes match). For the exploded column the dtypes do not match: the self version is ListDtype(element_type), while the other version is (structurally at least) element_type. So eventually we end up calling struct_column._with_type_metadata(list_dtype):

    def _with_type_metadata(self: StructColumn, dtype: Dtype) -> StructColumn:
        from cudf.core.column import IntervalColumn
        from cudf.core.dtypes import IntervalDtype

        # Check IntervalDtype first because it's a subclass of StructDtype
        if isinstance(dtype, IntervalDtype):
            return IntervalColumn.from_struct_column(self, closed=dtype.closed)
        elif isinstance(dtype, StructDtype):
            return build_struct_column(
                names=dtype.fields.keys(),
                children=tuple(
                    self.base_children[i]._with_type_metadata(dtype.fields[f])
                    for i, f in enumerate(dtype.fields.keys())
                ),
                mask=self.base_mask,
                size=self.size,
                offset=self.offset,
                null_count=self.null_count,
            )

        return self

We don't hit either of the specific cases, so we just return ourselves and come back up through the call tree.

Given the use case, I'm not sure why we don't just update the dtype of the exploded column and then call _from_columns.
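
For concreteness, here is a minimal sketch of the behaviour this change is after (assuming a cudf build that includes this fix); it mirrors the test added in this PR:

import cudf

# A list<struct> column: each row is a list of structs with fields "a" and "b".
sr = cudf.Series([[{"a": 1, "b": 2}], [{"a": 3, "b": 4}]])

# Exploding should yield a struct column whose dtype is the list's element
# type, so the field names "a" and "b" are preserved rather than lost.
exploded = sr.explode()
assert exploded.dtype == sr.dtype.element_type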

Contributor
@galipremsagar Sep 12, 2022

I think we could do either one of these:

  1. Introduce a new parameter, cols_to_ignore_copy_metadata, to _from_columns_like_self that skips copying metadata for the listed columns while constructing the dataframe and only copies the dtypes of columns not named in it (a rough sketch follows at the end of this comment).
  2. Avoid calling self._from_columns_like_self and instead hand-construct the frame, carefully taking into account which dtypes should be copied to the new columns. The reason is that we are constructing a dataframe that is a mix of self-like columns and non-self-like columns.

I feel approach 2 might be the desired solution, as this may be a one-off. But as list & struct operations expand, we might need some capability along the lines of approach 1.

cc: @shwina if you have any insights on this.
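
For illustration only, a rough, hypothetical sketch of what approach (1) might look like; the parameter name comes from this comment, and the signature is a simplification rather than cudf's actual _from_columns_like_self:

def _from_columns_like_self(
    self, columns, column_names, cols_to_ignore_copy_metadata=frozenset()
):
    # Hypothetical: copy dtype metadata from the matching column of self,
    # except for columns the caller explicitly asked us to leave alone.
    new_columns = [
        col
        if name in cols_to_ignore_copy_metadata
        else col._with_type_metadata(self_col.dtype)
        for name, col, self_col in zip(column_names, columns, self._columns)
    ]
    # Construct the new frame from the adjusted columns (assumed helper;
    # details elided).
    return self._from_columns(new_columns, column_names)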

Contributor Author
I've gone for option (2) for now.

        return self._from_columns_like_self(
            exploded,
            self._column_names,
23 changes: 23 additions & 0 deletions python/cudf/cudf/tests/test_list.py
@@ -2,11 +2,13 @@

import functools
import operator
from string import printable

import numpy as np
import pandas as pd
import pyarrow as pa
import pytest
from hypothesis import given, settings, strategies as st

import cudf
from cudf import NA
@@ -107,6 +109,27 @@ def test_listdtype_hash():
assert hash(a) != hash(c)


@settings(deadline=None)
@given(
    st.builds(
        lambda x: [x],
        st.recursive(
            st.integers(min_value=-100, max_value=100),
            lambda c: st.lists(c, min_size=1, max_size=1)
            | st.dictionaries(
                st.text(printable, min_size=1, max_size=20),
                c,
                min_size=1,
                max_size=10,
            ),
        ),
    )
)
def test_list_dtype_explode(x):
    sr = cudf.Series([x])
    assert sr.dtype.element_type == sr.explode().dtype


@pytest.mark.parametrize(
"data",
[