[C++][Python] Support ExtensionType arrays in more kernels #22304

Open
asfimport opened this issue Jul 9, 2019 · 9 comments

Comments

@asfimport
Collaborator

asfimport commented Jul 9, 2019

From a quick test (through Python), it seems that slice and take work, but the following do not:

  • cast: it could rely on the casting rules for the storage type. Or do we want to require that you explicitly take the storage array before casting?
  • dictionary_encode / unique

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-5890. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
We should make a list of compute functions that could support extension types in a generic way. cc @ianmcook

cast wouldn't work: if you have an extension type representing IPv4 addresses on top of Int32s, the "cast to string" function should output the IPv4 addresses in dotted format ("1.2.3.4"), not in raw decimal format.
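To make the difference concrete, here is a small pure-Python sketch (standard library only, not the Arrow API) contrasting a cast that only looks at the Int32 storage with one that knows the logical type. Both function names are hypothetical and exist only for illustration.

```python
import ipaddress


def cast_storage_to_string(values):
    """Cast the raw int32 storage values to strings, ignoring the logical type."""
    return [None if v is None else str(v) for v in values]


def cast_ipv4_to_string(values):
    """Cast int32-backed IPv4 addresses to dotted-quad strings."""
    out = []
    for v in values:
        if v is None:
            out.append(None)
        else:
            # Int32 storage is signed; mask to reinterpret it as unsigned 32 bits.
            out.append(str(ipaddress.IPv4Address(v & 0xFFFFFFFF)))
    return out


storage = [16909060, None, -1]
print(cast_storage_to_string(storage))  # raw decimal strings
print(cast_ipv4_to_string(storage))     # dotted-quad strings
```

The storage-level cast is well defined, but it answers a different question than the one a user of the IPv4 type is asking.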

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Specifically for the cast case, would we want to have some extension point in the ExtensionType to control the casting?

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
In addition to the dictionary_encode/unique/value_counts from above, some other kernels that could be supported (but are not at the moment):

  • "equal" / "not_equal" (other comparisons are trickier, as they would rely on the ordering of the storage type being meaningful for the logical type)

  • "if_else", "case_when", "choose", "coalesce"

  • "fill_null", "replace_with_mask"

  • "is_in", "index", "index_in"

Already working:

  • "filter", "take"

  • "is_null"

  • "make_struct"

The sorting-related ones (sort_indices, partition_nth_indices, etc.) are probably tricky, as there is no guaranteed link between the order of the storage type and the order of the logical type.
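As an illustration of why equality can be delegated to storage while sorting cannot, consider a hypothetical extension type that stores version numbers as strings. The sketch below is plain Python, not Arrow code, and it assumes each logical value has exactly one storage representation (otherwise even storage equality would be unsafe).

```python
def storage_equal(left, right):
    """Element-wise equality on storage values; a null on either side gives null."""
    return [
        None if a is None or b is None else a == b
        for a, b in zip(left, right)
    ]


def parse_version(text):
    """Logical sort key for the hypothetical version type: numeric components."""
    return tuple(int(part) for part in text.split("."))


versions = ["1.9", "1.10", "1.2"]

# Sorting by storage uses lexicographic string order.
storage_order = sorted(versions)
# Sorting by the logical value compares the numeric components.
logical_order = sorted(versions, key=parse_version)

print(storage_order)
print(logical_order)
print(storage_equal(versions, ["1.9", None, "1.3"]))
```

The two orderings disagree, so a generic sort kernel that only sees the storage would silently return a result that is wrong for the logical type.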

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:

> Specifically for the cast case, would we want to have some extension point in the ExtensionType to control the casting?

Perhaps, though I'm not sure what it should look like. Perhaps like this:

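class ExtensionType {
  // ...

  // Whether values of this extension type can be cast to `to_type`.
  bool CanCastTo(const DataType& to_type);

  // Cast `value` (of this extension type) to `to_type`.
  Result<Datum> CastTo(const Datum& value, std::shared_ptr<DataType> to_type,
                       const CastOptions& options = CastOptions::Safe(),
                       ExecContext* ctx = NULLPTR);
};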
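For readers following along from Python, the same idea can be sketched in a few lines. This is only an illustration of the proposed hook and how a generic cast might consult it; the class and function names (`ExtensionTypeSketch`, `IPv4TypeSketch`, `cast`) are hypothetical and do not exist in pyarrow.

```python
import ipaddress


class ExtensionTypeSketch:
    """Hypothetical base class mirroring the proposed CanCastTo/CastTo hooks."""

    def can_cast_to(self, to_type):
        # By default an extension type opts out of every cast.
        return False

    def cast_to(self, values, to_type):
        raise NotImplementedError


class IPv4TypeSketch(ExtensionTypeSketch):
    """Hypothetical IPv4 type stored as signed 32-bit integers."""

    def can_cast_to(self, to_type):
        return to_type == "string"

    def cast_to(self, values, to_type):
        return [
            None if v is None else str(ipaddress.IPv4Address(v & 0xFFFFFFFF))
            for v in values
        ]


def cast(values, from_type, to_type):
    """Generic cast entry point that asks the extension type first."""
    if isinstance(from_type, ExtensionTypeSketch):
        if from_type.can_cast_to(to_type):
            return from_type.cast_to(values, to_type)
        raise TypeError(f"extension type does not support casting to {to_type!r}")
    raise TypeError("this sketch only handles extension types")


print(cast([16909060, None], IPv4TypeSketch(), "string"))
```

With this shape, an extension type that does not override the hooks is simply rejected by the cast, rather than silently falling back to its storage type.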

cc @bkietz

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
cc @lidavidm

@asfimport
Collaborator Author

David Li / @lidavidm:
Casts are implemented by registering cast_boolean, cast_int8, ..., right? We could perhaps detect a registered "cast_extension_ipv4" or things like that.
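A rough sketch of that lookup idea, in plain Python rather than against the real function registry: cast functions are stored under a name derived from the extension name and looked up at cast time. The registry, the naming scheme, and the helper names here are all assumptions made for illustration, and how the name should relate to the source versus the target type is left open.

```python
# Hypothetical stand-in for a compute function registry: name -> callable.
CAST_REGISTRY = {}


def register_cast(name, func):
    """Register a cast function under a unique name."""
    if name in CAST_REGISTRY:
        raise KeyError(f"cast function {name!r} is already registered")
    CAST_REGISTRY[name] = func


def lookup_extension_cast(extension_name):
    """Return the cast registered for an extension name, or None if absent."""
    return CAST_REGISTRY.get(f"cast_extension_{extension_name}")


def ints_to_strings(values):
    """Toy cast used only to exercise the registry."""
    return [None if v is None else str(v) for v in values]


register_cast("cast_extension_ipv4", ints_to_strings)

found = lookup_extension_cast("ipv4")
missing = lookup_extension_cast("tensor")
print(found([1, None]) if found else "no cast registered")
print(missing)
```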

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
That requires implementers of extension types to implement and register their own cast function, which sounds a bit more annoying.

@asfimport
Collaborator Author

Clark Zinzow:
@pitrou I'm working on a tensor column extension type similar to this one and was hoping to allow users to interpret Parquet columns containing bytes blobs (e.g. images) as tensors by having them provide a schema for those columns, where the column's dtype is a tensor array extension type instantiated with the requisite data (shape, dtype, etc.) to cast that column as a tensor array. Since there isn't a static conversion between the bytes blobs and the underlying extension array dtype (both the shape and the underlying element dtype are parameterizable), it'd be nice if an extension type could register a cast function so we could use the shape and dtype context to properly interpret those bytes blobs.
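A minimal sketch of what such a parameterized cast needs, using only the Python standard library: the type instance carries the shape and element format, and the cast uses them to reinterpret each blob. `TensorTypeSketch` and its method are hypothetical names, the element format is a `struct` format character in native byte order, and nested lists stand in for a real tensor array.

```python
import math
import struct


class TensorTypeSketch:
    """Hypothetical tensor extension type parameterized by shape and element format."""

    def __init__(self, shape, fmt):
        self.shape = list(shape)
        self.fmt = fmt  # struct format character, e.g. "i" for a native int

    def cast_from_bytes(self, blobs):
        """Reinterpret each bytes blob as a nested list with this type's shape."""
        expected = math.prod(self.shape) * struct.calcsize(self.fmt)
        out = []
        for blob in blobs:
            if blob is None:
                out.append(None)
                continue
            if len(blob) != expected:
                raise ValueError(
                    f"blob has {len(blob)} bytes, expected {expected}"
                )
            # memoryview.cast reinterprets the buffer without copying it.
            view = memoryview(blob).cast(self.fmt, shape=self.shape)
            out.append(view.tolist())
        return out


tensor_type = TensorTypeSketch(shape=(2, 2), fmt="i")
blobs = [struct.pack("4i", 1, 2, 3, 4), None]
print(tensor_type.cast_from_bytes(blobs))
```

The same blob cast with a different shape or format yields a different result, which is why the conversion cannot be fixed statically for the type name alone.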

@asfimport
Collaborator Author

Clark Zinzow:
Does allowing extension type implementers to register a cast function sound reasonable? I might be able to take a stab at this (just casting) in the coming months.
