[Python] For extension types, compute kernels should default to storage types? #33452

Open
asfimport opened this issue Nov 7, 2022 · 1 comment

Comments

@asfimport
Collaborator

Currently, compute kernels don't recognize extension types. If you define a semantic type to indicate something like "this string column is an image label", you can no longer do things like an equality comparison on that column.

For example, take the LabelType from https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: class LabelType(pa.PyExtensionType):
...:
...:     def __init__(self):
...:         pa.PyExtensionType.__init__(self, pa.string())
...:
...:     def __reduce__(self):
...:         return LabelType, ()
...:

In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), pa.array(['cat', 'dog', 'person']))], names=['label'])

In [5]: tbl.filter(pc.field('label') == 'cat')
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
Cell In [5], line 1
----> 1 tbl.filter(pc.field('label') == 'cat')

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)

For query systems that push some of the compute down to Arrow (e.g., DuckDB), this also makes datasets with extension types much harder to work with, because users cannot tell which functions will actually work.

If the compute kernels instead defaulted to the storage type, the extension system would be a lot easier to work with in Arrow.

Reporter: Chang She / @changhiskhan

Note: This issue was originally created as ARROW-18273. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Miles Granger / @milesgranger:
I think this makes good sense, although I'm not sure about the implementation details. I think many (all?) kernels specify their allowed input types before runtime, but perhaps there is a way to match based on the storage type as well?
cc @jorisvandenbossche
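To make the "match based on storage type" idea concrete, here is a toy model of kernel dispatch in plain Python. It is not how Arrow's function registry is implemented. ToyType, KERNELS and dispatch are all hypothetical names, and the only point is to show an exact-match lookup followed by a second lookup with extension types replaced by their storage types.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyType:
    """Toy stand-in for a data type; `storage` is set only for extension types."""
    name: str
    storage: Optional["ToyType"] = None


STRING = ToyType("string")
LABEL = ToyType("extension<LabelType>", storage=STRING)

# Kernels declare their accepted input types up front, keyed by
# (function name, tuple of input type names).
KERNELS: Dict[Tuple[str, Tuple[str, ...]], Callable] = {
    ("equal", ("string", "string")): lambda a, b: a == b,
}


def dispatch(func: str, types: Tuple[ToyType, ...]) -> Callable:
    """Find a kernel by exact types, then retry with storage types."""
    key = (func, tuple(t.name for t in types))
    if key in KERNELS:
        return KERNELS[key]

    # Fallback: replace each extension type with its storage type.
    unwrapped = tuple(t.storage or t for t in types)
    if unwrapped != tuple(types):
        key = (func, tuple(t.name for t in unwrapped))
        if key in KERNELS:
            return KERNELS[key]

    names = ", ".join(t.name for t in types)
    raise NotImplementedError(
        f"Function '{func}' has no kernel matching input types ({names})"
    )


# The (extension, string) call resolves to the (string, string) kernel.
kernel = dispatch("equal", (LABEL, STRING))
```

A real implementation would also have to decide whether the result should be re-wrapped in the extension type, and whether an extension type can opt out of the fallback. This sketch ignores both questions.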
