Enhancement Request: Custom Operator Support for PyArrow Extension Types in Compute Functions #45208

Open
lllangWV opened this issue Jan 9, 2025 · 2 comments


@lllangWV

lllangWV commented Jan 9, 2025

Enhancement Request: Custom Operator Support for PyArrow Extension Types in Compute Functions

Hello!

I have been using the PyArrow extension capability to define custom types, which is extremely useful for extending Arrow's functionality. However, a significant limitation arises when using these custom types with compute functions.

For example, FixedShapeTensorType, an extension type designed for ndarrays, raises an error when two such arrays are compared with pc.equal:

Example Code

import pyarrow as pa
import pyarrow.compute as pc

tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))

arr_1 = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
storage_1 = pa.array(arr_1, pa.list_(pa.int32(), 4))
tensor_array_1 = pa.ExtensionArray.from_storage(tensor_type, storage_1)

arr_2 = [[1, 3, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
storage_2 = pa.array(arr_2, pa.list_(pa.int32(), 4))
tensor_array_2 = pa.ExtensionArray.from_storage(tensor_type, storage_2)

# This triggers an error
print(pc.equal(tensor_array_1, tensor_array_2))

Error Message

  return func.call(args, None, memory_pool)
  File "pyarrow\\_compute.pyx", line 385, in pyarrow._compute.Function.call
  File "pyarrow\\error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.fixed_shape_tensor[value_type=int32, shape=[2,2]]>, extension<arrow.fixed_shape_tensor[value_type=int32, shape=[2,2]]>)

Proposed Solution

I believe it would be highly useful for PyArrow to allow users to define custom operator support for extension types, similar to how Pandas enables operator support for ExtensionArray.

Suggested Implementation

Here’s an example of the interface:

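import pandas as pd
import pyarrow as pa

# data_utils, mp_utils, and PythonObjectPandasDtype are project-specific
# helpers that are not shown here.


class PythonObjectArrowType(pa.ExtensionType):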
    def __init__(self):
        super().__init__(pa.binary(), "parquetdb.PythonObjectArrow")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return PythonObjectArrowType()

    def __arrow_ext_class__(self):
        return PythonObjectArrowArray

    def to_pandas_dtype(self):
        return PythonObjectPandasDtype()

    def __arrow_ext_scalar_class__(self):
        return PythonObjectArrowScalar


pa.register_extension_type(PythonObjectArrowType())


class PythonObjectArrowScalar(pa.ExtensionScalar):
    def as_py(self):
        return data_utils.load_python_object(self.value.as_py())

    def __eq__(self, other):
        return self.value == other.value


class PythonObjectArrowArray(pa.ExtensionArray):
    def to_pandas(self, **kwargs):
        values = self.storage.to_numpy(zero_copy_only=False)
        results = mp_utils.parallel_apply(data_utils.load_python_object, values)
        return pd.Series(results)

    def to_values(self, **kwargs):
        values = self.storage.to_pandas(**kwargs).values
        results = mp_utils.parallel_apply(data_utils.load_python_object, values)
        return results

In this example, the PythonObjectArrowScalar class defines an __eq__ method, enabling custom equality comparisons for the scalar elements. Similarly, the PythonObjectArrowArray class can provide custom implementations for data conversion and manipulation.

Challenges

While defining __eq__ in the scalar class is straightforward, I am uncertain how this would integrate into compute functions like pc.equal. It may require exposing additional hooks or mechanisms in PyArrow to allow users to register their operator implementations.

Please let me know if additional details or examples are needed.

Best,

Logan Lang

Component(s)

C++, Python

@raulcd
Member

raulcd commented Jan 10, 2025

That's an interesting idea, but we probably have to fix nested types first, which are currently failing too. More details here:

@AlenkaF
Member

AlenkaF commented Jan 10, 2025

I think there are two different things mentioned in this issue. One is compute kernels and their support for extension types. The other is the Python comparison operators for Extension arrays.

I am quite sure that defining an __eq__ method on the Scalar object will not solve the fact that some kernels, equal in this example, are not supported for Extension types in C++.

There is an open issue that covers kernel support for ExtensionTypes, and I think it would be worth moving it forward; see #22304. Also connected to the kernels: #33452.

On the other hand, it would be worth investigating a bit more how Python equality operators could be improved for Extension arrays. Currently, in the tests, we check type equality and the values of the storage separately.

I think the issue connected to this might be: #24348.
