
[Data] ArrowVariableShapedTensorArray with LargeListArray #46434

Closed
vipese-idoven opened this issue Jul 4, 2024 · 9 comments · Fixed by #45352 or #47832
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments


vipese-idoven commented Jul 4, 2024

Description

The current implementation only allows creating ArrowVariableShapedTensorArray objects with at most (2^31)-1 elements, because it uses PyArrow's ListArray in ray.air.util.tensor_extensions.arrow L812, which uses 32-bit signed offsets for indexing. As a result, storing data such as long time series with more elements than 32-bit offsets can address causes an overflow.

Providing the option to replace ListArray with PyArrow's LargeListArray would allow storing arrays with up to (2^63)-1 elements. (Note: this would also require changing the OFFSET_DTYPE in L722.)
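For illustration, here is a minimal sketch (made-up data, not Ray code) of the offset-width difference between the two PyArrow list types:

```python
import numpy as np
import pyarrow as pa

values = pa.array(np.zeros(10, dtype=np.float32))

# ListArray uses int32 offsets, so the flattened values buffer
# is capped at (2^31)-1 elements in total.
small = pa.ListArray.from_arrays(pa.array([0, 4, 10], type=pa.int32()), values)

# LargeListArray uses int64 offsets, raising the cap to (2^63)-1.
large = pa.LargeListArray.from_arrays(pa.array([0, 4, 10], type=pa.int64()), values)

print(small.type)  # list<item: float>
print(large.type)  # large_list<item: float>
```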

Use case

The goal is to be able to store long time series in Arrow format (such as long audio recordings, or audio with high sample rates).

@vipese-idoven vipese-idoven added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 4, 2024
@anyscalesam
Contributor

@vipese-idoven what is your use case for this? Are you looking for batch processing on those audio files?

@vipese-idoven
Author

> @vipese-idoven what is your use case for this? Are you looking for batch processing on those audio files?

I've been using Ray Data for batch processing of signals, which can get very long. I was hoping to store the pre-processed data in Arrow format for later segmentation and classification, to avoid pre-processing again (or doing it on the fly).

@scottjlee
Contributor

There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

@scottjlee scottjlee changed the title [Ray Air] ArrowVariableShapedTensorArray with LargeListArray [Data] ArrowVariableShapedTensorArray with LargeListArray Jul 16, 2024
@scottjlee scottjlee added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 16, 2024
@vipese-idoven
Author

> There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

Awesome! Happy to help if need be

@anyscalesam
Contributor

@vipese-idoven lovely - can you take a look at the PR and failing release test? @terraflops1048576 is the PR author so please connect with him as required.

@vipese-idoven
Author

Will do! @terraflops1048576 any chance we can connect offline and discuss this?

@terraflops1048576
Contributor

Sorry, I haven't kept up with this project. I don't think there's much to discuss here -- though I'm open to answering questions about the codebase as I understand it (but you'll probably get more accurate answers from the Anyscale people here).

The basic problem is that changing ArrowTensorArray to use Arrow LargeLists is relatively trivial -- just changing the type to pyarrow.large_list and a constant here or there. However, this breaks the release tests, because they use data stored in the old format, which is incompatible with the new one.

So a "proper" fix for this has to be the above changes, but however with a way to automatically convert or otherwise parse the old format. one way to do this might be to add in an ArrowLargeTensorArray and use that by default everywhere and write in a bunch of conversion code, but this seems like it'll make the code very messy.

@alexeykudinkin
Contributor

Rebasing onto LargeListType will double the per-element memory overhead, going from int32 to int64, which is going to be problematic, so we can't just blindly change the ArrowTensorArray type.

Instead, I think we should stick with ChunkedArray, breaking the arrays into chunks at the int32 boundary so that we never overflow int32 offsets.
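A rough sketch of that chunking idea, assuming ragged 1-D numpy arrays as input (the function name and the list-of-lists construction are just for illustration):

```python
import numpy as np
import pyarrow as pa

INT32_MAX = 2**31 - 1

def to_chunked_list_array(tensors: list) -> pa.ChunkedArray:
    """Pack ragged arrays into ListArray chunks, starting a new chunk
    before the flattened element count would overflow int32 offsets."""
    chunks, current, n_elements = [], [], 0
    for arr in tensors:
        if current and n_elements + arr.size > INT32_MAX:
            chunks.append(pa.array([a.tolist() for a in current]))
            current, n_elements = [], 0
        current.append(arr)
        n_elements += arr.size
    if current:
        chunks.append(pa.array([a.tolist() for a in current]))
    return pa.chunked_array(chunks)
```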

@terraflops1048576
Contributor

I disagree that this overhead is going to be problematic, and I think the solution is really as simple as changing the type to int64, but the problem is that it breaks all currently serialized ArrowTensorArrays.

Let's consider the two cases before us: ArrowTensorArray and ArrowVariableShapedTensorArray. For ArrowTensorArray, the current overhead per tensor is 4 bytes, and the proposed change makes it 8 bytes. The only pathological situation in which this would be problematic is a user storing numerous tensors of 10 floats or fewer, which is far from the most common use case. Anyone who tried to turn those tensors into numpy arrays would incur an order of magnitude more overhead just from the metadata attached to numpy arrays. Furthermore, if this were an actual issue, the underlying storage type should have been changed to PyArrow's FixedSizeListArray, which is as simple as changing this line in the constructor of ArrowTensorType:

```python
super().__init__(pa.list_(dtype), "ray.data.arrow_tensor")
```

to:

```python
super().__init__(pa.list_(dtype, np.prod(shape)), "ray.data.arrow_tensor")
```

This reduces the overhead to 8 bytes per column.

For the ArrowVariableShapedTensorArray, Ray already uses two 4-byte ints to represent the shape of each tensor, along with the 4-byte offset. This means that the corresponding increase in overhead is only 33%, and honestly, if that were truly an issue, one might even say that we should represent the shape with 2-byte ints and restrict each tensor to be less than 2^32 elements, which it already has to be!

However, all of this is a moot point. Most tensors are far bigger than the 4 or 8 or 12 or 16 bytes of overhead, and increasing complexity to save at most 1% of space is not worth it at all. There's even an argument for getting rid of ArrowTensorArray entirely and just using ArrowVariableShapedTensorArray.
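Back-of-the-envelope numbers supporting that claim, using an assumed one-second float32 audio clip at 16 kHz:

```python
payload = 16_000 * 4          # 64,000 bytes of tensor data per clip
old_overhead = 4 + 2 * 4      # int32 offset + two int32 shape ints = 12 bytes
new_overhead = 8 + 2 * 4      # int64 offset + two int32 shape ints = 16 bytes
print((new_overhead - old_overhead) / payload)  # 6.25e-05, i.e. ~0.006% extra
```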
