
[Data] ArrowVariableShapedTensorArray with LargeListArray #46434

Closed
vipese-idoven opened this issue Jul 4, 2024 · 9 comments · Fixed by #45352 or #47832
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments


vipese-idoven commented Jul 4, 2024

Description

The current implementation only allows creating ArrowVariableShapedTensorArray objects with at most (2^31)-1 elements, because it uses PyArrow's ListArray in ray.air.util.tensor_extensions.arrow L812, which uses 32-bit signed offsets for indexing. As a result, storing data such as long time series with more elements than 32-bit offsets can address causes an overflow.

Providing the option to replace ListArray with PyArrow's LargeListArray would allow storing arrays with up to (2^63)-1 elements. (Note: this would also require changing the OFFSET_DTYPE in L722.)
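For illustration, here is a minimal sketch (made-up data, not Ray code) of the offset-width difference between the two PyArrow list types:

```python
import numpy as np
import pyarrow as pa

values = pa.array(np.zeros(10, dtype=np.float32))

# ListArray uses int32 offsets, so the flattened values buffer
# is capped at (2^31)-1 elements in total.
small = pa.ListArray.from_arrays(pa.array([0, 4, 10], type=pa.int32()), values)

# LargeListArray uses int64 offsets, raising the cap to (2^63)-1.
large = pa.LargeListArray.from_arrays(pa.array([0, 4, 10], type=pa.int64()), values)

print(small.type)  # list<item: float>
print(large.type)  # large_list<item: float>
```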

Use case

The goal is to be able to store long time series in Arrow format (such as long audio recordings, or audio with high sample rates).

@vipese-idoven vipese-idoven added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 4, 2024
@anyscalesam
Contributor

@vipese-idoven what is your use case for this? Are you looking for batch processing on those audio files?

@vipese-idoven
Author

> @vipese-idoven what is your use case for this? Are you looking for batch processing on those audio files?

I've been using Ray Data for batch processing of signals, which can get very long. I was hoping to store the pre-processed data in Arrow format for later segmentation and classification, to avoid pre-processing again (or doing it on the fly).

@scottjlee
Contributor

There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

@scottjlee scottjlee changed the title [Ray Air] ArrowVariableShapedTensorArray with LargeListArray [Data] ArrowVariableShapedTensorArray with LargeListArray Jul 16, 2024
@scottjlee scottjlee added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 16, 2024
@vipese-idoven
Author

> There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

Awesome! Happy to help if need be

@anyscalesam
Contributor

@vipese-idoven lovely - can you take a look at the PR and failing release test? @terraflops1048576 is the PR author so please connect with him as required.

@vipese-idoven
Author

Will do! @terraflops1048576 any chance we can connect offline and discuss this?

@terraflops1048576
Contributor

Sorry, I haven't kept up with this project. I don't think there's much to discuss here -- though I'm open to answering questions about the codebase as I understand it (but you'll probably get more accurate answers from the Anyscale people here).

The basic problem is that changing ArrowTensorArray to use Arrow LargeLists is relatively trivial -- just changing the type to pyarrow.large_list and a constant here or there. However, this breaks the release tests, because they use data stored in the old format, which is incompatible with the new one.

So a "proper" fix for this has to be the above changes, but however with a way to automatically convert or otherwise parse the old format. one way to do this might be to add in an ArrowLargeTensorArray and use that by default everywhere and write in a bunch of conversion code, but this seems like it'll make the code very messy.

@alexeykudinkin
Contributor

Rebasing onto LargeListType will double the per-element memory overhead, going from int32 to int64, which is going to be problematic, so we can't just blindly change the ArrowTensorArray type.

Instead, I think we should stick with ChunkedArray, breaking the arrays into chunks at the int32 boundary so that we never overflow int32 offsets.
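A rough sketch of that chunking idea, assuming ragged 1-D numpy arrays as input (the function name and the list-of-lists construction are just for illustration):

```python
import numpy as np
import pyarrow as pa

INT32_MAX = 2**31 - 1

def to_chunked_list_array(tensors: list) -> pa.ChunkedArray:
    """Pack ragged arrays into ListArray chunks, starting a new chunk
    before the flattened element count would overflow int32 offsets."""
    chunks, current, n_elements = [], [], 0
    for arr in tensors:
        if current and n_elements + arr.size > INT32_MAX:
            chunks.append(pa.array([a.tolist() for a in current]))
            current, n_elements = [], 0
        current.append(arr)
        n_elements += arr.size
    if current:
        chunks.append(pa.array([a.tolist() for a in current]))
    return pa.chunked_array(chunks)
```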

@terraflops1048576
Contributor

I disagree that this overhead is going to be problematic, and I think the solution is really as simple as changing the type to int64, but the problem is that it breaks all currently serialized ArrowTensorArrays.

Let's consider the two cases before us: ArrowTensorArray and ArrowVariableShapedTensorArray. For ArrowTensorArray, the current overhead per tensor is 4 bytes, and the proposed change makes it 8 bytes. The only pathological situation in which this would be problematic is a user storing numerous tensors of 10 floats or fewer, which is far from the most common use case. Anyone who tried to turn those tensors into numpy arrays would incur an order of magnitude more overhead just from the metadata attached to numpy arrays. Furthermore, if this were an actual issue, the underlying storage type should have been changed to PyArrow's FixedSizeListArray, which is as simple as changing this line in the constructor of ArrowTensorType:

```python
super().__init__(pa.list_(dtype), "ray.data.arrow_tensor")
```

to:

```python
super().__init__(pa.list_(dtype, np.prod(shape)), "ray.data.arrow_tensor")
```

This reduces the overhead to 8 bytes per column.

For the ArrowVariableShapedTensorArray, Ray already uses two 4-byte ints to represent the shape of each tensor, along with the 4-byte offset. This means that the corresponding increase in overhead is only 33%, and honestly, if that were truly an issue, one might even say that we should represent the shape with 2-byte ints and restrict each tensor to be less than 2^32 elements, which it already has to be!

However, all of this is a moot point. Most tensors are far bigger than the 4 or 8 or 12 or 16 bytes of overhead, and increasing complexity to save at most 1% of space is not worth it at all. There's even an argument for getting rid of ArrowTensorArray entirely and just using ArrowVariableShapedTensorArray.
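Back-of-the-envelope numbers supporting that claim, using an assumed one-second float32 audio clip at 16 kHz:

```python
payload = 16_000 * 4          # 64,000 bytes of tensor data per clip
old_overhead = 4 + 2 * 4      # int32 offset + two int32 shape ints = 12 bytes
new_overhead = 8 + 2 * 4      # int64 offset + two int32 shape ints = 16 bytes
print((new_overhead - old_overhead) / payload)  # 6.25e-05, i.e. ~0.006% extra
```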
