[Data] ArrowVariableShapedTensorArray with LargeListArray #46434
Comments
@vipese-idoven what is your use case for this? Are you looking for batch processing on those audio files?
I've used Ray Data for batch processing, which can produce very long signals. I was hoping to store the pre-processed data in Arrow format for later segmentation and classification, to avoid pre-processing again (or doing it on the fly).
There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.
Awesome! Happy to help if need be
@vipese-idoven lovely - can you take a look at the PR and failing release test? @terraflops1048576 is the PR author so please connect with him as required.
Will do! @terraflops1048576 any chance we can connect offline and discuss this?
Sorry, I haven't kept up with this project. I don't think there's much to discuss here -- though I'm open to answering questions about the codebase as I understand it (but you will probably get more accurate answers from the Anyscale people here). The basic problem is that changing the ArrowTensorArray to use Arrow LargeLists is a relatively trivial matter -- just changing the type to pyarrow.large_list and changing a constant here or there. However, this breaks the release tests because they use data stored in the old format, which is incompatible with the new format. So a "proper" fix has to include the above changes, along with a way to automatically convert or otherwise parse the old format. One way to do this might be to add an ArrowLargeTensorArray, use it by default everywhere, and write a bunch of conversion code, but this seems like it'll make the code very messy.
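A minimal sketch of what such read-time conversion could look like, using plain PyArrow rather than Ray's extension classes; the helper name is hypothetical, and it assumes the installed pyarrow version supports casting between the 32-bit and 64-bit list layouts (recent versions do):

```python
import pyarrow as pa

def upcast_list_column(col: pa.Array) -> pa.Array:
    """Hypothetical helper: upcast an int32-offset list column (old format)
    to an int64-offset large_list column (new format) when reading back
    previously serialized data."""
    if pa.types.is_list(col.type):
        return col.cast(pa.large_list(col.type.value_type))
    return col

old = pa.array([[1.0, 2.0], [3.0]], type=pa.list_(pa.float64()))
new = upcast_list_column(old)
print(new.type)  # large_list<item: double>
```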
Rebasing onto [...]. Instead, I think we'd just stick with [...]
I disagree that this overhead is going to be problematic, and I think the solution is really as simple as changing the type to int64, but the problem is that it breaks all currently serialized ArrowTensorArrays. Let's consider the two cases before us, the ArrowTensorArray and the ArrowVariableShapedTensorArray. For the ArrowTensorArray, the current overhead per tensor is 4 bytes, and the proposed change would make it 8 bytes. The only pathological situation in which this would be problematic is if the user were storing numerous tensors of 10 floats or fewer, which is by far not the most common use case. Anyone who tried to turn these tensors into numpy arrays would experience an order of magnitude more overhead, just from the metadata attached to numpy arrays. Furthermore, if this were an actual issue, then the underlying storage type should have been changed to PyArrow's FixedSizeListArray, which is as simple as switching the list type used in the constructor of ArrowTensorType from a variable-length list to a fixed-size list (see the sketch after this comment).
This reduces the overhead to 8 bytes per column. For the ArrowVariableShapedTensorArray, Ray already uses two 4-byte ints to represent the shape of each tensor, along with the 4-byte offset. This means that the corresponding increase in overhead is only 33%, and honestly, if that were truly an issue, one might even say that we should represent the shape with 2-byte ints and restrict each tensor to be less than 2^32 elements, which it already has to be! However, all of this is a moot point. Most tensors are far bigger than the 4 or 8 or 12 or 16 bytes of overhead, and increasing complexity to save at most 1% of space is not worth it at all. There's even an argument to be made to simply get rid of ArrowTensorArray entirely and just use the ArrowVariableShapedTensorArray.
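For illustration, a minimal sketch of the fixed-size-list idea in plain PyArrow; the shape and dtype are arbitrary examples, and this is not the actual ArrowTensorType constructor:

```python
import numpy as np
import pyarrow as pa

# A variable-length list stores one offset per row (4 bytes with int32 offsets,
# 8 bytes with int64); a FixedSizeListType stores no per-row offsets at all,
# since every row holds the same number of elements.
shape = (128, 128)           # example tensor shape
dtype = pa.float32()

variable_length = pa.list_(dtype)                   # one int32 offset per tensor
fixed_size = pa.list_(dtype, int(np.prod(shape)))   # FixedSizeListType, no per-tensor offsets
```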
Description
The current implementation only allows creating ArrowVariableShapedTensorArray objects with a maximum of (2^31)-1 elements, because it uses PyArrow's ListArray in ray.air.util.tensor_extensions.arrow (L812), which uses 32-bit integers for offset indexing. Thus, storing data such as long time-series that contain more elements than 32-bit offsets can index causes an overflow.
Providing the possibility to replace ListArray with PyArrow's LargeListArray would allow storing arrays with up to (2^63)-1 elements. (Note: this would also require changing the OFFSET_DTYPE in L722.)
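A minimal sketch of the difference between the two layouts in plain PyArrow (the values and offsets are arbitrary examples, not Ray code):

```python
import numpy as np
import pyarrow as pa

# ListArray uses 32-bit offsets, so one array can index at most (2^31)-1
# flattened child elements; LargeListArray uses 64-bit offsets instead.
values = pa.array(np.arange(10, dtype=np.float32))

offsets32 = pa.array([0, 4, 10], type=pa.int32())
offsets64 = pa.array([0, 4, 10], type=pa.int64())

small = pa.ListArray.from_arrays(offsets32, values)       # limited to (2^31)-1 elements
large = pa.LargeListArray.from_arrays(offsets64, values)  # headroom up to (2^63)-1 elements

print(small.type)  # list<item: float>
print(large.type)  # large_list<item: float>
```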
Use case
The goal is to be able to store long time-series in Arrow format (such as long audio recordings, or audio with high sampling frequencies).