[Data] Change offsets to int64 and change to LargeList for ArrowTensorArray #45352
Hi @terraflops1048576, thanks for the fix! Can you also add a unit test?
…bleShaped)TensorArray Signed-off-by: Peter Wang <[email protected]>
I wasn't sure exactly what unit test I should add. This PR is meant to change the list type so that its behavior still matches the existing unit tests but also works for large numbers of tensor elements. I added a unit test that checks that it actually works with a large number of tensors.
thanks for the fix! merging in master since it's been a few weeks, to ensure tests still pass
Nice fix, thank you!
I think some of the test failures are unrelated; they seem to have something to do with rllib. I see that one of the tests checks the size of the array. By my calculation, the array should use exactly 768 bytes to represent the tensors: 6 * 15 * 8 = 720 bytes for the values, plus one 64-bit offset value per tensor. The rest is in the shape serialization, I believe. I updated the test to calculate the size more accurately, and hopefully it's within tolerance.
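The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and reading 6 * 15 * 8 as 6 tensors of 15 values at 8 bytes each is an assumption inferred from the numbers in the comment, not something taken from the test itself.

```python
# Assumed layout (inferred from the comment, not from the test code):
# 6 tensors, each holding 15 values of 8 bytes.
NUM_TENSORS = 6
VALUES_PER_TENSOR = 15
BYTES_PER_VALUE = 8
BYTES_PER_OFFSET = 8  # one 64-bit offset per tensor, as stated in the comment

# Bytes used by the tensor values themselves.
values_bytes = NUM_TENSORS * VALUES_PER_TENSOR * BYTES_PER_VALUE

# Bytes used by the 64-bit list offsets.
offsets_bytes = NUM_TENSORS * BYTES_PER_OFFSET

total_bytes = values_bytes + offsets_bytes
print(values_bytes, offsets_bytes, total_bytes)  # 720 48 768
```

With 32-bit offsets the same calculation would give 720 + 24 = 744 bytes, which is presumably why the size check in the test needed updating.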
Closes #46434
Why are these changes needed?
Currently, the ArrowTensorArray and ArrowVariableShapedTensorArray types use int32 to encode the list offsets. This means that within a given PyArrow chunk in a column, the sum of the sizes of all of the tensors in that chunk must be less than 2^31; otherwise, depending on the overflow conditions, an error is thrown or the data is truncated. This usually doesn't manifest itself in Ray Data with the default settings, because it splits the blocks up to meet the target max block size (though this can be turned off!). However, it unavoidably shows up when one needs a large local shuffle buffer to feed into Ray Train.
This PR changes the offsets to be stored in 64-bit integers and updates the corresponding storage types of the TensorArrays.
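The overflow condition described above can be illustrated without Ray or PyArrow. The following is a minimal sketch using only the standard library: `list_offsets` and `fits` are hypothetical helpers, and the tensor sizes are made up for illustration. It mimics how a list layout stores cumulative offsets and shows that they stop fitting in a signed 32-bit integer once the running total reaches 2^31, while 64-bit offsets still hold them.

```python
import struct

def list_offsets(tensor_sizes):
    """Return cumulative offsets, one boundary per tensor plus a leading 0."""
    offsets = [0]
    for size in tensor_sizes:
        offsets.append(offsets[-1] + size)
    return offsets

def fits(offsets, fmt):
    """Return True if every offset can be packed with the struct format `fmt`."""
    try:
        struct.pack(f"<{len(offsets)}{fmt}", *offsets)
        return True
    except struct.error:
        # struct refuses values outside the range of the target integer type.
        return False

# Hypothetical chunk: 3 tensors of 2**30 elements each (sizes only, no data).
sizes = [2**30] * 3
offsets = list_offsets(sizes)

fits_int32 = fits(offsets, "i")  # signed 32-bit offsets
fits_int64 = fits(offsets, "q")  # signed 64-bit offsets
print(offsets[-1], fits_int32, fits_int64)
```

Here `struct` raises an error instead of wrapping around; the description above notes that in the real arrays the outcome depends on the overflow conditions and may be an error or truncated data.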
As an example:
fails with
Checks
… git commit -s) in this PR.
… scripts/format.sh to lint the changes in this PR.
… method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.