-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add ffi_enforce_no_offset
feature as a workaround hack to unavailable offset support in Arrow Java
#5958
Conversation
…ilable offset support in Arrow Java
a8e1f57
to
9f85667
Compare
cc @andygrove @alamb |
I'm not sure about this, we'd be committing to supporting a workaround for arrow Java for at least 3 months (next breaking release). It feels like we should just fix this in arrow Java, no? |
The change for Arrow Java is considered to be big one (it doesn't support Instead of keeping using a forked branch of arrow-rs and DataFusion just for the issue, after discussed with @andygrove, we'd like to have this workaround in arrow-rs hopefully, and guard it with a feature with clear comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, we'd be committing to supporting a workaround for arrow Java for at least 3 months (next breaking release). It feels like we should just fix this in arrow Java, no?
I believe the rationale for putting this into arrow-rs is that this issue is blocking the first release of DataFusion comet.
The only alternative I can think of is for comet to rely on a patched version of arrow-rs.
I agree that this is a hack and that the real fix is in Arrow java -- this is tracked by https://github.com/apache/arrow/issues/42156 and it seems there is agreement upstream that it should be fixed in Arrow. However, even once it is fixed in Arrow java it will take some non trivial time to make it into Spark.
I think this is one of those times when the workaround is non ideal, but I think it is worth it for compatibility and help the community along.
Maybe we could file a ticket to track removing the flag once the arrow java fix has propagates sufficiently in the ecosystem
# of data buffer, we assume the first element of the offset buffer is always 0 | ||
# (this is how Arrow Java and arrow-rs do). This is a hacky solution and | ||
# should be removed once Arrow Java supports `offset` field in the FFI output. | ||
ffi_enforce_no_offset = ["arrow-data/ffi_enforce_no_offset"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also potentially call this flag something like workaround_java_bug
or something that makes it clearer when it would be used
However, given this documentation I think this name is ok too
@@ -450,7 +459,16 @@ impl<'a> ImportedArrowArray<'a> { | |||
let start = (unsafe { *offset_buffer.add(0) }) as usize; | |||
// get last offset | |||
let end = (unsafe { *offset_buffer.add(len / size_of::<i64>() - 1) }) as usize; | |||
end - start | |||
|
|||
if cfg!(feature = "ffi_enforce_no_offset") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if cfg!(feature = "ffi_enforce_no_offset") { | |
// Workaround bug in Arrow Java | |
// https://github.com/apache/arrow/issues/42156 | |
if cfg!(feature = "ffi_enforce_no_offset") { |
@@ -152,7 +152,9 @@ impl FFI_ArrowArray { | |||
_ => None, | |||
}; | |||
|
|||
let offset = if let Some(offset) = offset_offset { | |||
let offset = if cfg!(feature = "ffi_enforce_no_offset") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let offset = if cfg!(feature = "ffi_enforce_no_offset") { | |
// Workaround bug in Arrow Java | |
// https://github.com/apache/arrow/issues/42156 | |
let offset = if cfg!(feature = "ffi_enforce_no_offset") { |
I'm a little confused by this, since we moved away from ArrayData we no longer store offsets in arrays other than BooleanArray https://github.com/apache/arrow-rs/blob/master/arrow-array%2Fsrc%2Farray%2Fbyte_array.rs#L456 The only way to end up with an offset is manually slicing ArrayData directly. If you convert back to an array using I wonder if comet could simply consistently use Arrays to avoid this class of issue, or call make_array at FFI boundaries? TBC I'm not against hacky workarounds on principle but I wonder if there are less hacky ways to unblock comet |
Hm? Comet never uses ArrayData. Come only uses Arrays. The slicing is happened on arrays (e.g., string array) and of course it is materialized the offset to underlying buffer (offset buffer for string array). Such sliced offset buffer, when exposed through C Data interface, will make the consumer to calculate a incorrect length of data buffer (fixed in apache/arrow#5895). After apache/arrow#5895, the offset of offset buffer is represented in However, it is not enough when working with (current) Arrow Java. Arrow Java doesn't support this This workaround is proposed for that. Before a complete support of offset in Arrow Java, this helps Comet pass sliced arrays through arrow-rs -> Arrow Java -> arrow-rs... |
I've filed #5964 which I think fixes the actual issue at play here., namely a regression caused by #5741. I suspect there is still an issue in arrow-java, in that the way it calculates the length of the data buffer is simply wrong, and was incorrectly changed to match in arrow-rs in #5741. However, I think we should fix the regression caused by #5741 and then open further discussions about any remaining compatibility issues with arrow-java and how we resolve them |
Closing this as I made incorrect change to arrow-rs to match Arrow Java. apache/arrow#5964 should fix the issue. |
* fix: Fix range out of index error by using custom arrow-rs repo * Add custom Java Arrow classes * Add a hack * Update * Update * Update to use apache/arrow-rs#5958 * Use tustvold's branch * Use official arrow-rs repo
…#584) * fix: Fix range out of index error by using custom arrow-rs repo * Add custom Java Arrow classes * Add a hack * Update * Update * Update to use apache/arrow-rs#5958 * Use tustvold's branch * Use official arrow-rs repo
Which issue does this PR close?
Closes apache/arrow#5959.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?