Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ffi_enforce_no_offset feature as a workaround hack to unavailable offset support in Arrow Java #5958

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .github/workflows/arrow.yml
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,8 @@ jobs:
run: cargo test -p arrow
- name: Test arrow with all features except pyarrow
run: cargo test -p arrow --features=force_validate,prettyprint,ipc_compression,ffi,chrono-tz
- name: Test arrow with ffi_enforce_no_offset
run: cargo test -p arrow --features=force_validate,prettyprint,ipc_compression,ffi,ffi_enforce_no_offset,chrono-tz
- name: Run examples
run: |
# Test arrow examples
Expand Down Expand Up @@ -129,6 +131,8 @@ jobs:
run: cargo check -p arrow --no-default-features --all-targets --features test_utils
- name: Check compilation --no-default-features --all-targets --features ffi
run: cargo check -p arrow --no-default-features --all-targets --features ffi
- name: Check compilation --no-default-features --all-targets --features ffi,ffi_enforce_no_offset
run: cargo check -p arrow --no-default-features --all-targets --features ffi,ffi_enforce_no_offset
- name: Check compilation --no-default-features --all-targets --features chrono-tz
run: cargo check -p arrow --no-default-features --all-targets --features chrono-tz

Expand Down
14 changes: 14 additions & 0 deletions arrow-array/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,20 @@ hashbrown = { version = "0.14", default-features = false }
[features]
ffi = ["arrow-schema/ffi", "arrow-data/ffi"]

# HACKY: For offset-ed offset buffer of variable-sized binary array
# The valid FFI output is to record the offset buffer's offset into FFI's
# `offset` field and expose the beginning of the buffer as the pointer.
# So the consumer can calculate correctly the length of data buffer.
# However, due to Arrow Java's issue: https://github.com/apache/arrow/issues/42156,
# Arrow Java does not support `offset` field in the FFI output. It causes
# no sliced binary array can be passed between arrow-rs and Arrow Java.
# To workaround this issue, we have to enforce offset-ed offset buffer to
# be exposed at the sliced buffer's beginning. When calculating the length
# of data buffer, we assume the first element of the offset buffer is always 0
# (this is how Arrow Java and arrow-rs do). This is a hacky solution and
# should be removed once Arrow Java supports `offset` field in the FFI output.
ffi_enforce_no_offset = ["arrow-data/ffi_enforce_no_offset"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also potentially call this flag something like workaround_java_bug or something that makes it clearer when it would be used

However, given this documentation I think this name is ok too


[dev-dependencies]
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
criterion = { version = "0.5", default-features = false }
Expand Down
22 changes: 20 additions & 2 deletions arrow-array/src/ffi.rs
Original file line number Diff line number Diff line change
Expand Up @@ -436,7 +436,16 @@ impl<'a> ImportedArrowArray<'a> {
let start = (unsafe { *offset_buffer.add(0) }) as usize;
// get last offset
let end = (unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize;
end - start

if cfg!(feature = "ffi_enforce_no_offset") {
if end == start {
0
} else {
end
}
} else {
end - start
}
}
(DataType::LargeUtf8, 2) | (DataType::LargeBinary, 2) => {
// the len of the data buffer (buffer 2) equals the difference between the last value
Expand All @@ -450,7 +459,16 @@ impl<'a> ImportedArrowArray<'a> {
let start = (unsafe { *offset_buffer.add(0) }) as usize;
// get last offset
let end = (unsafe { *offset_buffer.add(len / size_of::<i64>() - 1) }) as usize;
end - start

if cfg!(feature = "ffi_enforce_no_offset") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if cfg!(feature = "ffi_enforce_no_offset") {
// Workaround bug in Arrow Java
// https://github.com/apache/arrow/issues/42156
if cfg!(feature = "ffi_enforce_no_offset") {

if end == start {
0
} else {
end
}
} else {
end - start
}
}
// buffer len of primitive types
_ => {
Expand Down
14 changes: 14 additions & 0 deletions arrow-data/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,20 @@ force_validate = []
# Enable ffi support
ffi = ["arrow-schema/ffi"]

# HACKY: For offset-ed offset buffer of variable-sized binary array
# The valid FFI output is to record the offset buffer's offset into FFI's
# `offset` field and expose the beginning of the buffer as the pointer.
# So the consumer can calculate correctly the length of data buffer.
# However, due to Arrow Java's issue: https://github.com/apache/arrow/issues/42156,
# Arrow Java does not support `offset` field in the FFI output. It causes
# no sliced binary array can be passed between arrow-rs and Arrow Java.
# To workaround this issue, we have to enforce offset-ed offset buffer to
# be exposed at the sliced buffer's beginning. When calculating the length
# of data buffer, we assume the first element of the offset buffer is always 0
# (this is how Arrow Java and arrow-rs do). This is a hacky solution and
# should be removed once Arrow Java supports `offset` field in the FFI output.
ffi_enforce_no_offset = []

[package.metadata.docs.rs]
features = ["ffi"]

Expand Down
6 changes: 4 additions & 2 deletions arrow-data/src/ffi.rs
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,9 @@ impl FFI_ArrowArray {
_ => None,
};

let offset = if let Some(offset) = offset_offset {
let offset = if cfg!(feature = "ffi_enforce_no_offset") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let offset = if cfg!(feature = "ffi_enforce_no_offset") {
// Workaround bug in Arrow Java
// https://github.com/apache/arrow/issues/42156
let offset = if cfg!(feature = "ffi_enforce_no_offset") {

data.offset()
} else if let Some(offset) = offset_offset {
if data.offset() != 0 {
// TODO: Adjust for data offset
panic!("The ArrayData of a slice offset buffer should not have offset");
Expand Down Expand Up @@ -184,7 +186,7 @@ impl FFI_ArrowArray {
| DataType::Binary
| DataType::LargeBinary,
1,
) => {
) if !cfg!(feature = "ffi_enforce_no_offset") => {
// For offset buffer, take original pointer without offset.
// Buffer offset should be handled by `FFI_ArrowArray` offset field.
Some(b.data_ptr().as_ptr() as *const c_void)
Expand Down
14 changes: 14 additions & 0 deletions arrow/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,20 @@ force_validate = ["arrow-data/force_validate"]
ffi = ["arrow-schema/ffi", "arrow-data/ffi", "arrow-array/ffi"]
chrono-tz = ["arrow-array/chrono-tz"]

# HACKY: For offset-ed offset buffer of variable-sized binary array
# The valid FFI output is to record the offset buffer's offset into FFI's
# `offset` field and expose the beginning of the buffer as the pointer.
# So the consumer can calculate correctly the length of data buffer.
# However, due to Arrow Java's issue: https://github.com/apache/arrow/issues/42156,
# Arrow Java does not support `offset` field in the FFI output. It causes
# no sliced binary array can be passed between arrow-rs and Arrow Java.
# To workaround this issue, we have to enforce offset-ed offset buffer to
# be exposed at the sliced buffer's beginning. When calculating the length
# of data buffer, we assume the first element of the offset buffer is always 0
# (this is how Arrow Java and arrow-rs do). This is a hacky solution and
# should be removed once Arrow Java supports `offset` field in the FFI output.
ffi_enforce_no_offset = ["arrow-data/ffi_enforce_no_offset", "arrow-array/ffi_enforce_no_offset"]

[dev-dependencies]
chrono = { workspace = true }
criterion = { version = "0.5", default-features = false }
Expand Down
Loading