Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPC format support for StringViewArray and BinaryViewArray #5525

Merged
merged 11 commits into from
Apr 1, 2024

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Mar 17, 2024

Which issue does this PR close?

Part of #5506.

Rationale for this change

Added necessary changes to handle the BinaryView and Utf8View in IPC reader/writer.

What changes are included in this PR?

The changes are slightly larger than expected because the BinaryView and Utf8View has variadicBufferCounts which no other types had before.

Currently implementation ignores the offset of Binary/Utf8View Array's offsets, meaning that the entire buffers will be serialize to the IPC buffer. This might write more data than necessary. Slicing and writing a view array to IPC buffer is non-trivial and was left as future work.

In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some arrow data and commit to the testing repository and then add more tests to the arrow-rs/arrow-integration-testing/tests /ipc_reader.rs?

Are there any user-facing changes?

I just learned the arrow memory layout today, so I expect quite a lot of corner cases I didn't handle, please feel free to comment as you see anything non-intuitive :-)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 17, 2024
arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao -- this is looking great

I left a bunch of minor comments -- but the only thing that is missing from making this mergable in my opinion are some round trip tests (to ensure that we can write these arrays to an IPC file/stream and then read them back and get the same result)

I think we can use the same roundtrip pattern here:

#[test]
fn test_roundtrip_stream_run_array_sliced() {
let run_array_1: Int32RunArray = vec!["a", "a", "a", "b", "b", "c", "c", "c"]
.into_iter()
.collect();
let run_array_1_sliced = run_array_1.slice(2, 5);
let run_array_2_inupt = vec![Some(1_i32), None, None, Some(2), Some(2)];
let mut run_array_2_builder = PrimitiveRunBuilder::<Int16Type, Int32Type>::new();
run_array_2_builder.extend(run_array_2_inupt);
let run_array_2 = run_array_2_builder.finish();
let schema = Arc::new(Schema::new(vec![
Field::new(
"run_array_1_sliced",
run_array_1_sliced.data_type().clone(),
false,
),
Field::new("run_array_2", run_array_2.data_type().clone(), false),
]));
let input_batch = RecordBatch::try_new(
schema,
vec![Arc::new(run_array_1_sliced.clone()), Arc::new(run_array_2)],
)
.unwrap();
let output_batch = roundtrip_ipc_stream(&input_batch);
// As partial comparison not yet supported for run arrays, the sliced run array
// has to be unsliced before comparing with the output. the second run array
// can be compared as such.
assert_eq!(input_batch.column(1), output_batch.column(1));
let run_array_1_unsliced = unslice_run_array(run_array_1_sliced.into_data()).unwrap();

The cases to cover are:

  1. Basic BinaryView / Utf8View
  2. Sliced BinaryView / Utf8View
  3. Nested BinaryView/Utf8View in Dictionary/Struct/List (to cover the code in set_variadic_buffer_counts)

Currently implementation ignores the offset of Binary/Utf8View Array's offsets, meaning that the entire buffers will be serialize to the IPC buffer. This might write more data than necessary. Slicing and writing a view array to IPC buffer is non-trivial and was left as future work.

I think the IPC serializer should just serialize the raw arrays as given and not try to optimize anything. If users wants to "compact" the arrays prior to sending them over IPC I think it should be an explicit choice and they can do it via the gc API suggested in #5513

In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some arrow data and commit to the testing repository and then add more tests to the arrow-rs/arrow-integration-testing/tests /ipc_reader.rs?

Maybe @bkietz knows if we have added StringViewArrays to the integration test suite already. I did not see any commits in https://github.com/apache/arrow-testing/commits/master that have such files.

If we don't have such files, I think we should add them / work with the other language teams to add them for compatibility as a follow on task. I can file tickets to track this

Thank you @ariesdevil and @viirya for the revies

arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
arrow-ipc/src/reader.rs Show resolved Hide resolved
arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
arrow-ipc/src/reader.rs Show resolved Hide resolved
arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
@bkietz
Copy link
Member

bkietz commented Mar 19, 2024

@alamb
We have added Utf8View to archery integration testing here. C++ <-> Go passes through both IPC and cABI (== arrow-rs::ffi). A PR to remove the skip on rust should add arrow-rs to the party.

arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
// The spec is not clear on whether the view/null buffer should be included in the variadic buffer count.
// But from C++ impl https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/ipc/writer.cc#L477
// we know they are not included.
counts.push(array.to_data().buffers().len() as i64 - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
counts.push(array.to_data().buffers().len() as i64 - 1);
counts.push(array.data_buffers().len() as i64);

Copy link
Contributor Author

@XiangpengHao XiangpengHao Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_buffers() is only available when the array is casted down to GenericByteViewArray

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this pattern, it must be a GenericByteViewArray, so using data_buffers() here is right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but it's quite verbose to first cast each type in to BinaryView or Utf8View and then call data_buffers()

arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

Had a brief look and I like where this is headed. I left some comments, but other than those already suggested by others, I wonder if we could integrate the variadicBuffer collection into the existing logic to traverse the nested types. This would be quicker, simpler and probably easier to maintain

arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
XiangpengHao and others added 2 commits March 20, 2024 13:27
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
@XiangpengHao XiangpengHao marked this pull request as draft March 20, 2024 20:29
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @XiangpengHao

What is left before we can merge this PR?

It seems like the two remaining items are:

  1. Roundtrip tests (just in this repo, as described in IPC format support for StringViewArray and BinaryViewArray #5525 (review))
  2. Maybe a follow up ticket to track adding archery testing (e.g. IPC format support for StringViewArray and BinaryViewArray #5525 (comment))

Is there anything else?

@XiangpengHao
Copy link
Contributor Author

XiangpengHao commented Mar 22, 2024

Hi @XiangpengHao

What is left before we can merge this PR?

It seems like the two remaining items are:

  1. Roundtrip tests (just in this repo, as described in IPC format support for StringViewArray and BinaryViewArray #5525 (review))
  2. Maybe a follow up ticket to track adding archery testing (e.g. IPC format support for StringViewArray and BinaryViewArray #5525 (comment))

Is there anything else?

I think those are the two major todos. Sorry I got quite busy these days, will try to address them in a few days.

@alamb
Copy link
Contributor

alamb commented Mar 22, 2024

I think those are the two major todos. Sorry I got quite busy these days, will try to address them in a few days.

No worries!

I think we could merge this PR with just the first (round trip tests) and then do the integration test in a follow on PR

arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
@XiangpengHao XiangpengHao marked this pull request as ready for review March 27, 2024 16:41
@XiangpengHao
Copy link
Contributor Author

Finally get back to this! I checked in the roundtrip tests and fixed bugs related to dictionary encodings. Can you @tustvold @alamb take a look again?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @XiangpengHao -- I think this is really close.

I think the tests need a few tweaks and fix the CI but then this will be good to go.

Thank you again so much 🙏

arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
arrow-ipc/src/reader.rs Show resolved Hide resolved
arrow-ipc/src/reader.rs Show resolved Hide resolved
@XiangpengHao
Copy link
Contributor Author

Updated the tests! I believe the CI failure is not related to this pr here..

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good to me -- thank you @XiangpengHao 🙏

We can keep iterating in subsequent PRs I think

@@ -1247,6 +1291,22 @@ fn write_array_data(
compression_codec,
)?;
}
} else if matches!(data_type, DataType::BinaryView | DataType::Utf8View) {
// Slicing the views buffer is safe and easy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb mentioned this pull request Mar 28, 2024
@alamb
Copy link
Contributor

alamb commented Mar 28, 2024

Updated the tests! I believe the CI failure is not related to this pr here..

I agree it doesn't look related. I made #5564 to test this theory

Update: the CI fails on main. as well, filed #5565

@alamb
Copy link
Contributor

alamb commented Mar 28, 2024

Update: the CI fails on main. as well, filed #5565 -- I'll try and look at it in a day or two if no one beats me to it

@alamb
Copy link
Contributor

alamb commented Mar 31, 2024

I took the liberty of merging up from master to this branch to hopefully get a clean CI run

@XiangpengHao
Copy link
Contributor Author

Thanks @alamb the CI passed!

@alamb alamb merged commit 17058c7 into apache:master Apr 1, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 1, 2024

Thanks again @XiangpengHao -- this is a very nice step forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants