-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590
Conversation
Could you describe where do we need to compare buffers? I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a My current hypothesis is that equality should be done via |
I can't find the reference now but this was discussed some time ago on a PR. The thinking for taking |
Here is the PR: #4331 See the comment about half way down (I'm can't seem to link to it directly) |
If I remove the
I'm not sure enough of those cases whether they do expect the |
@carols10cents , I do not think that we should remove |
I was not proposing that we remove |
Additional data to consider: I'm not confident the implementations are exactly comparable, but the C++ Buffer type does not take its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if rather than changing the meaning of what "equal" buffers means in the Arrow library, we could instead update this test. Specifically, rather than checking for buffer equality via
assert_eq!(expected_data.buffers(), actual_data.buffers());
This test could make a comparison specific to the arrow round trip.
For example:
assert!(compare_buffers(expected_data.buffers(), actual_data.buffers()).unwrap());
With
// Use a specialized comparison here to check the data of the
// buffers, ignoring offsets
fn compare_buffers(b1: &[Buffer], b2: &[Buffer]) -> std::result::Result<bool, String> {
// Compare two lists of buffers, ignorning offset
if b1.len() != b2.len() {
return Err(format!("Length {} doesn't match length {}", b1.len(), b2.len()));
}
for (idx, (buffer1, buffer2)) in b1.iter().zip(b2.iter()).enumerate() {
if buffer1.data() != buffer2.data() {
return Err(format!("Buffer data mismatch at index {}", idx));
}
}
Ok(true)
}
I would ideally suggest using the eq
kernel, but that appears to only be implemented for primitive arrays, so the following doesn't work 😢
let expected_col = expected_batch.column(i);
let actual_col = actual_batch.column(i);
let compare = arrow::compute::eq(expected_col, actual_col).expect("Comparison");
for row in 0..compare.len() {
assert!(compare.value(row), "Mismatch in batch {} @ row {}", i, row);
}
6c82060
to
4d57eb2
Compare
@jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change I've updated the PR title and description. I'm still associating this to the same JIRA ticket as I think this change will "resolve" the question posed in the ticket, but it no longer "fixes" the problem as proposed there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good to me
rust/arrow/src/array/data.rs
Outdated
) | ||
&& self.child_data() == other.child_data() | ||
&& self.null_count() == other.null_count() | ||
// null arrays can skip the null bitmap, thus only compare if there are no nulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure those are @nevi-me 's, I just kept them in there :)
rust/arrow/src/array/data.rs
Outdated
&& self.len() == other.len() | ||
// TODO: when adding tests for this, test that we can compare with arrays that have | ||
// offsets | ||
&& self.offset() == other.offset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the offset does not have to be equal. Consider a buffer [1,2,3,4,1,2,3,4]
and two arrays referencing this buffer, one with offset 0 and len 4, the other with offset 4 and len 4. Both arrays should be considered to contain the same 4 elements. This also means the compare_buffers
function needs to take the array length as additional parameter.
rust/arrow/src/array/data.rs
Outdated
// null arrays can skip the null bitmap, thus only compare if there are no nulls | ||
// previous line would fail if null counts differed, so this check only needs to | ||
// check one null_count to see if there are no nulls | ||
&& (self.null_count() == 0 || compare_buffer_regions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare_buffers_regions
also needs an additional parameter for the array length to be correct since the array could be a slice of some bigger underlying buffers.
I missed the PR that introduced compare_buffers_regions
, the way it is written it will only work correctly for bitpacked buffers. That is correct for this call, but I found another usage where it is used for other buffers. It uses the buffer.bit_slice
method which is documented to take an offset in bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at that function again, compare_buffers_regions
and alsoPartialEq for ArrayData
do assert_eq!
instead of returning true/false, that is probably also not intended.
Update: Ok, I see the return type is fixed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the changes in this PR need motivation on why do we should continue to pursue having two different implementations of array comparison:
impl PartialEq for ArrayData
- the
ArrayEqual
trait
IMO there should be one, and only one, implementation of array comparison. Currently we have two, and this PR doubles down on this path. I am not convinced we should continue on this path because it lead us to the path where we will disagree on whether two arrays are equal or not depending on which implementation we use to equate them.
IMO, the root cause of the problem that this PR is trying to solve is exactly the one I just described: we have two implementations of array equality and the tests are failing when one of then is used, even though what is being tested is correct (it is the equality implementation that is wrong). I bet that if we would use make_array
and downcast each array to its correct type, the tests would pass.
It was fine to hack PartialEq
on the parquet writer branch, because unfortunately we have no way of comparing two ArrayData
, but IMO we should now make an effort to ensure that there is one way of comparing two arrays.
As @jhorstmann commented, array comparison is not straightforward as we can't just equate offsets
(and many other things). The equality calculation implicitly described in the specification is actually non-trivial, specially for arrays with offset buffers, structs, dictionaries, etc.
#8541 addresses the root cause of this by refactoring the Array
equality to depend on a single function that compares two ArrayData
, without using the ArrayEqual
trait. The motivation for this is exactly what we are trying to solve here: we need to be able to equate two ArrayData
because often we operate without the Array
trait present (i.e. directly using ArrayData
), and the Array
equality should derive directly from the equality of ArrayData
.
One idea:
- Make the array comparison on the parquet tests be done using the
PartialEq
, like this PR is doing - Make
PartialEq
use the functionequal
being introduced on ARROW-10402: [Rust] Refactor array equality #8541 - Remove both
#[ignore]
I played with this idea here: https://github.com/jorgecarleitao/arrow/pull/20 and it does seem to work: the tests pass with these changes, including the un-ignored ones.
Also, I very much agree with @carols10cents that we need a methodology to compare arrays with granularity :) |
I also like this approach, as #8541 now accounts for logical equality |
Hi @carols10cents, with the IPC integration and logical equality PRs merged, this should be ready (the Parquet changes). |
4d57eb2
to
32c7c38
Compare
@nevi-me done! The only changes left here are in tests :) Thank you all for the work related to this! |
When compared using the PartialEq implementation, two Buffers that only differ in their capacity are considered unequal. This comparison is relevant when testing implementation details of Buffer and BufferData. When comparing Buffers that are part of ArrayData, the Buffer capacity should NOT be taken into account: the logical equality of the data is computed differently.
32c7c38
to
159648e
Compare
@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so? |
(though the title may need to be updated) |
(updated) |
You may go ahead and merge it :) I was waiting for integration tests to complete, but I'm about to head to sleep now |
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for. This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities. This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`. This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao . This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR. Closes apache#8590 from carols10cents/bufferdata-eq Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: alamb <[email protected]>
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for. This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities. This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`. This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao . This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR. Closes apache#8590 from carols10cents/bufferdata-eq Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: alamb <[email protected]>
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.
This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their
ArrayData
'sBufferData
capacities.This leaves
BufferData
'sPartialEq
implementation as-is (that is, taking the equality ofcapacity
into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation ofBuffer
.This does change
PartialEq
forArrayData
to not assert to fix the problems noted by @jorgecarleitao .This has the downside of making test failures where
ArrayData
s should be equal but aren't harder to interpret because theassert_eq!
s aren't as granular. We can't usedebug_assert_eq!
either, because then we can't write tests that doassert_ne!(array_data1, array_data2)
(I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.