Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590

Closed
wants to merge 1 commit into from

Conversation

carols10cents
Copy link
Contributor

@carols10cents carols10cents commented Nov 4, 2020

The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their ArrayData's BufferData capacities.

This leaves BufferData's PartialEq implementation as-is (that is, taking the equality of capacity into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of Buffer.

This does change PartialEq for ArrayData to not assert to fix the problems noted by @jorgecarleitao .

This has the downside of making test failures where ArrayDatas should be equal but aren't harder to interpret because the assert_eq!s aren't as granular. We can't use debug_assert_eq! either, because then we can't write tests that do assert_ne!(array_data1, array_data2) (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

@github-actions
Copy link

github-actions bot commented Nov 4, 2020

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Nov 4, 2020

Could you describe where do we need to compare buffers?

I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a Buffer is purely a physical representation of data without any logical interpretation: two buffers from two different arrays can be equal and the arrays still be different (e.g. due to the nullability or offset), and the opposite is also true: two arrays can be equal but have different buffers (e.g. if the child data is different, or if the datatype is different).

My current hypothesis is that equality should be done via ArrayData, which contains all the relevant data and logical representation of that data -- DataType, to make it possible to logically compare two arrays. I fielded #8541 with that idea, which IMO will also help the parquet work. I think that @nevi-me also had some ideas in mind.

@paddyhoran
Copy link
Contributor

I can't find the reference now but this was discussed some time ago on a PR. The thinking for taking capacity into account when comparing Buffer's was that array equality, etc. would be done at a higher level as @jorgecarleitao described. I think that @sunchao, in particular, favored taking capacity into account when comparing Buffer's. i.e. this is a more low-level equality check.

@paddyhoran
Copy link
Contributor

Here is the PR: #4331

See the comment about half way down (I'm can't seem to link to it directly)

@carols10cents
Copy link
Contributor Author

If I remove the impl PartialEq for BufferData entirely, compiling then complains that the #[derive(PartialEq)] on Buffer can't work. If I remove the derived PartialEq on Buffer, I then get compilation errors in places such as:

I'm not sure enough of those cases whether they do expect the capacity to be equal and mean this PartialEq should stay as-is, or if these cases are being tested at too low a level and should be checking ArrayData instead... please let me know what you all think!

@jorgecarleitao
Copy link
Member

@carols10cents , I do not think that we should remove impl PartialEq for BufferData. I am only trying to understand why would we would like to compare two Buffer and not take the ArrayData::offset into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.

@carols10cents
Copy link
Contributor Author

I do not think that we should remove impl PartialEq for BufferData. I am only trying to understand why would we would like to compare two Buffer and not take the ArrayData::offset into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.

I was not proposing that we remove impl PartialEq for BufferData, I was only doing that to get the compiler to tell me about all the places we're using it.

@carols10cents
Copy link
Contributor Author

Additional data to consider: I'm not confident the implementations are exactly comparable, but the C++ Buffer type does not take its capacity into account in its Equals implementations.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if rather than changing the meaning of what "equal" buffers means in the Arrow library, we could instead update this test. Specifically, rather than checking for buffer equality via

            assert_eq!(expected_data.buffers(), actual_data.buffers());

at https://github.com/apache/arrow/pull/8590/files#diff-e037401ca0787613b6d0b7ea98a14374ab37f2addfd079475fc1c33bba2e252cR922

This test could make a comparison specific to the arrow round trip.

For example:

            assert!(compare_buffers(expected_data.buffers(), actual_data.buffers()).unwrap());

With

    // Use a specialized comparison here to check the data of the
    // buffers, ignoring offsets
    fn compare_buffers(b1: &[Buffer], b2: &[Buffer]) -> std::result::Result<bool, String> {
        // Compare two lists of buffers, ignorning offset
        if b1.len() != b2.len() {
            return Err(format!("Length {} doesn't match length {}", b1.len(), b2.len()));
        }

        for (idx, (buffer1, buffer2)) in b1.iter().zip(b2.iter()).enumerate() {
            if buffer1.data() != buffer2.data() {
                return Err(format!("Buffer data mismatch at index {}", idx));
            }
        }

        Ok(true)
    }

I would ideally suggest using the eq kernel, but that appears to only be implemented for primitive arrays, so the following doesn't work 😢

            let expected_col = expected_batch.column(i);
            let actual_col = actual_batch.column(i);
            let compare = arrow::compute::eq(expected_col, actual_col).expect("Comparison");
            for row in 0..compare.len() {
                assert!(compare.value(row), "Mismatch in batch {} @ row {}", i, row);
            }

@carols10cents carols10cents changed the title ARROW-10042: [Rust] Only compare BufferData's data for equality ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality Nov 6, 2020
@carols10cents
Copy link
Contributor Author

carols10cents commented Nov 6, 2020

@jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change BufferData's PartialEq implementation, and I've added more tests to demonstrate the problems I'm trying to solve. I would appreciate another look and to know what you think!

I've updated the PR title and description. I'm still associating this to the same JIRA ticket as I think this change will "resolve" the question posed in the ticket, but it no longer "fixes" the problem as proposed there.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me

)
&& self.child_data() == other.child_data()
&& self.null_count() == other.null_count()
// null arrays can skip the null bitmap, thus only compare if there are no nulls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure those are @nevi-me 's, I just kept them in there :)

&& self.len() == other.len()
// TODO: when adding tests for this, test that we can compare with arrays that have
// offsets
&& self.offset() == other.offset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the offset does not have to be equal. Consider a buffer [1,2,3,4,1,2,3,4] and two arrays referencing this buffer, one with offset 0 and len 4, the other with offset 4 and len 4. Both arrays should be considered to contain the same 4 elements. This also means the compare_buffers function needs to take the array length as additional parameter.

// null arrays can skip the null bitmap, thus only compare if there are no nulls
// previous line would fail if null counts differed, so this check only needs to
// check one null_count to see if there are no nulls
&& (self.null_count() == 0 || compare_buffer_regions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare_buffers_regions also needs an additional parameter for the array length to be correct since the array could be a slice of some bigger underlying buffers.

I missed the PR that introduced compare_buffers_regions, the way it is written it will only work correctly for bitpacked buffers. That is correct for this call, but I found another usage where it is used for other buffers. It uses the buffer.bit_slice method which is documented to take an offset in bits.

Copy link
Contributor

@jhorstmann jhorstmann Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at that function again, compare_buffers_regions and also PartialEq for ArrayData do assert_eq! instead of returning true/false, that is probably also not intended.

Update: Ok, I see the return type is fixed in this PR.

@nevi-me nevi-me self-requested a review November 7, 2020 05:08
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the changes in this PR need motivation on why do we should continue to pursue having two different implementations of array comparison:

  • impl PartialEq for ArrayData
  • the ArrayEqual trait

IMO there should be one, and only one, implementation of array comparison. Currently we have two, and this PR doubles down on this path. I am not convinced we should continue on this path because it lead us to the path where we will disagree on whether two arrays are equal or not depending on which implementation we use to equate them.

IMO, the root cause of the problem that this PR is trying to solve is exactly the one I just described: we have two implementations of array equality and the tests are failing when one of then is used, even though what is being tested is correct (it is the equality implementation that is wrong). I bet that if we would use make_array and downcast each array to its correct type, the tests would pass.

It was fine to hack PartialEq on the parquet writer branch, because unfortunately we have no way of comparing two ArrayData, but IMO we should now make an effort to ensure that there is one way of comparing two arrays.

As @jhorstmann commented, array comparison is not straightforward as we can't just equate offsets (and many other things). The equality calculation implicitly described in the specification is actually non-trivial, specially for arrays with offset buffers, structs, dictionaries, etc.

#8541 addresses the root cause of this by refactoring the Array equality to depend on a single function that compares two ArrayData, without using the ArrayEqual trait. The motivation for this is exactly what we are trying to solve here: we need to be able to equate two ArrayData because often we operate without the Array trait present (i.e. directly using ArrayData), and the Array equality should derive directly from the equality of ArrayData.

One idea:

  1. Make the array comparison on the parquet tests be done using the PartialEq, like this PR is doing
  2. Make PartialEq use the function equal being introduced on ARROW-10402: [Rust] Refactor array equality #8541
  3. Remove both #[ignore]

I played with this idea here: https://github.com/jorgecarleitao/arrow/pull/20 and it does seem to work: the tests pass with these changes, including the un-ignored ones.

@jorgecarleitao
Copy link
Member

Also, I very much agree with @carols10cents that we need a methodology to compare arrays with granularity :)

@nevi-me
Copy link
Contributor

nevi-me commented Nov 7, 2020

One idea:

  1. Make the array comparison on the parquet tests be done using the PartialEq, like this PR is doing
  2. Make PartialEq use the function equal being introduced on ARROW-10402: [Rust] Refactor array equality #8541
  3. Remove both #[ignore]

I also like this approach, as #8541 now accounts for logical equality

@andygrove andygrove added the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020
@nevi-me
Copy link
Contributor

nevi-me commented Nov 10, 2020

Hi @carols10cents, with the IPC integration and logical equality PRs merged, this should be ready (the Parquet changes).
We can merge it in after you rebase :)

@carols10cents
Copy link
Contributor Author

@nevi-me done! The only changes left here are in tests :) Thank you all for the work related to this!

@nevi-me nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020
When compared using the PartialEq implementation, two Buffers that only
differ in their capacity are considered unequal. This comparison is
relevant when testing implementation details of Buffer and BufferData.

When comparing Buffers that are part of ArrayData, the Buffer capacity
should NOT be taken into account: the logical equality of the data is
computed differently.
@alamb
Copy link
Contributor

alamb commented Nov 10, 2020

@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so?

@alamb
Copy link
Contributor

alamb commented Nov 10, 2020

(though the title may need to be updated)

@carols10cents carols10cents changed the title ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality Nov 10, 2020
@carols10cents
Copy link
Contributor Author

(updated)

@nevi-me
Copy link
Contributor

nevi-me commented Nov 10, 2020

@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so?

You may go ahead and merge it :) I was waiting for integration tests to complete, but I'm about to head to sleep now

@alamb
Copy link
Contributor

alamb commented Nov 10, 2020

The tests have passed:
Screen Shot 2020-11-10 at 5 53 11 PM

I'll merge it

@alamb alamb closed this in 14501b2 Nov 10, 2020
yordan-pavlov pushed a commit to yordan-pavlov/arrow that referenced this pull request Nov 14, 2020
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes apache#8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: alamb <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes apache#8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: alamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants