-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Null Mask Handling in ArrayData
, UnionArray
, and MapArray
#1589
Conversation
|
||
// Calculate a list child's logical bitmap/buffer | ||
#[inline] | ||
fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need this code, as we simply skip over null slices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you mean we skip over null slices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only call equal_ranges on the non null ranges
@@ -282,8 +282,7 @@ fn equal_range( | |||
rhs_start: usize, | |||
len: usize, | |||
) -> bool { | |||
utils::base_equal(lhs, rhs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils::base_equal
does two things:
- Compares that the data types are equal
- Compares that the lengths are equal
The first is not necessarily correct, as MapArrays are agnostic to field names. The second is also not necessarily correct in the presence of non-empty null slices. I therefore decided to just remove this and simplify the code.
Ultimately it isn't clear to me what this check could catch that wouldn't represent some violation of a variant somewhere else. If a nested type is inconsistent with the length, type, etc... of its parent then you've got UB anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually equal
already does utils::base_equal
. I guess it is okay to remove it from here, although I'm not pretty sure.
arrow/src/array/equal/utils.rs
Outdated
@@ -47,8 +47,13 @@ pub(super) fn equal_nulls( | |||
rhs_start: usize, | |||
len: usize, | |||
) -> bool { | |||
let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len); | |||
let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len); | |||
let lhs_null_count = count_nulls(lhs_nulls, lhs_start + lhs.offset(), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equal_bits a few lines is correctly taking into account the offset, but this line wasn't => excitement
@@ -137,16 +142,6 @@ pub(super) fn child_logical_null_buffer( | |||
Bitmap::from(Buffer::from(vec![0b11111111; ceil])) | |||
}); | |||
match parent_data.data_type() { | |||
DataType::List(_) | DataType::Map(_, _) => Some(logical_list_bitmap::<i32>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed, as null slices are skipped over in parent
More stuff is broken 😭 |
Various UnionArray fixes (apache#1598) (apache#1596) (apache#1591) (apache#1590) Fix handling of null masks in ArrayData equality (apache#1599)
I will continue to add tests, and file bugs for the various things this fixes, but I think the core is ready for review now |
Codecov Report
@@ Coverage Diff @@
## master #1589 +/- ##
==========================================
+ Coverage 82.88% 83.04% +0.16%
==========================================
Files 193 193
Lines 55307 55276 -31
==========================================
+ Hits 45839 45902 +63
+ Misses 9468 9374 -94
Continue to review full report at Codecov.
|
Thank you @tustvold for fixing these. I will begin to review this today. |
@@ -522,29 +516,29 @@ mod tests { | |||
match i { | |||
0 => { | |||
let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap(); | |||
assert!(!union.is_null(i)); | |||
assert!(!slot.is_null(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good to check null in the current slot, but does union.is_null(index)
not work anymore? Can we check both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nullability is only a property of the slot, and not the Union itself, at least that is my interpretation of the specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://arrow.apache.org/docs/format/Columnar.html#union-layout
Says
Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.
Which seems consistent 👍
However, it seems like UnionArray
doesn't override is_null
to take this into account. Filed #1625 to track
let mut builder = UnionBuilder::new_sparse(2); | ||
builder.append::<Float32Type>("a", 1.0).unwrap(); | ||
let err = builder.append::<Int32Type>("a", 1).unwrap_err().to_string(); | ||
assert!(err.contains("Attempt to write col \"a\" with type Int32 doesn't match existing type Float32"), "{}", err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
arrow/src/array/builder.rs
Outdated
/// A builder for the bitmap if required (for Sparse Unions) | ||
bitmap_builder: Option<BooleanBufferBuilder>, | ||
bitmap_builder: BooleanBufferBuilder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment seems out-of-date now?
arrow/src/array/builder.rs
Outdated
.null_bit_buffer(bitmap_builder.finish()); | ||
// .build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// .build();
seems redundant and can be removed?
match bitmap_builder { | ||
Some(mut bb) => arr_data_builder | ||
.null_bit_buffer(bb.finish()) | ||
.build_unchecked(), | ||
None => arr_data_builder.build_unchecked(), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more clear now.
/// Can contain a null bitmask | ||
pub can_contain_null_mask: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the C++ structure definition doesn't have this, I think it is fine to have it here.
let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len); | ||
let rhs_null_count = count_nulls(rhs.null_buffer(), rhs_start + rhs.offset(), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do we assume offset
is zero previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in lots of places 😅
@@ -282,8 +282,7 @@ fn equal_range( | |||
rhs_start: usize, | |||
len: usize, | |||
) -> bool { | |||
utils::base_equal(lhs, rhs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually equal
already does utils::base_equal
. I guess it is okay to remove it from here, although I'm not pretty sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me and makes some places more clear and simpler. It is better to have more eyes on the changes too.
I will try and review this more carefully tomorrow. cc @jhorstmann who might also be interested |
Looks good. This is some really gnarly code and last time I tried to improve it I gave up since it was never clear whether the start variable already include offsets or where the offsets have to be applied. I have two tests in #1499 where |
After the suggestion above, my remaining test failure was because of the |
Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?
I presume you mean that the previous behaviour was wrong, and is now fixed, and not that this has broken behaviour that was previously correct?
I think the logic here will consider them not equal, are you suggesting they should be considered equal? |
I created #1611 for validating the
Sorry, my description probably was not very clear. In one test I had to compare individual elements of a ListArray because the previous equality implementation did not work with different offsets. With the changes from this PR that test can now use a simple The other test with Whether we should change the comparision semantic to exclude the nullable flag is probably out of scope for this PR, I don't have a strong opinion either way. A big "Thank You" for diving into this code and fixing it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this code a thorough review -- I am not the biggest expert in the low level details of what is going on internally but the changes made sense to me.
I also carefully checked all test changes and they looked good, so assuming the tests have good coverage I feel confident in this change.
Of course, our work to work to clean up nullness is likely not done with this PR but it I think things are better.
My only concerns are:
-
the change to
append_null
that requires a field name (as that is a public facing API). However, we can always change it in the future I suppose. -
UnionArray::is_null
incorrect #1625 seems non bad -- I think we could remove several changes in this PR (so it could callunion.is_null()
rather thanslot.is_null()
if we fixed that
Thank you @tustvold @viirya and @jhorstmann for driving this
cc @bjchambers who I think also has had much fun with offsets and @nevi-me who spent a bunch of time with this code I think
@@ -522,29 +516,29 @@ mod tests { | |||
match i { | |||
0 => { | |||
let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap(); | |||
assert!(!union.is_null(i)); | |||
assert!(!slot.is_null(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://arrow.apache.org/docs/format/Columnar.html#union-layout
Says
Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.
Which seems consistent 👍
However, it seems like UnionArray
doesn't override is_null
to take this into account. Filed #1625 to track
@@ -1894,23 +1894,19 @@ struct FieldData { | |||
values_buffer: Option<MutableBuffer>, | |||
/// The number of array slots represented by the buffer | |||
slots: usize, | |||
/// A builder for the bitmap if required (for Sparse Unions) | |||
bitmap_builder: Option<BooleanBufferBuilder>, | |||
/// A builder for the null bitmap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
}; | ||
self.len += 1; | ||
Ok(()) | ||
pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize from an implementation perspective why you chose to require appending a null by field name, but it seems kind of a messy API -- what do you think about potentially appending a null to the first child or something by default?
Perhaps something to think about for a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short answer is that UnionArray nulls are typed and so two arrays are not equal if they have nulls in different child arrays.
There is also the case of appending a null to an empty UnionArray...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- tried to encode that insight in comments as part of #1628
@@ -560,7 +554,7 @@ mod tests { | |||
builder.append::<Int32Type>("a", 1).unwrap(); | |||
builder.append::<Int64Type>("c", 3).unwrap(); | |||
builder.append::<Int32Type>("a", 10).unwrap(); | |||
builder.append_null().unwrap(); | |||
builder.append_null::<Int32Type>("a").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is strange to require a field name to append nulls -- I left a comment on how to improve things perhaps in the future
@@ -621,6 +621,13 @@ impl ArrayData { | |||
// Check that the data layout conforms to the spec | |||
let layout = layout(&self.data_type); | |||
|
|||
if !layout.can_contain_null_mask && self.null_bitmap.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -39,8 +36,8 @@ pub(super) fn fixed_binary_equal( | |||
let lhs_values = &lhs.buffers()[0].as_slice()[lhs.offset() * size..]; | |||
let rhs_values = &rhs.buffers()[0].as_slice()[rhs.offset() * size..]; | |||
|
|||
let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len); | |||
let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len); | |||
let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes a lot more sense to look at the buffer's nulls directly
builder.append::<Int32Type>("A", 34).unwrap(); | ||
let array = builder.build().unwrap(); | ||
|
||
let filter_array = BooleanArray::from(vec![true, false, true, false]); | ||
let c = filter(&array, &filter_array).unwrap(); | ||
let filtered = c.as_any().downcast_ref::<UnionArray>().unwrap(); | ||
|
||
let mut builder = UnionBuilder::new_dense(1); | ||
let mut builder = UnionBuilder::new_sparse(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this was changed because the name of the test is test_filter_union_array_sparse_with_nulls
but we were using a dense
array previously
After some thought, I would like to merge this PR to handle all the undefined behavior bugs While there are more potential follow on things to do to improve union and null handling, this seems like a significant improvement. Unless there are any objections, I will plan to merge it later today |
🚀 |
ArrayData
, UnionArray
, and MapArray
* Update version to 13.0.0 * Draft changelog * updates * moar * cleanup * remove duplicated entry, update for #1589 * Final updates
Which issue does this PR close?
Closes #1599
Closes #1598
Closes #1596
Closes #1590
Closes #1591
Closes #626
Rationale for this change
See tickets, unfortunately these changes all ended up interacting with each other and so I've lumped them into a single PR.
What changes are included in this PR?
ArrayData Equality
The currently ArrayData equality code attempts to construct a logical bitmask, that propogates the nullability of the parent down to its children. Certain arrays, however, may have a different number of children than their parent, e.g. dense union array or list arrays. This requires recomputing the mask to factor this in. In some cases, this was computing a mask that took account of the offset of the child, but in others it was not. I also don't think any of the codepaths took account for if the child also contained an offset.
Taking a step back, none of the codepaths are actually comparing runs of nulls, instead they fall back to iterating per value if the array contains nulls. For example,
As a result we don't actually need this complexity, and by removing it we avoid ambiguity about what offset the null mask has, what the null mask is, etc...
UnionArray Nullability
Whilst working on this I realised that UnionArray's don't actually have a parent validity mask, as this change would have implications for the equality logic I lumped it into this PR.
Are there any user-facing changes?
Array equality should be more correct, and UnionArrays slightly less broken