-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test and fix for bug writing a null list. #1477
Test and fix for bug writing a null list. #1477
Conversation
19d242e
to
308e79e
Compare
Codecov Report
@@ Coverage Diff @@
## master #1477 +/- ##
=======================================
Coverage 82.70% 82.70%
=======================================
Files 187 187
Lines 54208 54233 +25
=======================================
+ Hits 44832 44855 +23
- Misses 9376 9378 +2
Continue to review full report at Codecov.
|
@@ -757,7 +757,10 @@ impl LevelInfo { | |||
|
|||
/// Given a level's information, calculate the offsets required to index an array correctly. | |||
pub(crate) fn filter_array_indices(&self) -> Vec<usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a test case for list array for filter_array_indices
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds sensible thanks @viirya I will try adding one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function only makes sense for writing leaf arrays, and so I worry that whilst this may fix the case of a List of nulls, it will behave erratically if given a list of non-nulls. I'll see if I can work out why this method is getting called at all
Edit: Well this sure looks suspect - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I have a fix but whilst writing this I've found what appears to be yet another bug 😞
Unfortunately it is EOD here, but I'll pick this back up first thing tomorrow (GMT) and see if I can't get a PR to fix both this bug and the one I've just discovered
@@ -1822,4 +1821,22 @@ mod tests { | |||
// filter_array_indices should return the same indices in this case. | |||
assert_eq!(level1.filter_array_indices(), level2.filter_array_indices()); | |||
} | |||
|
|||
#[test] | |||
fn test_filter_indices_for_lists() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the machinery of levels so I don't know if I have made a nonsense test here or not. It passes ... but I don't know if that means anything yet. Does it seem sensible? // @viirya
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with definition/repetition level computation. I will take a look at the algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my fault on the complexity here. Let's see if I can interpret the LevelInfo
.
A nullable list will have 3 levels if its child array is also nullable.
So with the def levels:
- 0: null list value
null
- 1: empty list
[]
- 2: list with null
[null]
- 3: list with value
[1]
Here's a comment earlier in this file (https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L402-L407). In this case, l1 = 0
, l2 = 1
, l3 = 2
. With l3
, the actual emptiness of the value (null
or 1
above) gets determined by the child.
If the child is a non-null array, then def = 3
isn't needed as it'd always be true, so we'd only have def.max() == 2
. With the same pattern, if the list is non-null, then we only need 2 levels.
- 0: empty list
- 1: list with value
The repetition levels let us know how nested the list is, so 0 is the start of a new list item, and 1 is a continuation.
If we have repetition like: [0, 1, 2, 1, 0, 2]
plus its definitions, we can both construct the offsets (meaning that LevelInfo
is inefficient as we don't really need its array_offsets
) as follows:
Offsets = every value that's 0 is a new record, but if there's a 0, we have to check if the definition has an empty value for that record.
So we start with 2 list values:
[0, 1, 2, 1]
[0, 2]
Then we expand further to:
[ [0], [1, 2], [1] ] -> 3 nested values in the list slot
[ [0, 2] ] -> 1 nested value in the list slot
Using the offsets on your test (with reps shown):
1. [0, 1, 2] (rep: [0, 1, 1])
2. [] (rep: [0])
3. [3, 4, 5] (rep: [0, 1, 1])
So in this case filter_array_indices
is giving us the indices that have values, hence the [0, 1, 2, 3, 4, 5] result.
So maybe a good approach for null list values is to add definition = 2. I'd expect whichever index that's got a def of 2 or less to be excluded.
I'll be able to have a look at this PR tomorrow morning GMT, so I can tweak it or answer questions in detail then if you don't mind @novemberkilo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW if you're interested in understanding what these repetition levels actually are there is a good explanation of how the parquet record shredding works here.
I also attempted to explain to myself what this function is doing which may be helpful - https://github.com/apache/arrow-rs/pull/1185/files#r785357233
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I tried to study the repetition/definition level algorithm a little bit. Seems this issue is not at filter_array_indices
, although the error is bumped out there.
Seems we should not have List
level info there for this case.
It comes from https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203
, we reuse the level info for the list. Looks suspect as @tustvold mentioned.
Thanks folks for engaging with this. Looks like the attention and discussion is flushing out the correct fix. I am relatively new to this project and while this is excellent at helping me learn, I wonder if this issue is best left to @tustvold and @viirya to tackle correctly. Happy to close this PR in favour of their work. |
Ok I've filed #1481 which I think should fix the underlying bugs at play here. Thanks again @novemberkilo for reporting and drawing attention to this issue 👍 FWIW if you're interested in helping flesh out the rather limited support for nested types I would be more than happy to assist with filing tickets, reviews, etc... There is definitely a large amount of improvement that could be made in this area, what we support, what we test, etc... and I would be more than happy to assist where possible, just let me know 😄 |
Closing in favour of #1481 |
Which issue does this PR close?
Closes #1036
What changes are included in this PR?
A test and a minor fix. I'm not sure if the fix is a bit naive or not.
This was blocked by #1399 but now that the reader support is in for null lists, I am resurrecting this work.