Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test and fix for bug writing a null list. #1477

Closed

Conversation

novemberkilo
Copy link
Contributor

Which issue does this PR close?

Closes #1036

What changes are included in this PR?

A test and a minor fix. I'm not sure if the fix is a bit naive or not.

This was blocked by #1399 but now that the reader support is in for null lists, I am resurrecting this work.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 23, 2022
@novemberkilo novemberkilo force-pushed the topic/null-list-writer-fix branch from 19d242e to 308e79e Compare March 23, 2022 04:52
@novemberkilo
Copy link
Contributor Author

Possibly of interest @viirya @sunchao @tustvold @alamb ?

@codecov-commenter
Copy link

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #1477 (1948d4f) into master (d384836) will increase coverage by 0.00%.
The diff coverage is 96.15%.

@@           Coverage Diff           @@
##           master    #1477   +/-   ##
=======================================
  Coverage   82.70%   82.70%           
=======================================
  Files         187      187           
  Lines       54208    54233   +25     
=======================================
+ Hits        44832    44855   +23     
- Misses       9376     9378    +2     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 84.88% <91.66%> (+0.21%) ⬆️
parquet/src/arrow/arrow_writer.rs 97.60% <100.00%> (+0.04%) ⬆️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs 86.35% <0.00%> (-0.23%) ⬇️
arrow/src/datatypes/datatype.rs 66.80% <0.00%> (+0.39%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d384836...1948d4f. Read the comment docs.

parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
@@ -757,7 +757,10 @@ impl LevelInfo {

/// Given a level's information, calculate the offsets required to index an array correctly.
pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a test case for list array for filter_array_indices ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds sensible thanks @viirya I will try adding one.

Copy link
Contributor

@tustvold tustvold Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function only makes sense for writing leaf arrays, and so I worry that whilst this may fix the case of a List of nulls, it will behave erratically if given a list of non-nulls. I'll see if I can work out why this method is getting called at all

Edit: Well this sure looks suspect - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203

Copy link
Contributor

@tustvold tustvold Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I have a fix but whilst writing this I've found what appears to be yet another bug 😞

Unfortunately it is EOD here, but I'll pick this back up first thing tomorrow (GMT) and see if I can't get a PR to fix both this bug and the one I've just discovered

@@ -1822,4 +1821,22 @@ mod tests {
// filter_array_indices should return the same indices in this case.
assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
}

#[test]
fn test_filter_indices_for_lists() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the machinery of levels so I don't know if I have made a nonsense test here or not. It passes ... but I don't know if that means anything yet. Does it seem sensible? // @viirya

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with definition/repetition level computation. I will take a look at the algorithm.

Copy link
Contributor

@nevi-me nevi-me Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my fault on the complexity here. Let's see if I can interpret the LevelInfo.

A nullable list will have 3 levels if its child array is also nullable.

So with the def levels:

  • 0: null list value null
  • 1: empty list []
  • 2: list with null [null]
  • 3: list with value [1]

Here's a comment earlier in this file (https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L402-L407). In this case, l1 = 0, l2 = 1, l3 = 2. With l3, the actual emptiness of the value (null or 1 above) gets determined by the child.

If the child is a non-null array, then def = 3 isn't needed as it'd always be true, so we'd only have def.max() == 2. With the same pattern, if the list is non-null, then we only need 2 levels.

  • 0: empty list
  • 1: list with value

The repetition levels let us know how nested the list is, so 0 is the start of a new list item, and 1 is a continuation.

If we have repetition like: [0, 1, 2, 1, 0, 2] plus its definitions, we can both construct the offsets (meaning that LevelInfo is inefficient as we don't really need its array_offsets) as follows:

Offsets = every value that's 0 is a new record, but if there's a 0, we have to check if the definition has an empty value for that record.

So we start with 2 list values:

[0, 1, 2, 1]
[0, 2]

Then we expand further to:

[ [0], [1, 2], [1] ] -> 3 nested values in the list slot
[ [0, 2] ] -> 1 nested value in the list slot

Using the offsets on your test (with reps shown):

1.  [0, 1, 2] (rep: [0, 1, 1])
2. []            (rep: [0])
3. [3, 4, 5] (rep: [0, 1, 1])

So in this case filter_array_indices is giving us the indices that have values, hence the [0, 1, 2, 3, 4, 5] result.


So maybe a good approach for null list values is to add definition = 2. I'd expect whichever index that's got a def of 2 or less to be excluded.

I'll be able to have a look at this PR tomorrow morning GMT, so I can tweak it or answer questions in detail then if you don't mind @novemberkilo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW if you're interested in understanding what these repetition levels actually are there is a good explanation of how the parquet record shredding works here.

I also attempted to explain to myself what this function is doing which may be helpful - https://github.com/apache/arrow-rs/pull/1185/files#r785357233

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I tried to study the repetition/definition level algorithm a little bit. Seems this issue is not at filter_array_indices, although the error is bumped out there.

Seems we should not have List level info there for this case.

It comes from https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203, we reuse the level info for the list. Looks suspect as @tustvold mentioned.

@novemberkilo
Copy link
Contributor Author

Thanks folks for engaging with this. Looks like the attention and discussion is flushing out the correct fix. I am relatively new to this project and while this is excellent at helping me learn, I wonder if this issue is best left to @tustvold and @viirya to tackle correctly. Happy to close this PR in favour of their work.

@tustvold
Copy link
Contributor

Ok I've filed #1481 which I think should fix the underlying bugs at play here. Thanks again @novemberkilo for reporting and drawing attention to this issue 👍

FWIW if you're interested in helping flesh out the rather limited support for nested types I would be more than happy to assist with filing tickets, reviews, etc... There is definitely a large amount of improvement that could be made in this area, what we support, what we test, etc... and I would be more than happy to assist where possible, just let me know 😄

@novemberkilo
Copy link
Contributor Author

Closing in favour of #1481

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON input barfs on {"emptylist":[]}
5 participants