Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix a bug in how definition levels are calculated for nested structs in a list #1185

Merged
merged 2 commits into from
Jan 17, 2022

Conversation

helgikrs
Copy link
Contributor

@helgikrs helgikrs commented Jan 15, 2022

Which issue does this PR close?

Closes #1184.

What changes are included in this PR?

Using the definition level and the nullability of the column only produces the
correct indices if max_definition - 1 is the list level. For deeper nesting
(struct in a list) this produces incorrect indices, silently causing incorrect
data to be written.

This PR fixes the bug by using the array offsets to compute the indices
instead, also adds a regression test.

Are there any user-facing changes?

No.

Using the definition level and the nullability of the column only
produces the correct indices if max_definition - 1 is the list level.
For deeper nesting (struct in a list) this produces incorrect indices,
silently causing incorrect data to be written.

This fix uses the array offsets to compute the indices instead.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 15, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2022

Codecov Report

Merging #1185 (8714ad5) into master (66b84f3) will increase coverage by 0.00%.
The diff coverage is 95.45%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1185   +/-   ##
=======================================
  Coverage   82.66%   82.67%           
=======================================
  Files         173      173           
  Lines       50902    50916   +14     
=======================================
+ Hits        42077    42093   +16     
+ Misses       8825     8823    -2     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 84.56% <95.45%> (+0.28%) ⬆️
arrow/src/array/transform/mod.rs 85.43% <0.00%> (-0.14%) ⬇️
arrow/src/buffer/immutable.rs 98.92% <0.00%> (ø)
arrow/src/datatypes/native.rs 66.66% <0.00%> (ø)
parquet/src/encodings/encoding.rs 93.71% <0.00%> (+0.19%) ⬆️
parquet_derive/src/parquet_field.rs 66.43% <0.00%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66b84f3...8714ad5. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mega kudos for working this one out, my head hurts now 😆

I'm afraid I can't approve PRs on this repo, but if I could I'd give it a ✔️

@@ -759,13 +759,6 @@ impl LevelInfo {
/// Given a level's information, calculate the offsets required to index an array correctly.
pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
// happy path if not dealing with lists
let is_nullable = match self.level_type {
LevelType::Primitive(is_nullable) => is_nullable,
_ => panic!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assertion is still important - this method will return nonsense if called on something that isn't a leaf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Added the assertion back.

if *def >= self.max_definition - is_nullable as i16 {
index += 1;

for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {
Copy link
Contributor

@tustvold tustvold Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.

The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of LevelInfo that it then writes. Each LevelInfo identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.

It therefore iterates through the definition levels and where def == self.max_definition it records that index as one that needs to be written to parquet.

The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. def >= max_def - is_nullable would increase the index position in the source primitive array.

Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, a null in a ListArray may not take up space in its child array, but a null in a StructArray does.

By using the array_offsets instead of the nesting level, we can continue to handle list array nulls that don't take up space, whilst handling StructArray nulls that do. 👍

As an added bonus I think this fixes a related bug, where a ListArray with a non-zero length null slot would break for similar reasons


for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {
if len == 0 {
// Skip this definition level (it should be less than max_definition)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this code should verify that it is indeed a null? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an assert :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't follow the details of this PR fully, I trust that @tustvold and @helgikrs know what they are doing

Thank you both for the contributions

@alamb alamb merged commit 85edafc into apache:master Jan 17, 2022
@alamb alamb changed the title fix: Fix a bug in how filter indices are calculated fix: Fix a bug in how definition levels are calculated for nested structs in a list Jan 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing structs nested in lists produces an incorrect output
4 participants