Fix parquet string statistics generation #643

Merged: 3 commits, Aug 8, 2021

@alamb (Contributor) commented Jul 31, 2021

Which issue does this PR close?

Closes #641

Rationale for this change

Statistics for strings (aka ByteArrays in the parquet format) are not calculated correctly.

For example, given the strings "z" and "aa", the parquet writer determines that "z" is the minimum value because it is shorter, even though "aa" sorts first lexicographically.
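A minimal sketch of the difference, using only the standard library. The `length_first_cmp` function is a hypothetical stand-in for the behavior described above (shorter values sort first); it is not the crate's actual code.

```rust
use std::cmp::Ordering;

/// Hypothetical length-first ordering: a shorter value always sorts before
/// a longer one, and bytes are only consulted when the lengths match.
fn length_first_cmp(a: &[u8], b: &[u8]) -> Ordering {
    a.len().cmp(&b.len()).then_with(|| a.cmp(b))
}

/// Lexicographic ordering: byte slices already implement `Ord` this way.
fn lexicographic_cmp(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}

fn main() {
    let z: &[u8] = b"z";
    let aa: &[u8] = b"aa";

    // Length-first picks "z" as the minimum because it is shorter.
    assert_eq!(length_first_cmp(z, aa), Ordering::Less);
    // Lexicographic comparison picks "aa" because b'a' < b'z'.
    assert_eq!(lexicographic_cmp(z, aa), Ordering::Greater);
}
```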

What changes are included in this PR?

  1. Fix the comparison code for ByteArray to lexicographically compare the data
  2. Add more statistics test coverage (kudos to @crepererum for the test harness in "fix NaN handling in parquet statistics" (#256), which made this easy)
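The first item could look roughly like the sketch below. The `Bytes` type and the `bytes` helper are hypothetical and only illustrate the idea; the real ByteArray in the parquet crate has a different internal representation, and treating an unset buffer as smallest is an assumption made here for illustration.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a byte array whose buffer may be unset.
#[derive(Debug, PartialEq)]
struct Bytes {
    data: Option<Vec<u8>>,
}

impl PartialOrd for Bytes {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match (&self.data, &other.data) {
            (None, None) => Some(Ordering::Equal),
            // Assumption: an unset value sorts before any set value.
            (None, Some(_)) => Some(Ordering::Less),
            (Some(_), None) => Some(Ordering::Greater),
            // Compare byte by byte; a strict prefix sorts before the longer value.
            (Some(a), Some(b)) => Some(a.as_slice().cmp(b.as_slice())),
        }
    }
}

/// Hypothetical helper that builds a `Bytes` from a string.
fn bytes(s: &str) -> Bytes {
    Bytes { data: Some(s.as_bytes().to_vec()) }
}

fn main() {
    assert!(bytes("aa") < bytes("z"));
    assert!(bytes("a") < bytes("aa"));
    assert!(Bytes { data: None } < bytes(""));
}
```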

Correctness

I verified that the Python implementation compares the strings lexicographically as proposed in this PR rather than the current behavior (see #641 (comment)).

I also experimented with how partial_cmp works with Option and I believe this implementation is consistent. You can see it for yourself in this playground
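For reference, the standard library's ordering on Option can be checked with a few assertions. This is a small sketch, not the contents of the playground mentioned above; `cmp_opt` is a hypothetical helper name.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: defer to the standard `PartialOrd` impl on `Option`.
fn cmp_opt(a: Option<&[u8]>, b: Option<&[u8]>) -> Option<Ordering> {
    a.partial_cmp(&b)
}

fn main() {
    let aa = Some("aa".as_bytes());
    let z = Some("z".as_bytes());

    // `None` compares less than any `Some`.
    assert_eq!(cmp_opt(None, aa), Some(Ordering::Less));
    // Two `Some` values compare by their contents, here lexicographically.
    assert_eq!(cmp_opt(aa, z), Some(Ordering::Less));
    // Two `None` values are equal.
    assert_eq!(cmp_opt(None, None), Some(Ordering::Equal));
}
```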

Notes

I could not figure out how to test for "Null" values in statistics (though I tested end to end using the ArrowWriter and confirmed null values didn't appear in the statistics, as expected)

Also, there are three things testing revealed:

  1. BoolType columns produce Int32Stats which I found confusing
  2. FixedLenByteArray columns produce ByteArrayStats
  3. I don't know enough about how Int96 works to properly test it (my attempt failed badly)

Are there any user-facing changes?

Yes: min/max statistics for strings are now computed correctly.

@github-actions (bot) added the parquet label (Changes to the parquet crate) on Jul 31, 2021
}
Some(Ordering::Equal)
}
// sort nulls first (consistent with PartialCmp on Option)
Contributor Author

The previous implementation of PartialOrd seems to have been introduced in 48c3771 in apache/arrow#7622 by

It would be great if @zeevm or @sunchao had any additional context or comment to share.

@sunchao (Member), Aug 1, 2021

Good catch! The original PR is apache/arrow#7586, but yeah, we missed this in the review and I think the existing logic is incorrect.

let stats = statistics_roundtrip::<BoolType>(&[true, false, false, true]);
assert!(stats.has_min_max_set());
// should this be BooleanStatistics??
if let Statistics::Int32(stats) = stats {
Contributor Author

note 1: it is strange that a Boolean column produces Int32Stats (and not BooleanStats)

Contributor Author

Filed #659 to track


let stats = statistics_roundtrip::<FixedLenByteArrayType>(&input);
assert!(stats.has_min_max_set());
// should it be FixedLenByteArray?
Contributor Author

note 2: it is strange that a FixedLenByteArray column produces ByteArrayStats (and not FixedLenByteArrayStats)

Contributor Author

Filed #660

}
}

// // TODO test int 96 stats -- this was failing
Contributor Author

Note 3: I have no idea if this test is incorrect or if there is something wrong with the Int96 implementation. As Int96 seems to be deprecated, I don't plan to spend much more time on it

Member

I think we should still keep INT96 for backward compatibility (it is still used today). In parquet-mr this is done by converting the binary into signed ints (see here).
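One possible reading of "converting the binary into signed ints" is sketched below. It is not taken from parquet-mr: the function name is hypothetical, and the big-endian two's complement byte order is an assumption for illustration that may not match how INT96 values are laid out in Parquet files.

```rust
/// Interpret 12 bytes as a signed 96-bit integer, widened to i128.
/// ASSUMPTION: big-endian two's complement; the real INT96 layout may differ.
fn int96_as_signed(bytes: [u8; 12]) -> i128 {
    // Fill the upper four bytes with the sign bit so the value keeps its sign.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut wide = [fill; 16];
    wide[4..].copy_from_slice(&bytes);
    i128::from_be_bytes(wide)
}

fn main() {
    let minus_one = [0xFF_u8; 12];
    let mut one = [0_u8; 12];
    one[11] = 1;

    // Compared as raw bytes, 0xFF.. would sort last; as signed ints it sorts first.
    assert!(minus_one > one);
    assert!(int96_as_signed(minus_one) < int96_as_signed(one));
}
```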

@codecov-commenter

Codecov Report

Merging #643 (7c573c8) into master (e84fe20) will increase coverage by 0.04%.
The diff coverage is 87.09%.

@@            Coverage Diff             @@
##           master     #643      +/-   ##
==========================================
+ Coverage   82.47%   82.52%   +0.04%     
==========================================
  Files         167      168       +1     
  Lines       46454    47280     +826     
==========================================
+ Hits        38314    39016     +702     
- Misses       8140     8264     +124     
Impacted Files                     Coverage Δ
parquet/src/data_type.rs           77.80% <85.71%> (+0.16%) ⬆️
parquet/src/column/writer.rs       92.93% <87.27%> (-0.37%) ⬇️
arrow/src/array/equal_json.rs      88.69% <0.00%> (-2.52%) ⬇️
parquet/src/arrow/schema.rs        87.35% <0.00%> (-1.18%) ⬇️
arrow/src/datatypes/datatype.rs    66.08% <0.00%> (-0.74%) ⬇️
arrow/src/array/equal/mod.rs       93.45% <0.00%> (-0.48%) ⬇️
arrow/src/datatypes/field.rs       51.36% <0.00%> (-0.47%) ⬇️
arrow/src/array/builder.rs         85.97% <0.00%> (-0.29%) ⬇️
arrow/src/ipc/convert.rs           92.93% <0.00%> (-0.06%) ⬇️
arrow/src/json/reader.rs           83.99% <0.00%> (-0.04%) ⬇️
... and 14 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb requested a review from sunchao August 1, 2021 10:30
@@ -1356,7 +1351,7 @@ mod tests {
let ba4 = ByteArray::from(vec![]);
let ba5 = ByteArray::from(vec![2, 2, 3]);

-assert!(ba1 > ba2);
+assert!(ba1 < ba2);
Contributor

👍

@alamb (Contributor Author) commented Aug 5, 2021

Thanks for the review -- I plan to polish this one up tomorrow

@alamb force-pushed the alamb/parquet_stats_bug branch from 7c573c8 to d0c3e93 on August 5, 2021 12:41
}

#[test]
fn test_int96_statistics() {
Contributor Author

Fixed Int96 stats test

@alamb (Contributor Author) commented Aug 5, 2021

@nevi-me / @sunchao I have resolved all outstanding questions in this PR and I think it is ready for review

@nevi-me (Contributor) left a comment

I've gone through the change and the test cases, LGTM

@nevi-me (Contributor) commented Aug 8, 2021

I'm merging this one, @alamb has opened issues for follow-ups of what he noted while working on this. If there's anything else, it can be addressed separately.

@nevi-me merged commit 4618ef5 into apache:master Aug 8, 2021

@alamb (Contributor Author) commented Aug 8, 2021

Thank you @nevi-me ❤️

@alamb deleted the alamb/parquet_stats_bug branch August 8, 2021 10:30
alamb added a commit that referenced this pull request Aug 8, 2021
* Fix string statistics generation, add tests

* fix Int96 stats test

* Add notes for additional tickets
alamb added a commit that referenced this pull request Aug 9, 2021
Labels: parquet (Changes to the parquet crate)
Linked issue: Incorrect min/max statistics for strings in parquet files
5 participants