-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet string statistics generation #643
Conversation
} | ||
Some(Ordering::Equal) | ||
} | ||
// sort nulls first (consistent with PartialCmp on Option) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous implementation of PartialOrd
seems to have been introduced in 48c3771 in apache/arrow#7622 by
It would be great if @zeevm or @sunchao had any additional context or comment to share.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! the original PR is apache/arrow#7586 but yea we missed this in the review and I think the existing logic is incorrect.
let stats = statistics_roundtrip::<BoolType>(&[true, false, false, true]); | ||
assert!(stats.has_min_max_set()); | ||
// should this be BooleanStatistics?? | ||
if let Statistics::Int32(stats) = stats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note 1: it is strange that a Boolean
column produces Int32Stats
(and not BooleanStats
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #659 to track
|
||
let stats = statistics_roundtrip::<FixedLenByteArrayType>(&input); | ||
assert!(stats.has_min_max_set()); | ||
// should it be FixedLenByteArray? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note 2: it is strange that a FixedLenByteArray
column produces ByteArrayStats
(and not FixedLenByteArrayStats
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #660
parquet/src/column/writer.rs
Outdated
} | ||
} | ||
|
||
// // TODO test int 96 stats -- this was failing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note 3: I have no idea if this test is incorrect or if there is something wrong with the Int96 implementation. As Int96 seems to be deprecated, I don't plan to spend much more time on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still keep INT96 for backward compatibility (it is still used today). In parquet-mr this is done by converting the binary into signed ints (see here).
Codecov Report
@@ Coverage Diff @@
## master #643 +/- ##
==========================================
+ Coverage 82.47% 82.52% +0.04%
==========================================
Files 167 168 +1
Lines 46454 47280 +826
==========================================
+ Hits 38314 39016 +702
- Misses 8140 8264 +124
Continue to review full report at Codecov.
|
@@ -1356,7 +1351,7 @@ mod tests { | |||
let ba4 = ByteArray::from(vec![]); | |||
let ba5 = ByteArray::from(vec![2, 2, 3]); | |||
|
|||
assert!(ba1 > ba2); | |||
assert!(ba1 < ba2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for the review -- I plan to polish this one up tomorrow |
7c573c8
to
d0c3e93
Compare
} | ||
|
||
#[test] | ||
fn test_int96_statistics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed Int96 stats test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through the change and the test cases, LGTM
I'm merging this one, @alamb has opened issues for follow-ups of what he noted while working on this. If there's anything else, it can be addressed separately. |
Thank you @nevi-me ❤️ |
* Fix string statistics generation, add tests * fix Int96 stats test * Add notes for additional tickets
Which issue does this PR close?
Closes #641
Rationale for this change
Statistics for strings (aka ByteArrays in the parquet format) are not calculated correctly
For example, given the strings "z", and "aa", the parquet writer will determine that "z" is the minimum value (because "z" is shorter).
What changes are included in this PR?
Correctness
I verified that the python implementation compares the strings lexicographically as proposed in this PR rather than the current behavior (see #641 (comment))
I also messed around with how partial cmp works with
std::Option
and I believe this implementation is consistent. You can see it for yourself in this playgroundNotes
I could not figure out how to test for "Null" values in statistics (though I tested end to end using the ArrowWriter and confirmed null values didn't appear in the statistics, as expected)
Also, there are three things testing revealed:
Are there any user-facing changes?
correct statistics