-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix NaN handling in parquet statistics #256
Conversation
// simple "isNaN" check that works for all types | ||
if val == val { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce some generic T::is_nan
function into the data type abstraction layer (that basically returns False
for every non-float/double type and calls .is_nan()
for float/double) or is this local workaround good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's necessary. It'll just be adding overhead for us.
The compiler can figure out what we're trying to do, and will optimise out the branch for anything that's not a floating point number.
Look at this example: https://godbolt.org/z/z7M3Mfqzz
The is_nan::<i32>()
always returns false
, and the ucomisd
does the floating point comparison
Codecov Report
@@ Coverage Diff @@
## master #256 +/- ##
==========================================
+ Coverage 82.52% 82.53% +0.01%
==========================================
Files 162 162
Lines 43672 43786 +114
==========================================
+ Hits 36039 36139 +100
- Misses 7633 7647 +14
Continue to review full report at Codecov.
|
acd4f52
to
fc907e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crepererum ! |
fc907e5
to
9c60e29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue does this PR close?
Closes #255.
Rationale for this change
Fixes NaN handling in parquet statistics. This is in line with the C++ stack:
https://github.com/apache/arrow/blob/b3e43987c47b2f01b204a2d954f882f7161616ef/cpp/src/parquet/statistics_test.cc#L1000-L1043
What changes are included in this PR?
Filter out NaN values from statistics + tests.
Are there any user-facing changes?
Yes: formally NaN were included in the stats but at "random" (i.e. when the data started with an NaN than the min/max values are NaN, otherwise min/max are non-NaN). Now the behavior is: NaN are excluded always. If the only NaN values are present, then min/max are unset.