Make consistent behavior on zeros equality on floating point types #3510
I'm not sure about this, as it means we are no longer comparing with respect to a standard predicate but one of our own devising. Why special-case zero, and not other values like NaNs? I'd also be interested to know what impact this has on benchmarks. FWIW, if we do make this change, we will need to normalise within the row format, and potentially in other places too. Nothing insurmountable, just noting it.
  {
      let left: PrimitiveArray<T> = PrimitiveArray::from(left.data().clone());
      let right: PrimitiveArray<T> = PrimitiveArray::from(right.data().clone());
-     Box::new(move |i, j| left.value(i).cmp(&right.value(j)))
+     Box::new(move |i, j| left.value(i).compare(right.value(j)))
👍 regardless this is a good change
NaNs are treated as equal by total ordering. I guess total ordering has to give a comprehensive ordering of all possible floating-point values. But in practical computation, we don't actually distinguish positive and negative zeros.
Not in the ordering we use currently: values are ordered based on their constituent bits, so NaNs with different byte representations will not compare equal.
We have a NaN equality test to verify that they are equal. I also did a quick verification in the Rust playground:
fn main() {
    let a = f32::NAN;
    let b = f32::NAN;
    println!("a == b: {}", a.to_bits() == b.to_bits());
}
Output:
a == b: true
f32::NAN always yields the same NaN bytes; if you obtain a NaN by other means, so that the bit representations differ, you will see the difference. Edit: in fact comparing
I see. That explains why these NaNs are equal. I roughly remember from JVM experience that NaN values' bits can differ, so I was a bit surprised to see they are equal in the test/playground above. If other bit patterns in Rust are also seen as NaN, then they are not guaranteed to be equal. NaNs should be treated as equal in computation too, like zeros. Either we add a NaN-specific condition like the one for zero, or we avoid such things here and require users to handle it before calling arrow kernels, for example by replacing negative zeros with positive zeros and normalizing NaNs to the standard f32::NAN (likewise for f64 and f16).
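To make the point about differing NaN bit patterns concrete, here is a minimal sketch using only the standard library. The helpers `canonical_nan` and `compare_normalized` are hypothetical names for illustration and are not part of arrow; the two bit patterns are simply two quiet NaNs chosen to differ in their payload.

```rust
use std::cmp::Ordering;

/// Hypothetical helper (not an arrow API): collapse every NaN to `f32::NAN`.
fn canonical_nan(x: f32) -> f32 {
    if x.is_nan() { f32::NAN } else { x }
}

/// Hypothetical helper: total ordering after NaN normalization.
fn compare_normalized(a: f32, b: f32) -> Ordering {
    canonical_nan(a).total_cmp(&canonical_nan(b))
}

fn main() {
    // Two quiet NaNs with the same sign but different payload bits.
    let a = f32::from_bits(0x7fc0_0000);
    let b = f32::from_bits(0x7fc0_0001);
    assert!(a.is_nan() && b.is_nan());
    assert_ne!(a.to_bits(), b.to_bits());

    // `total_cmp` orders by bit pattern, so these NaNs are not equal.
    assert_eq!(a.total_cmp(&b), Ordering::Less);

    // After normalization both sides are the same NaN and compare equal.
    assert_eq!(compare_normalized(a, b), Ordering::Equal);

    // Non-NaN values are unaffected by the normalization.
    assert_eq!(compare_normalized(1.0, 2.0), Ordering::Less);
}
```

Normalizing up front keeps the comparison itself a plain `total_cmp`, which is the second option described in the comment above.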
arrow-array/src/arithmetic.rs
Outdated
if self.abs() == $zero && rhs.abs() == $zero {
    // `total_cmp` treats positive zero and negative zero as different.
    // But for computation system, it usually treats them as equal.
    Ordering::Equal
} else {
    <$t>::total_cmp(&self, &rhs)
}
I removed these changes.
/// Note that totalOrder treats positive and negative zeros are different. If it is necessary
/// to treat them as equal, please normalize zeros before calling this kernel.
I updated these docs to make the behavior clear to users.
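For readers wondering what "normalize zeros before calling this kernel" could look like, the following is a rough sketch on plain `f64` values using only the standard library. `normalize_zero` is a hypothetical helper for illustration, not something arrow provides, and applying it to an arrow array would need the array APIs rather than a `Vec`.

```rust
use std::cmp::Ordering;

/// Hypothetical helper (not an arrow API): map -0.0 to +0.0 and leave
/// every other value, including NaN, untouched.
fn normalize_zero(x: f64) -> f64 {
    // IEEE 754 `==` treats -0.0 and +0.0 as equal, so this branch
    // catches both zeros and returns the positive one.
    if x == 0.0 { 0.0 } else { x }
}

fn main() {
    // Under the total order, -0.0 sorts strictly before +0.0.
    assert_eq!((-0.0f64).total_cmp(&0.0), Ordering::Less);

    // Normalize the input once, then compare with `total_cmp` as usual.
    let mut values: Vec<f64> = vec![0.0, -0.0, 1.5, -2.0];
    for v in values.iter_mut() {
        *v = normalize_zero(*v);
    }
    assert_eq!(values[0].total_cmp(&values[1]), Ordering::Equal);
    assert!(values[1].is_sign_positive());
}
```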
assert_eq!(Ordering::Less, (cmp)(0, 1));
assert_eq!(Ordering::Greater, (cmp)(1, 0));
build_compare's behavior when comparing zeros was inconsistent with the comparison kernels. I changed it to be consistent.
Thank you
Looks good to me! Just a note that the min/max aggregation kernels also use a different definition, I think following the Postgres behavior of considering NaN to be greater than any other value.
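As an illustration of how a "NaN is greater than any other value" rule differs from the total order, here is a small sketch. It is not the code of the min/max kernels; `nan_greatest_cmp` and `max_nan_greatest` are hypothetical names, and treating all NaNs as equal to each other is an assumption of this sketch.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator: every NaN equals every other NaN and is
/// greater than every non-NaN value.
fn nan_greatest_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so `partial_cmp` always returns `Some`.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Hypothetical max aggregation built on the comparator above.
fn max_nan_greatest(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(|a, b| nan_greatest_cmp(*a, *b))
}

fn main() {
    // A NaN with the sign bit set sorts below -inf under `total_cmp`...
    let neg_nan = f64::from_bits(0xfff8_0000_0000_0000);
    assert_eq!(neg_nan.total_cmp(&f64::NEG_INFINITY), Ordering::Less);
    // ...but above everything under the NaN-greatest rule.
    assert_eq!(nan_greatest_cmp(neg_nan, f64::INFINITY), Ordering::Greater);

    assert!(max_nan_greatest(&[1.0, f64::NAN, 3.0]).unwrap().is_nan());
    assert_eq!(max_nan_greatest(&[1.0, 3.0, 2.0]), Some(3.0));
}
```

Note that this comparator also treats -0.0 and +0.0 as equal, since it falls back to `partial_cmp` for non-NaN values.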
Yeah, it is honestly baffling to me that they took so long to define a total ordering predicate; we now have a standard but few people follow it 😅
Benchmark runs are scheduled for baseline = 8688dba and contender = d49cd21. d49cd21 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3509.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?