-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) #32496
[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) #32496
Conversation
Clean solution
28f85ae
to
f86fba6
Compare
genHashInt(s"Float.floatToIntBits(0.0f)", result) + | ||
"}else{" + | ||
genHashInt(s"Float.floatToIntBits($input)", result) + | ||
"}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit style:
protected def genHashFloat(input: String, result: String): String = {
s"""
|if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {
| ${genHashInt(s"Float.floatToIntBits(0.0f)", result)}
|} else {
| ${genHashInt(s"Float.floatToIntBits($input)", result)}
|}
""".stripMargin
}
genHashLong(s"Double.doubleToLongBits(0.0d)", result) + | ||
"}else{" + | ||
genHashLong(s"Double.doubleToLongBits($input)", result) + | ||
"}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
ok to test |
Could you re-invoke the GA tests? It seems the weird errors happened. |
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #138447 has finished for PR 32496 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Hi @maropu , Thank for your comments. I have looked at the weird problems of the GA tests and I think they were because of the name of the branch. This name "feature/spark35207_hashnegativezero" causes the problem. I execute tests in the same commit but in another branch with the name "spark35207_hashnegativezero" and they are on going without this problem (https://github.com/planga82/sparkFork/actions/runs/836970758) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug, but it seems this is a long-standing behaviour for the hash functions. So, it's better to describe this behaviour change in the migration guide so that users can notice it easily? cc: @cloud-fan @dongjoon-hyun @viirya
genHashInt(s"Float.floatToIntBits($input)", result) | ||
protected def genHashFloat(input: String, result: String): String = { | ||
s""" | ||
|if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to use floatToIntBits
here? $input == -0.0f
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, $input == 0.0f
should be good enough.
genHashLong(s"Double.doubleToLongBits($input)", result) | ||
protected def genHashDouble(input: String, result: String): String = { | ||
s""" | ||
|if(Double.doubleToLongBits($input) == Double.doubleToLongBits(-0.0d)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
assert(XxHash64(Seq(exprs1), 42).eval() == XxHash64(Seq(exprs2), 42).eval()) | ||
assert(HiveHash(Seq(exprs1)).eval() == HiveHash(Seq(exprs2)).eval()) | ||
} | ||
checkResult(Literal.create(0D, DoubleType), Literal.create(-0D, DoubleType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use checkEvaluation
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add float tests here, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! thanks, I pretend to put float instead long
@@ -654,4 +654,30 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession | |||
} | |||
} | |||
} | |||
|
|||
test("SPARK-35207: Compute hash consistent between -0.0 and 0.0 doubles with Codegen") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to add tests here (It's okay just to add tests in HashExprSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, if you use checkEvaluation
, both codegen and interpreted are checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very useful function!
Test build #138474 has finished for PR 32496 at commit
|
If you consider, even though it is a bug, that we should put it in the migration guide I can take care of it in this PR. |
Thank you for pinging me, @maropu . |
protected def genHashFloat(input: String, result: String): String = { | ||
s""" | ||
|if($input == -0.0f) { | ||
| ${genHashInt(s"Float.floatToIntBits(0.0f)", result)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this has the semantic, shall we use simply "0"
instead of s"Float.floatToIntBits(0.0f)"
? We may add some comment for the semantic instead.
jshell> Float.floatToIntBits(0.0f)
$1 ==> 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I have tested it to be sure.
Murmur3HashFunction.hashInt(java.lang.Float.floatToIntBits(0.0f), 42) == Murmur3HashFunction.hashInt(0, 42)
It is simpler.
protected def genHashDouble(input: String, result: String): String = { | ||
s""" | ||
|if($input == -0.0d) { | ||
| ${genHashLong(s"Double.doubleToLongBits(0.0d)", result)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. We had better use the simplest constant here instead of s"Double.doubleToLongBits(0.0d)"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, 0L
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as the previous point, thanks
Kubernetes integration test starting |
Kubernetes integration test status failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok. We should also update the migration guide. Also, please be more specific in the PR title as this is for float/double type only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about other hash functions, e.g., md5?
Applies only to floating point types, I have updated the title
The other functions require binary types, anyway, I have tested it and it does not reproduce the problem.
Thanks! |
Test build #138525 has finished for PR 32496 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
I recall we had some subtle problem where 0 and -0 should not be considered equal, but I don't think it's relevant here. |
thanks, merging to master! |
Test build #138535 has finished for PR 32496 at commit
|
What changes were proposed in this pull request?
Generally, we would expect that x = y => hash( x ) = hash( y ). However +-0 hash to different values for floating point types.
Here is an extract from IEEE 754:
From this, I deduce that the hash function must produce the same result for 0 and -0.
Why are the changes needed?
It is a correctness issue
Does this PR introduce any user-facing change?
This changes only affect to the hash function applied to -0 value in float and double types
How was this patch tested?
Unit testing and manual testing