[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) #32496

planga82 · 2021-05-11T00:28:54Z

What changes were proposed in this pull request?

Generally, we expect that x = y implies hash(x) = hash(y). However, +0.0 and -0.0 hash to different values for floating-point types.

scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
+-------------------------+--------------------------+
|hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
+-------------------------+--------------------------+
|              -1670924195|                -853646085|
+-------------------------+--------------------------+
scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
+--------------------------------------------+
|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
+--------------------------------------------+
|                                        true|
+--------------------------------------------+

Here is an extract from IEEE 754:

The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases.

From this, I deduce that the hash function must produce the same result for 0 and -0.
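
To make the mismatch concrete, here is a minimal Java sketch. It assumes the hash is computed over the raw bit pattern of the value, as the Double.doubleToLongBits calls in the diff suggest; the class name NegativeZeroBits and the helper bits are illustrative only and are not part of the patch.

```java
final class NegativeZeroBits {
    // Raw IEEE 754 bit pattern of a double, i.e. what a bit-based hash would consume.
    static long bits(double d) {
        return Double.doubleToLongBits(d);
    }

    public static void main(String[] args) {
        double pos = 0.0d;
        double neg = -0.0d;
        // Numeric comparison treats the two zeros as equal.
        System.out.println(pos == neg);                  // true
        // The bit patterns differ in the sign bit, so a hash over the bits differs too.
        System.out.println(bits(pos) == bits(neg));      // false
        System.out.println(Long.toHexString(bits(neg))); // 8000000000000000
    }
}
```

Any hash fed these two bit patterns sees different inputs, which is why equal values end up with different hash results.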

Why are the changes needed?

It is a correctness issue: values that compare equal should hash to the same value.

Does this PR introduce any user-facing change?

This change only affects the hash function applied to the -0 value in float and double types.

How was this patch tested?

Unit testing and manual testing

maropu · 2021-05-12T08:29:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

+      genHashInt(s"Float.floatToIntBits(0.0f)", result) +
+      "}else{" +
+      genHashInt(s"Float.floatToIntBits($input)", result) +
+      "}"


nit style:

protected def genHashFloat(input: String, result: String): String = {
  s"""
     |if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {
     |  ${genHashInt(s"Float.floatToIntBits(0.0f)", result)}
     |} else {
     |  ${genHashInt(s"Float.floatToIntBits($input)", result)}
     |}
   """.stripMargin
}

maropu · 2021-05-12T08:29:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

+      genHashLong(s"Double.doubleToLongBits(0.0d)", result) +
+      "}else{" +
+      genHashLong(s"Double.doubleToLongBits($input)", result) +
+      "}"


maropu · 2021-05-12T08:29:34Z

ok to test

maropu · 2021-05-12T08:30:23Z

Could you re-run the GA tests? It seems some weird errors happened.

SparkQA · 2021-05-12T09:34:02Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42968/

SparkQA · 2021-05-12T13:00:47Z

Test build #138447 has finished for PR 32496 at commit f86fba6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-12T22:28:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42995/

SparkQA · 2021-05-12T22:28:19Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42995/

planga82 · 2021-05-12T22:49:38Z

Hi @maropu, thanks for your comments. I have looked at the weird GA test failures and I think they were caused by the branch name: "feature/spark35207_hashnegativezero" triggers the problem. I ran the tests on the same commit in another branch named "spark35207_hashnegativezero" and they are running without this problem (https://github.com/planga82/sparkFork/actions/runs/836970758).

maropu · 2021-05-12T23:29:47Z

Yea, it seems @ueshin is working on it in #32524.

maropu

This is a bug, but it seems to be long-standing behaviour for the hash functions. So, would it be better to describe this behaviour change in the migration guide so that users can notice it easily? cc: @cloud-fan @dongjoon-hyun @viirya

maropu · 2021-05-13T00:02:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

-    genHashInt(s"Float.floatToIntBits($input)", result)
+  protected def genHashFloat(input: String, result: String): String = {
+    s"""
+       |if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {


Why do we need to use floatToIntBits here? Can we use $input == -0.0f instead?

+1, $input == 0.0f should be good enough.
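
For readers comparing the two conditions, here is a small Java sketch of how they differ. The class and method names (ZeroChecks, isNegativeZeroByBits, isZeroByValue) are hypothetical and only illustrate the comparison semantics; they are not code from the patch.

```java
final class ZeroChecks {
    // Bit-level check, as in the first version of the patch: true only for -0.0f.
    static boolean isNegativeZeroByBits(float f) {
        return Float.floatToIntBits(f) == Float.floatToIntBits(-0.0f);
    }

    // Numeric check, as suggested in review: true for both +0.0f and -0.0f.
    static boolean isZeroByValue(float f) {
        return f == -0.0f;
    }

    public static void main(String[] args) {
        System.out.println(isNegativeZeroByBits(-0.0f)); // true
        System.out.println(isNegativeZeroByBits(0.0f));  // false
        System.out.println(isZeroByValue(-0.0f));        // true
        System.out.println(isZeroByValue(0.0f));         // true
        System.out.println(isZeroByValue(Float.NaN));    // false
    }
}
```

Either condition leads to the same hash here: with the numeric check both zeros take the first branch, while with the bit check positive zero falls through to the else branch, where its own bit pattern is already 0.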

maropu · 2021-05-13T00:03:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

-    genHashLong(s"Double.doubleToLongBits($input)", result)
+  protected def genHashDouble(input: String, result: String): String = {
+    s"""
+      |if(Double.doubleToLongBits($input) == Double.doubleToLongBits(-0.0d)) {


maropu · 2021-05-13T00:03:34Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala

+      assert(XxHash64(Seq(exprs1), 42).eval() == XxHash64(Seq(exprs2), 42).eval())
+      assert(HiveHash(Seq(exprs1)).eval() == HiveHash(Seq(exprs2)).eval())
+    }
+    checkResult(Literal.create(0D, DoubleType), Literal.create(-0D, DoubleType))


Please use checkEvaluation instead.

Could you add float tests here, too?

Oh, thanks! I intended to use float instead of long.

maropu · 2021-05-13T00:12:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala

@@ -654,4 +654,30 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
      }
    }
  }
+
+  test("SPARK-35207: Compute hash consistent between -0.0 and 0.0 doubles with Codegen") {


I think we don't need to add tests here (it's okay to just add tests in HashExprSuite).

+1, if you use checkEvaluation, both codegen and interpreted are checked.

very useful function!

SparkQA · 2021-05-13T02:00:18Z

Test build #138474 has finished for PR 32496 at commit 641629d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

planga82 · 2021-05-13T20:43:13Z

If you think this should go in the migration guide even though it is a bug, I can take care of it in this PR.

dongjoon-hyun · 2021-05-13T20:43:57Z

Thank you for pinging me, @maropu .

dongjoon-hyun · 2021-05-13T20:53:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

+  protected def genHashFloat(input: String, result: String): String = {
+    s"""
+       |if($input == -0.0f) {
+       |  ${genHashInt(s"Float.floatToIntBits(0.0f)", result)}


Although this conveys the semantics, shall we simply use "0" instead of s"Float.floatToIntBits(0.0f)"? We could add a comment explaining the semantics instead.

jshell> Float.floatToIntBits(0.0f)
$1 ==> 0

You are right, I have tested it to be sure.
Murmur3HashFunction.hashInt(java.lang.Float.floatToIntBits(0.0f), 42) == Murmur3HashFunction.hashInt(0, 42)
It is simpler.

dongjoon-hyun · 2021-05-13T20:55:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

+  protected def genHashDouble(input: String, result: String): String = {
+    s"""
+      |if($input == -0.0d) {
+      |  ${genHashLong(s"Double.doubleToLongBits(0.0d)", result)}


ditto. We had better use the simplest constant here instead of s"Double.doubleToLongBits(0.0d)".

In this case, 0L?

The same as the previous point, thanks
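
The shape the review converges on can be sketched in plain Java as follows. This is an illustration under assumptions, not the generated code itself: NormalizedBits, floatBits and doubleBits are hypothetical names, and the returned value stands for whatever is then passed to the int or long hash.

```java
final class NormalizedBits {
    // Both zeros map to the constant 0, which equals Float.floatToIntBits(0.0f).
    static int floatBits(float f) {
        return (f == -0.0f) ? 0 : Float.floatToIntBits(f);
    }

    // Both zeros map to the constant 0L, which equals Double.doubleToLongBits(0.0d).
    static long doubleBits(double d) {
        return (d == -0.0d) ? 0L : Double.doubleToLongBits(d);
    }

    public static void main(String[] args) {
        System.out.println(floatBits(0.0f) == floatBits(-0.0f));    // true
        System.out.println(doubleBits(0.0d) == doubleBits(-0.0d));  // true
        // Non-zero values keep their original bit pattern.
        System.out.println(floatBits(1.0f) == Float.floatToIntBits(1.0f)); // true
    }
}
```

Using the literal constants keeps the generated code short, at the cost of needing a comment to say that 0 stands for the bit pattern of positive zero.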

SparkQA · 2021-05-13T21:18:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43046/

SparkQA · 2021-05-13T21:18:31Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43046/

viirya

Looks ok. We should also update the migration guide. Also, please be more specific in the PR title as this is for float/double type only?

viirya

How about other hash functions, e.g., md5?

planga82 · 2021-05-14T01:04:01Z

Looks ok. We should also update the migration guide. Also, please be more specific in the PR title as this is for float/double type only?

It applies only to floating-point types; I have updated the title.

How about other hash functions, e.g., md5?

The other functions require binary types. Anyway, I have tested them and the problem does not reproduce.

 spark.sql("select md5(bin(cast('0.0' as double))) == md5(bin(cast('-0.0' as double)))").show
 spark.sql("select sha2(bin(cast('0.0' as double)),224) == sha2(bin(cast('-0.0' as double)),224)").show
 spark.sql("select sha1(bin(cast('0.0' as double))) == sha1(bin(cast('-0.0' as double)))").show
 spark.sql("select crc32(bin(cast('0.0' as double))) == crc32(bin(cast('-0.0' as double)))").show

Thanks!
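
One possible reason the checks above do not reproduce the problem, offered as an assumption rather than something confirmed in this thread: if bin() works on an integral value and the double is converted to a long first, both zeros collapse to 0 before md5, sha1, sha2 or crc32 ever see them. The Java sketch below only illustrates that conversion; BinOfZero and binOf are hypothetical names and not Spark code.

```java
final class BinOfZero {
    // Hypothetical stand-in for a bin()-style function that takes a long.
    static String binOf(double d) {
        long asLong = (long) d;             // the conversion drops the sign of zero
        return Long.toBinaryString(asLong); // binary string of the integral value
    }

    public static void main(String[] args) {
        System.out.println(binOf(0.0d));                      // 0
        System.out.println(binOf(-0.0d));                     // 0
        System.out.println(binOf(0.0d).equals(binOf(-0.0d))); // true
    }
}
```

If that assumption holds, the queries compare digests of the same string, so they would return true regardless of how the digest functions treat negative zero.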

SparkQA · 2021-05-14T01:07:35Z

Test build #138525 has finished for PR 32496 at commit 1de9c3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-14T01:32:06Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43053/

SparkQA · 2021-05-14T01:38:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43053/

srowen · 2021-05-14T03:51:42Z

I recall we had some subtle problem where 0 and -0 should not be considered equal, but I don't think it's relevant here.

cloud-fan · 2021-05-14T04:40:19Z

thanks, merging to master!

SparkQA · 2021-05-14T06:11:04Z

Test build #138535 has finished for PR 32496 at commit c123599.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions bot added the SQL label May 11, 2021

Changes implementation

f86fba6

Clean solution

planga82 force-pushed the feature/spark35207_hashnegativezero branch from 28f85ae to f86fba6 Compare May 11, 2021 00:44

maropu reviewed May 12, 2021

Style corrections

641629d

maropu reviewed May 13, 2021

planga82 added 2 commits May 13, 2021 15:14

Merge remote-tracking branch 'upstream/master' into feature/spark3520…

a671ce7

…7_hashnegativezero

Improve solution

1de9c3d

dongjoon-hyun reviewed May 13, 2021

viirya reviewed May 13, 2021

planga82 added 2 commits May 13, 2021 20:22

Update migration guide

869ae7c

Simplify expressions

c123599

github-actions bot added the DOCS label May 14, 2021

planga82 changed the title ~~[SPARK-35207][SQL] Normalize hash function behavior with negative zero~~ [SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) May 14, 2021

cloud-fan approved these changes May 14, 2021

cloud-fan closed this in 9ea55fe May 14, 2021
