[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL #7194

JoshRosen · 2015-07-02T18:03:38Z

This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that NaN > anything, NaN == anything, and NaN < anything all return false.
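To illustrate the failure mode described above, here is a minimal sketch (not code from this patch) of a comparator built only from `<` and `>`. The class name `NaiveNaNComparison` and the constant `NAIVE` are hypothetical names chosen for this example. Because every primitive comparison against NaN is false, the comparator reports NaN as "equal" to every value, which is inconsistent with the ordering of the non-NaN values; this is the kind of inconsistency TimSort can detect.

```java
import java.util.Comparator;

final class NaiveNaNComparison {
    // Hypothetical comparator that relies only on < and >.
    static final Comparator<Double> NAIVE = (a, b) -> {
        double x = a;
        double y = b;
        if (x < y) return -1;
        if (x > y) return 1;
        // Reached for equal values, but also whenever either side is NaN.
        return 0;
    };

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan < 1.0);   // false
        System.out.println(nan > 1.0);   // false
        System.out.println(nan == nan);  // false

        // NaN compares as "equal" to both 1.0 and 2.0, yet 1.0 < 2.0,
        // so "equality" under this comparator is not transitive.
        System.out.println(NAIVE.compare(1.0, nan)); // 0
        System.out.println(NAIVE.compare(nan, 2.0)); // 0
        System.out.println(NAIVE.compare(1.0, 2.0)); // -1
    }
}
```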

Per the design specified in SPARK-9079, we have decided that NaN = NaN should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).
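The semantics above (NaN equals NaN, and NaN sorts after every other value) can be sketched as a comparison function. This is an illustrative sketch only; the class name `NaNLastOrdering` is hypothetical, and the method body is an assumption about what a NaN-safe comparison could look like, not the code merged in this patch.

```java
import java.util.Arrays;
import java.util.Comparator;

final class NaNLastOrdering {
    // Sketch of a NaN-safe comparison: NaN == NaN, and NaN is larger than
    // any other value, including positive infinity.
    static int nanSafeCompare(double x, double y) {
        boolean xIsNaN = Double.isNaN(x);
        boolean yIsNaN = Double.isNaN(y);
        if (xIsNaN || yIsNaN) {
            if (xIsNaN && yIsNaN) return 0;
            return xIsNaN ? 1 : -1;
        }
        if (x < y) return -1;
        if (x > y) return 1;
        return 0;
    }

    static final Comparator<Double> NAN_LAST = NaNLastOrdering::nanSafeCompare;

    public static void main(String[] args) {
        Double[] values = {
            Double.NaN, 1.0, Double.POSITIVE_INFINITY,
            Double.NEGATIVE_INFINITY, Double.NaN, 0.0
        };
        Arrays.sort(values, NAN_LAST);
        // Ascending order places both NaNs after +Infinity.
        System.out.println(Arrays.toString(values));
    }
}
```

Unlike the primitive operators, this function defines a total order, so it is safe to hand to a sort routine that checks the comparator contract.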

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
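As a rough sketch of what canonicalization on write might look like: every NaN is replaced with the single canonical NaN constant before its bits are stored, so that two equal NaNs are also byte-for-byte equal. The class `NaNCanonicalizer` and its methods are hypothetical names for this example and do not reflect the actual UnsafeRow setters.

```java
final class NaNCanonicalizer {
    // Replace any NaN with the canonical Double.NaN (bits 0x7ff8000000000000).
    static double canonicalize(double value) {
        return Double.isNaN(value) ? Double.NaN : value;
    }

    // Replace any NaN with the canonical Float.NaN (bits 0x7fc00000).
    static float canonicalize(float value) {
        return Float.isNaN(value) ? Float.NaN : value;
    }

    // The raw bits that would be written after canonicalization.
    static long storedBits(double value) {
        return Double.doubleToRawLongBits(canonicalize(value));
    }

    public static void main(String[] args) {
        // A NaN built from a non-canonical bit pattern.
        double oddNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(Long.toHexString(storedBits(oddNaN)));
        System.out.println(Long.toHexString(storedBits(Double.NaN)));
    }
}
```

With this in place, binary equality on the stored bytes agrees with the NaN = NaN semantics, at the cost of discarding any NaN payload bits.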

SparkQA · 2015-07-02T18:31:43Z

Test build #36419 has finished for PR 7194 at commit 630ebc5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-03T05:39:55Z

One subtlety: there can be multiple float / double bit patterns that are NaN, so clustered sorting based on the bit patterns is not always sufficient to properly implement COUNT DISTINCT over a set of grouping columns which may contain NaN values.
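A small sketch of this subtlety, under the assumption that values are grouped by their stored 64-bit patterns: three different NaN patterns count as three distinct values when compared by raw bits, but as one value once collapsed through `Double.doubleToLongBits`, which maps every NaN to the same pattern. The class `NaNDistinctCount` and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class NaNDistinctCount {
    // Counts distinct values by comparing stored bit patterns as-is.
    static int distinctByRawBits(long[] storedBits) {
        Set<Long> seen = new HashSet<>();
        for (long bits : storedBits) {
            seen.add(bits);
        }
        return seen.size();
    }

    // Counts distinct values after collapsing every NaN to one bit pattern.
    static int distinctByCanonicalBits(long[] storedBits) {
        Set<Long> seen = new HashSet<>();
        for (long bits : storedBits) {
            seen.add(Double.doubleToLongBits(Double.longBitsToDouble(bits)));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        long[] column = {
            0x7ff8000000000000L,          // canonical NaN
            0x7ff8000000000001L,          // NaN with a different payload
            0xfff8000000000000L,          // NaN with the sign bit set
            Double.doubleToRawLongBits(1.0)
        };
        System.out.println(distinctByRawBits(column));       // 4
        System.out.println(distinctByCanonicalBits(column)); // 2
    }
}
```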

davies · 2015-07-13T23:32:34Z

@JoshRosen There are also some problems with joins and aggregations (where NaN is used as part of a key in a HashMap). I would prefer to turn all NaN values into null (during inbound conversion and when updating a mutable row).

JoshRosen · 2015-07-14T21:55:24Z

I'm going to close this PR for now while we explore whether to do the NaN -> null conversions. If we decide not to go with that approach, then we can re-open and revisit.

JoshRosen · 2015-07-18T01:32:39Z

Re-opening this after some discussion with @rxin; I'm going to re-work this so that NaN is treated as the maximum value when sorting. Note that this will not fix some of the more general correctness issues with NaNs that appear in grouping keys, etc., but it will at least prevent crashes.

JoshRosen · 2015-07-18T01:45:39Z

(Still in the process of cleaning this up; pushing only so I can view diffs more nicely in GitHub).

SparkQA · 2015-07-18T02:01:45Z

Test build #37683 has finished for PR 7194 at commit d907b5b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NaiveBayes(override val uid: String)
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
- class PCA(JavaEstimator, HasInputCol, HasOutputCol):
- class PCAModel(JavaModel):
- abstract class Expression extends TreeNode[Expression] with Product
- trait Generator extends Expression
- abstract class UnaryLogExpression(f: Double => Double, name: String)
- case class Conv(numExpr: Expression, fromBaseExpr: Expression, toBaseExpr: Expression)
- case class Log(child: Expression) extends UnaryLogExpression(math.log, "LOG")
- case class Log10(child: Expression) extends UnaryLogExpression(math.log10, "LOG10")
- case class Log1p(child: Expression) extends UnaryLogExpression(math.log1p, "LOG1P")
- trait NamedExpression extends Expression
- abstract class Attribute extends LeafExpression with NamedExpression
- case class IsNaN(child: Expression) extends UnaryExpression
- abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging with Product
- abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Product with Serializable
- trait HashSemiJoin

SparkQA · 2015-07-18T03:28:17Z

Test build #37682 has finished for PR 7194 at commit 630ebc5.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

davies · 2015-07-18T19:34:45Z

sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeExternalSortSuite.scala

@@ -97,7 +93,8 @@ class UnsafeExternalSortSuite extends SparkPlanTest with BeforeAndAfterAll {
        inputDf,
        UnsafeExternalSort(sortOrder, global = true, _: SparkPlan, testSpillFrequency = 23),
        Sort(sortOrder, global = true, _: SparkPlan),
-        sortAnswers = false
+        sortAnswers = false,
+        compareStrings = true


Instead of comparing as String, could you update the Row.equals() to handle NaN (we need to do this eventually)?

Yeah, good point; I'll just roll the equality changes into this patch.

JoshRosen · 2015-07-19T02:33:42Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala

+    def testNaN(nan: Expression): Unit = {
+      checkEvaluation(nan === nan, true)
+      checkEvaluation(nan <=> nan, true)
+//      checkEvaluation(nan <= nan, true)


Interestingly, this test case fails even though I updated GeneratedOrdering and the interpreted orderings to support our defined NaN semantics. This implies that we may be using the wrong ordering in the implementation of these expressions.

If it turns out that those expressions are mis-handling orderings in a more general way, then I'll open a separate PR to fix that (I suspect that we'll see similar failures when trying to order byte arrays).

Ah, I also see now that I should remove this test and add NaN literals to the equalValues list above.
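One way to read the failure discussed above: if a comparison expression such as `<=` is evaluated with the primitive operator, it ignores whatever ordering has been defined for NaN. A hedged sketch of deriving the predicates from an ordering instead is shown below. It uses `Double.compare`, which treats NaN as equal to itself and greater than every other value, purely as a stand-in for the patch's ordering; note that `Double.compare` also distinguishes `-0.0` from `0.0`, which may not match the intended SQL semantics. The class `NaNAwarePredicates` is a hypothetical name.

```java
final class NaNAwarePredicates {
    // Predicates derived from a total ordering rather than primitive operators.
    static boolean lessThanOrEqual(double a, double b) {
        return Double.compare(a, b) <= 0;
    }

    static boolean equalTo(double a, double b) {
        return Double.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan <= nan);                // false (primitive operator)
        System.out.println(lessThanOrEqual(nan, nan)); // true  (ordering-based)
        System.out.println(equalTo(nan, nan));         // true
        System.out.println(lessThanOrEqual(nan, 1.0)); // false (NaN is the largest value)
    }
}
```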

yjshen · 2015-07-19T02:41:49Z

@JoshRosen, got it, thanks for the explanation.

JoshRosen · 2015-07-19T03:33:28Z

Alright, I've updated this to fix the binary comparison expression issues and have also implemented canonicalization of NaN values in UnsafeRow.

davies · 2015-07-19T04:22:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

+            }
+          case f1: Float =>
+            if (!o2.isInstanceOf[Float] ||
+              (java.lang.Float.isNaN(f1) && !java.lang.Float.isNaN(o2.asInstanceOf[Float]))) {


We should compare o2 and o1; can we call nanSafeCompare()?

Argh; looks like I was a bit sloppy here. Yeah, the few extra comparisons in nanSafeCompare aren't a big deal; I'll update this to use that.

Actually, I don't think that we can use nanSafeCompare without breaking existing test code / user code: the old code would allow integers and floats to be compared because Java would handle implicit type conversions. Therefore, for compatibility I think we need to do the same here.

I think it will be clearer to rework this as something like "if f1 is a float and it's NaN, then the other value had better be a NaN float, otherwise fall back to the regular == branch".

That's better.
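The reworked check described in this thread could be sketched roughly as follows. This is not the code from Row.scala: the class `NaNAwareEquals` and method `valueEquals` are hypothetical, and the numeric fallback (comparing via `doubleValue()`) is an assumption standing in for the "regular == branch", chosen so that an integer and a float holding the same value still compare as equal. Very large `long` values may lose precision in that fallback.

```java
import java.util.Objects;

final class NaNAwareEquals {
    static boolean valueEquals(Object o1, Object o2) {
        // If o1 is a NaN float, the other value must also be a NaN float.
        if (o1 instanceof Float && ((Float) o1).isNaN()) {
            return o2 instanceof Float && ((Float) o2).isNaN();
        }
        // Same rule for doubles.
        if (o1 instanceof Double && ((Double) o1).isNaN()) {
            return o2 instanceof Double && ((Double) o2).isNaN();
        }
        // Assumed stand-in for the regular == branch: numeric comparison
        // that tolerates mixed integer / floating-point operands.
        if (o1 instanceof Number && o2 instanceof Number) {
            return ((Number) o1).doubleValue() == ((Number) o2).doubleValue();
        }
        return Objects.equals(o1, o2);
    }

    public static void main(String[] args) {
        System.out.println(valueEquals(Float.NaN, Float.NaN)); // true
        System.out.println(valueEquals(Float.NaN, 1.0f));      // false
        System.out.println(valueEquals(1.0f, Float.NaN));      // false
        System.out.println(valueEquals(1, 1.0f));              // true
    }
}
```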

SparkQA · 2015-07-19T04:31:57Z

Test build #37748 has finished for PR 7194 at commit 7fe67af.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-19T05:21:35Z

Test build #37750 has finished for PR 7194 at commit fbb2a29.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-19T05:27:36Z

Test build #37751 has finished for PR 7194 at commit a702e2e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable
- case class UnresolvedFunction(name: String, children: Seq[Expression])
- case class UnresolvedStar(table: Option[String]) extends Star with Unevaluable
- case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Unevaluable
- case class UnresolvedAlias(child: Expression)
- case class Cast(child: Expression, dataType: DataType)
- trait Unevaluable
- case class SortOrder(child: Expression, direction: SortDirection)
- trait AggregateExpression extends Expression with Unevaluable
- case class Abs(child: Expression)
- trait CodegenFallback
- case class CreateArray(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CreateStruct(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CreateNamedStruct(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CurrentDate() extends LeafExpression with CodegenFallback
- case class CurrentTimestamp() extends LeafExpression with CodegenFallback
- case class Explode(child: Expression) extends UnaryExpression with Generator with CodegenFallback
- case class Literal protected (value: Any, dataType: DataType)
- case class Hex(child: Expression)
- case class Unhex(child: Expression)
- case class PrettyAttribute(name: String) extends Attribute with Unevaluable
- case class In(value: Expression, list: Seq[Expression]) extends Predicate with CodegenFallback
- case class NewSet(elementType: DataType) extends LeafExpression with CodegenFallback
- case class AddItemToSet(item: Expression, set: Expression)
- case class CombineSets(left: Expression, right: Expression)
- case class CountSet(child: Expression) extends UnaryExpression with CodegenFallback
- case class Upper(child: Expression)
- case class StringFormat(children: Expression*) extends Expression with CodegenFallback
- case class StringSpace(child: Expression)
- case class Ascii(child: Expression)
- case class Base64(child: Expression)
- case class UnBase64(child: Expression)

SparkQA · 2015-07-19T07:06:47Z

Test build #37753 has finished for PR 7194 at commit 88bd73c.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable
- case class UnresolvedFunction(name: String, children: Seq[Expression])
- case class UnresolvedStar(table: Option[String]) extends Star with Unevaluable
- case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Unevaluable
- case class UnresolvedAlias(child: Expression)
- case class Cast(child: Expression, dataType: DataType)
- trait Unevaluable
- case class SortOrder(child: Expression, direction: SortDirection)
- trait AggregateExpression extends Expression with Unevaluable
- case class Abs(child: Expression)
- trait CodegenFallback
- case class CreateArray(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CreateStruct(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CreateNamedStruct(children: Seq[Expression]) extends Expression with CodegenFallback
- case class CurrentDate() extends LeafExpression with CodegenFallback
- case class CurrentTimestamp() extends LeafExpression with CodegenFallback
- case class Explode(child: Expression) extends UnaryExpression with Generator with CodegenFallback
- case class Literal protected (value: Any, dataType: DataType)
- case class Hex(child: Expression)
- case class Unhex(child: Expression)
- case class PrettyAttribute(name: String) extends Attribute with Unevaluable
- case class In(value: Expression, list: Seq[Expression]) extends Predicate with CodegenFallback
- case class NewSet(elementType: DataType) extends LeafExpression with CodegenFallback
- case class AddItemToSet(item: Expression, set: Expression)
- case class CombineSets(left: Expression, right: Expression)
- case class CountSet(child: Expression) extends UnaryExpression with CodegenFallback
- case class Upper(child: Expression)
- case class StringFormat(children: Expression*) extends Expression with CodegenFallback
- case class StringSpace(child: Expression)
- case class Ascii(child: Expression)
- case class Base64(child: Expression)
- case class UnBase64(child: Expression)

davies · 2015-07-19T07:11:16Z

LGTM

SparkQA · 2015-07-19T07:20:31Z

Test build #37754 has finished for PR 7194 at commit 983d4fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-21T00:21:05Z

Jenkins, retest this please.

SparkQA · 2015-07-21T02:23:28Z

Test build #37877 has finished for PR 7194 at commit 983d4fc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-21T05:33:53Z

Test failures are due to unrelated known flaky tests, so I'm going to merge this into master.

rxin · 2015-07-21T05:35:25Z

YAY

JoshRosen added 10 commits July 1, 2015 20:18

Add random data generator test utilities to Spark SQL.

d2b4a4a

Move code to Catalyst package.

ab76cbd

Infinity and NaN are interesting.

5acdd5c

Generate doubles and floats over entire possible range.

b55875a

Add regression test for SPARK-8782 (ORDER BY NULL)

7d5c13e

Add very generic test for ordering

e7dc4fb

Fix ORDER BY NULL

f9efbb5

Add regression test for NaN sorting issue

13fc06a

Re-enable NaNs in CodeGenerationSuite to produce more regression tests

9bf195a

Specify an ordering for NaN values.

630ebc5

JoshRosen closed this Jul 14, 2015

JoshRosen reopened this Jul 18, 2015

JoshRosen changed the title ~~[SPARK-8797] [WIP] Fix comparison of NaN values in Spark SQL~~ [SPARK-8797] [SPARK-9146] [WIP] Fix comparison of NaN values in Spark SQL Jul 18, 2015

Merge remote-tracking branch 'origin/master' into nan

d907b5b

JoshRosen added 7 commits July 17, 2015 19:04

Fix compilation of CodeGenerationSuite

5b88b2b

Add failing test for new NaN comparision ordering

b20837b

Update randomized test to use ScalaTest's assume()

8d7be61

Change ordering so that NaN is maximum value.

bfca524

Stop filtering NaNs in UnsafeExternalSortSuite

42a1ad5

Fix bug in Double / Float ordering

6f03f85

Compare rows' string representations to work around NaN incomparability.

a30d371

Fix prefix comparision for NaNs

a2ba2e7

davies reviewed Jul 18, 2015

JoshRosen added 2 commits July 18, 2015 18:37

Revert "Compare rows' string representations to work around NaN incom…

58bad2c

…parability." This reverts commit a30d371.

Support NaN == NaN (SPARK-9145)

7fe67af

JoshRosen changed the title ~~[SPARK-8797] [SPARK-9146] Fix comparison of NaN values in Spark SQL~~ [SPARK-8797] [SPARK-9146] [SPARK-9145] Support NaN ordering and equality comparisons in Spark SQL Jul 19, 2015

JoshRosen reviewed Jul 19, 2015

Uncomment failing tests

b31eb19

JoshRosen added 5 commits July 18, 2015 20:03

Fold NaN test into existing test framework

c1fd4fe

Fix NaN comparisons in BinaryComparison expressions

fbb2a29

Merge remote-tracking branch 'origin/master' into nan

fe629ae

Normalize NaNs in UnsafeRow

a7267cf

normalization -> canonicalization

a702e2e

JoshRosen changed the title ~~[SPARK-8797] [SPARK-9146] [SPARK-9145] Support NaN ordering and equality comparisons in Spark SQL~~ [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL Jul 19, 2015

davies reviewed Jul 19, 2015

JoshRosen added 2 commits July 18, 2015 21:56

Fix Row.equals()

88bd73c

Merge remote-tracking branch 'origin/master' into nan

983d4fc

asfgit closed this in c032b0b Jul 21, 2015

JoshRosen deleted the nan branch August 29, 2016 19:24
