-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL #7194
Conversation
Test build #36419 has finished for PR 7194 at commit
|
One subtlety: there can be multiple float / double bitpatterns that are NaN, so clustered sorting based on the bitpatterns is not always sufficient to properly implement COUNT DISTINCT over a set of grouping columns which may contain NaN values. |
@JoshRosen There are also some problems when join or aggregation (NaN is used a part of key in HashMap). I prefer to turn all NaN into null (during inbound conversion and update mutable row). |
I'm going to close this PR for now while we explore whether to do the NaN -> null conversions. If we decide not to go with that approach, then we can re-open and revisit. |
Re-opening this after some discussion with @rxin; I'm going to re-work this so that NaN is treated as the maximum value when sorting. Note that this will not fix some of the more general correctness issues with NaNs that appear in grouping keys, etc., but it will at least prevent crashes. |
(Still in the process of cleaning this up; pushing only so I can view diffs more nicely in GitHub). |
Test build #37683 has finished for PR 7194 at commit
|
Test build #37682 has finished for PR 7194 at commit
|
@@ -97,7 +93,8 @@ class UnsafeExternalSortSuite extends SparkPlanTest with BeforeAndAfterAll { | |||
inputDf, | |||
UnsafeExternalSort(sortOrder, global = true, _: SparkPlan, testSpillFrequency = 23), | |||
Sort(sortOrder, global = true, _: SparkPlan), | |||
sortAnswers = false | |||
sortAnswers = false, | |||
compareStrings = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of comparing as String, could you update the Row.equals() to handle NaN (we need to do this eventually)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point; I'll just roll the equality changes into this patch.
…parability." This reverts commit a30d371.
def testNaN(nan: Expression): Unit = { | ||
checkEvaluation(nan === nan, true) | ||
checkEvaluation(nan <=> nan, true) | ||
// checkEvaluation(nan <= nan, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, this test case fails even though I updated GeneratedOrdering and the interpreted orderings to support our defined NaN semantics. This implies that we may be using the wrong ordering in the implementation of these expressions.
If it turns out that those expressions are mis-handling orderings in a more general way, then I'll open a separate PR to fix that (I suspect that we'll see similar failures when trying to order byte arrays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I also see now that I should remove this test and add NaN
literals to the equalValues
list above.
@JoshRosen , Get it, thanks for explanation. |
Alright, I've updated this to fix the binary comparison expression issues and have also implemented canonicalization of NaN values in UnsafeRow. |
} | ||
case f1: Float => | ||
if (!o2.isInstanceOf[Float] || | ||
(java.lang.Float.isNaN(f1) && !java.lang.Float.isNaN(o2.asInstanceOf[Float]))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should compare o2 and o1, can we call nanSafeCompare() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh; looks like I was a bit sloppy here. Yeah, the few extra comparisons in nanSafeCompare isn't a big deal; I'll update this to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I don't think that we can use nanSafeCompare without breaking existing test code / user code: the old code would allow integers and floats to be compared because Java would handle implicit type conversions. Therefore, for compatibility I think we need to do the same here.
I think it will be clearer to rework this as something like "if f1 is a float and it's NaN, then the other value had better be a NaN float, otherwise fall back to the regular ==
branch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's better.
Test build #37748 has finished for PR 7194 at commit
|
Test build #37750 has finished for PR 7194 at commit
|
Test build #37751 has finished for PR 7194 at commit
|
Test build #37753 has finished for PR 7194 at commit
|
LGTM |
Test build #37754 has finished for PR 7194 at commit
|
Jenkins, retest this please. |
Test build #37877 has finished for PR 7194 at commit
|
Test failures are due to unrelated known flaky tests, so I'm going to merge this into master. |
YAY |
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that
NaN > anything
,NaN == anything
, andNaN < anything
all returnfalse
.Per the design specified in SPARK-9079, we have decided that
NaN = NaN
should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).