-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Match Spark's NaN handling in collect_set #8909
Conversation
cc038d8
to
04e62e3
Compare
|
build |
// Because NaNs in nested types are not supported yet. | ||
// See https://github.com/NVIDIA/spark-rapids/issues/8808 | ||
if (c.child.dataType != FloatType && c.child.dataType != DoubleType && | ||
isOrContainsFloatingPoint(c.child.dataType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check performed a recursive check for floats/doubles at all nested levels, but your changes seem to say that Array[Double]
and Array[Float]
are UNEQUAL
. Should your code handle Array[Array[Double]]
(and any level of nesting, including things like Struct[Array[Double]]
for example)? Are those cases UNEQUAL
or are they somehow compared differently. If so we should add comments clarifying why an array/struct a 2nd+ level is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes it's a bit confusing. I added a comment.
In GpuCollectBase
we have:
def child: Expression
override def dataType: DataType = ArrayType(child.dataType, containsNull = false)
My change wants to make NaNs UNEQUAL only if child.dataType
is double or float, otherwise ALL_EQUAL.
But when we call CudfCollectSet
or CudfMergeSets
we need to pass dataType
to it. So when I check if datatype
is Array(Double)
or Array(Float)
, what I really want to check is if child.dataType
is float or double.
(I hope I made it clear...)
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, I just had the same comments as Jason about not including nans in some places in the tests.
build |
Fixes #8808
collect_set
function currently does not support NaNs in struct[Array(Double)] or struct[Array(Float)] types. It is because Spark handling NaN's equality by different way for non-nested float/double and float/double in nested types.This PR checks and sets the
NaNEquality
in collectSet dynamically. If we are doing a collect set on an Float or Double column, then we set NaNEquality to UNEQUAL, but if we are doing a collect set on anything else we set it to ALL_EQUAL.Changes in tests and gpuOverrides are basically revert of #8783