Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match Spark's NaN handling in collect_set #8909

Merged
merged 9 commits into from
Aug 9, 2023

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Aug 2, 2023

Fixes #8808

collect_set function currently does not support NaNs in struct[Array(Double)] or struct[Array(Float)] types. It is because Spark handling NaN's equality by different way for non-nested float/double and float/double in nested types.

This PR checks and sets the NaNEquality in collectSet dynamically. If we are doing a collect set on an Float or Double column, then we set NaNEquality to UNEQUAL, but if we are doing a collect set on anything else we set it to ALL_EQUAL.

Changes in tests and gpuOverrides are basically revert of #8783

@thirtiseven thirtiseven marked this pull request as draft August 2, 2023 11:16
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Aug 2, 2023

Oh I didn't remove the xfailed case and it is still failed... Looking into it, convert it to draft
Passed now

@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven self-assigned this Aug 3, 2023
@thirtiseven thirtiseven marked this pull request as ready for review August 3, 2023 05:23
// Because NaNs in nested types are not supported yet.
// See https://github.com/NVIDIA/spark-rapids/issues/8808
if (c.child.dataType != FloatType && c.child.dataType != DoubleType &&
isOrContainsFloatingPoint(c.child.dataType)) {
Copy link
Collaborator

@abellina abellina Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check performed a recursive check for floats/doubles at all nested levels, but your changes seem to say that Array[Double] and Array[Float] are UNEQUAL. Should your code handle Array[Array[Double]] (and any level of nesting, including things like Struct[Array[Double]] for example)? Are those cases UNEQUAL or are they somehow compared differently. If so we should add comments clarifying why an array/struct a 2nd+ level is ok.

Copy link
Collaborator Author

@thirtiseven thirtiseven Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes it's a bit confusing. I added a comment.

In GpuCollectBase we have:

def child: Expression
override def dataType: DataType = ArrayType(child.dataType, containsNull = false)

My change wants to make NaNs UNEQUAL only if child.dataType is double or float, otherwise ALL_EQUAL.

But when we call CudfCollectSet or CudfMergeSets we need to pass dataType to it. So when I check if datatype is Array(Double) or Array(Float), what I really want to check is if child.dataType is float or double.

(I hope I made it clear...)

@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven requested review from abellina and revans2 August 7, 2023 01:09
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, I just had the same comments as Jason about not including nans in some places in the tests.

@sameerz sameerz added the feature request New feature or request label Aug 9, 2023
@thirtiseven
Copy link
Collaborator Author

@jlowe @revans2 Thanks for review, I removed those NaN datagens.

@thirtiseven
Copy link
Collaborator Author

build

@jlowe jlowe merged commit 1b9f83d into NVIDIA:branch-23.08 Aug 9, 2023
@sameerz sameerz added bug Something isn't working and removed feature request New feature or request labels Aug 9, 2023
@thirtiseven thirtiseven deleted the collect_set_fix_2 branch August 18, 2023 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Match Spark's NaN handling in collect_set for nested type contains float/double
5 participants