Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support various outcomes for NaN equality comparison in equality comparator #10781

Closed
ttnghia opened this issue May 3, 2022 · 5 comments
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@ttnghia
Copy link
Contributor

ttnghia commented May 3, 2022

Currently, cudf::equality_compare specialized for floating point numbers considers NaNs as equal. That result may not be desirable in some cases. Depending on situations, sometimes we want to have NaNs considered as equal, but some other time we want the opposite.

We should reimplement cudf::equality_compare and the corresponding comparators chain (element comparator, row comparator etc), adding an optional parameter:

enum class nan_equality /*unspecified*/ {
  ALL_EQUAL,  ///< All NaNs compare equal, regardless of sign
  UNEQUAL     ///< All NaNs compare unequal (IEEE754 behavior)
};

Currently, there are many APIs that support this parameter (such as lists::drop_list_duplicates) but these APIs implement such support locally. As more and more APIs adopt it, NaN handling should be supported in a more structural way.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels May 3, 2022
@ttnghia ttnghia added the Spark Functionality that helps Spark RAPIDS label May 3, 2022
@jrhemstad
Copy link
Contributor

See also #4760

I think this change to the row comparators is fine so long as it is proven that it does not impact performance for all existing cases.

However, I still maintain that it is a bad idea to make any promises that we will support the full extent of NaN nonsense that Spark does.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 20, 2022

I'm thinking about a way that allows having different NaN outcomes while that has minimal impact on cudf build and its downstream usage. One way for this is to have template internal APIs (similar to #10870) but we only instantiate one NaN config in cudf (which is the default, current config). The remaining config (which Spark wants) will be instantiated somewhere else (rapids-spark-jni).

@jrhemstad
Copy link
Contributor

I'm thinking about a way that allows having different NaN outcomes while that has minimal impact on cudf build and its downstream usage. One way for this is to have template internal APIs (similar to #10870) but we only instantiate one NaN config in cudf (which is the default, current config). The remaining config (which Spark wants) will be instantiated somewhere else (rapids-spark-jni).

That is the solution I described here: #4760 (comment)

It's possible for some things, but it would be a good bit of work to make it robust.

rapids-bot bot pushed a commit that referenced this issue Jun 3, 2022
Further splitting up #9452 -- split off at the suggestion of @bdice 

Related to #10781 and #4760 -- issues and discussions related to NaN comparison behavior.

Authors:
  - Ryan Lee (https://github.com/rwlee)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10870
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 20, 2022

Now NaN handling should be good with the current experimental row comparator. Close this.

@ttnghia ttnghia closed this as completed Jun 20, 2022
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

3 participants