[FEA] Simplify code for NaN handling in lists/drop_list_duplicates #9257

Closed
ttnghia opened this issue Sep 20, 2021 · 2 comments · Fixed by #11236
Labels
feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Comments

ttnghia commented Sep 20, 2021

Currently, drop_list_duplicates requires an input parameter specifying whether NaN values should be considered equal. This parameter exists because Pandas and Spark expect different behaviors. Inside drop_list_duplicates, the implementation has to pass that parameter down through multiple levels, which increases the complexity of the code and makes it burdensome to maintain.

We should simplify the code somehow, either by reducing the number of code paths or at least by no longer passing the parameter down. Another potential approach is the one recommended in #9202 (comment), which is worth exploring.
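To make the idea concrete, here is a minimal host-side sketch of one way to stop passing the parameter down: the NaN policy is captured once in an equality functor, and the de-duplication routine only ever sees that functor. This is not libcudf code. The names `nan_policy`, `element_equal`, and `drop_duplicates` are hypothetical and exist only for illustration, and the quadratic loop stands in for whatever the real GPU implementation does.

```cpp
#include <cmath>
#include <vector>

// Hypothetical policy enum, loosely modeled on the parameter described above.
enum class nan_policy { all_equal, unequal };

// Equality functor that owns the NaN policy. It is the only place in this
// sketch that knows NaN needs special treatment.
struct element_equal {
  nan_policy policy;

  bool operator()(float lhs, float rhs) const
  {
    if (std::isnan(lhs) && std::isnan(rhs)) { return policy == nan_policy::all_equal; }
    return lhs == rhs;
  }
};

// Generic de-duplication that keeps the first occurrence of each element.
// It takes no NaN parameter; all it needs is the comparator (illustrative
// O(n^2) host code, not how a GPU implementation would do it).
template <typename T, typename Equal>
std::vector<T> drop_duplicates(std::vector<T> const& input, Equal equal)
{
  std::vector<T> output;
  for (T const& value : input) {
    bool seen = false;
    for (T const& kept : output) {
      if (equal(kept, value)) {
        seen = true;
        break;
      }
    }
    if (!seen) { output.push_back(value); }
  }
  return output;
}
```

A caller would pick the policy once at the top level, for example `drop_duplicates(values, element_equal{nan_policy::all_equal})`, and no inner helper would need a NaN argument of its own.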

@ttnghia ttnghia added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 20, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@ttnghia ttnghia self-assigned this Jun 30, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::lists::distinct`, which has a simpler implementation and better performance. All internal cudf usages were already migrated in earlier PRs, so this change has no side effects and does not break any existing APIs.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
 * #11228
 * #11149
 * #11234
 * #11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #11236