Refactor `collect_set` to use `cudf::distinct` and `cudf::lists::distinct` #11228

ttnghia · 2022-07-08T17:50:33Z

The current groupby/reduction collect_set aggregations use lists::drop_list_duplicates to generate set(s) of distinct elements. This PR changes them to use cudf::distinct and cudf::lists::distinct instead, which have several advantages, including:

Fully supporting nested types, and
Achieving better performance (O(n) instead of O(n log n)) by internally using a hash table instead of a segmented sort.
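The complexity difference above can be illustrated with a minimal host-side sketch. This is not cudf's implementation: it uses only the C++ standard library, and the function names `distinct_by_sort` and `distinct_by_hash` are hypothetical, chosen for illustration. It shows why a sort-based approach yields sorted output as a side effect, while a hash-based approach does not.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical illustration only; not the cudf code path.
// Sort-based deduplication: O(n log n). The output is sorted as a side effect.
std::vector<int> distinct_by_sort(std::vector<int> values)
{
  std::sort(values.begin(), values.end());
  values.erase(std::unique(values.begin(), values.end()), values.end());
  return values;
}

// Hash-based deduplication: expected O(n). The output is NOT sorted.
// This sketch happens to keep first-occurrence order; a parallel GPU hash
// table need not guarantee even that.
std::vector<int> distinct_by_hash(std::vector<int> const& values)
{
  std::unordered_set<int> seen;
  seen.reserve(values.size());
  std::vector<int> result;
  for (int v : values) {
    // insert() reports whether the value was newly added to the set.
    if (seen.insert(v).second) { result.push_back(v); }
  }
  return result;
}
```

For the input `{3, 1, 3, 2, 1}`, `distinct_by_sort` returns `{1, 2, 3}` while `distinct_by_hash` returns `{3, 1, 2}`: the same set of elements, in a different order.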

This also enables nested types support for collect_set in spark-rapids (issue NVIDIA/spark-rapids#5508).

The changes in the Java code here only fix unit tests. Those tests were written with the assumption that collect_set results are sorted, so they fail now that the results are no longer sorted.
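One way to make such tests independent of element order is to normalize each result list before comparing. The sketch below shows the idea with plain standard-library containers. It is an assumption about the general technique, not the actual change made to the Java or C++ tests in this PR, and the names `list_of_sets`, `normalize`, and `equal_as_sets` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a lists column: one inner vector per row.
using list_of_sets = std::vector<std::vector<int>>;

// Sort the elements inside each row so that element order no longer matters.
// Row order is left untouched.
list_of_sets normalize(list_of_sets rows)
{
  for (auto& row : rows) { std::sort(row.begin(), row.end()); }
  return rows;
}

// Two results are equivalent if every row holds the same elements,
// regardless of the order in which they appear within the row.
bool equal_as_sets(list_of_sets const& actual, list_of_sets const& expected)
{
  return normalize(actual) == normalize(expected);
}
```

With this helper, an unsorted result such as `{{3, 1, 2}, {5, 4}}` compares equal to the expected `{{1, 2, 3}, {4, 5}}`, while results with different elements still compare unequal.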

java/src/test/java/ai/rapids/cudf/ReductionTest.java

jlowe

Java approval

mythrocks

Some minor nitpicks. Looks good, otherwise.

cpp/src/groupby/sort/aggregate.cpp

cpp/src/reductions/collect_ops.cu

cpp/tests/groupby/collect_set_tests.cpp

cpp/src/reductions/collect_ops.cu

cpp/tests/groupby/merge_sets_tests.cpp

mythrocks

LGTM!

mythrocks · 2022-07-14T22:03:33Z

Ah, it looks like #11250 wrecked this. (I didn't realize it's been merged.)
#11250 was a simple change. One only needs to switch drop_list_duplicates to distinct in rolling/detail/rolling.cuh.

Edit: @ttnghia beat me to it. Looks good now. 👍

ttnghia · 2022-07-15T01:58:24Z

@gpucibot merge

This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::lists::distinct`, which has a simpler implementation and better performance. The replacements for internal cudf usage have all been merged already, so this work has no side effects on the existing APIs and does not break them.

Closes #11114, #11093, #11053, #11034, and #9257.

Depends on:
* #11228
* #11149
* #11234
* #11233

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jordan Jacobelli (https://github.com/Ethyling)
- Robert Maynard (https://github.com/robertmaynard)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #11236

ttnghia added 30 commits June 24, 2022 14:51

Add new implementation and test files

1d7e8e0

Fix compile error

51b80db

Rename function

08a76ad

Implement cudf::detail::stable_distinct and lists::distinct

16101f7

Rewrite doxygen

5ec13d6

Rename variable

6c5b738

Rewrite comment

5b70eee

Rename files

238248d

Implement float tests

ba6bf6b

Implement string tests

3845c95

Implement tests for ListDistinctTypedTest

507c82d

Complete the remaining tests

2cb8347

Merge branch 'branch-22.08' into add_lists_distinct

7efdea0

Rewrite doxygen

4388637

Misc

4dd5e74

Misc

3b0760c

Rewrite test

9730b70

Fix doxygen

9bd9b6f

Fix header

790a482

Rewrite doxygen

1c58baa

Rewrite doxygen and fix headers

d493c4f

Fix iterator type

d090d2a

Rewrite doxygen

ee51822

Add empty lines

ccdd6f0

Merge branch 'branch-22.08' into add_lists_distinct

034ee2a

Update default stream

b1231a2

Merge branch 'branch-22.08' into add_lists_distinct

af91b80

Merge branch 'branch-22.08' into add_lists_distinct

86c9ba8

Handle empty input

99d70b1

Merge branch 'add_lists_distinct' into refactor_collect_set

cf965f6

github-actions bot added the Java (Affects Java cuDF API) label Jul 12, 2022

ttnghia added the 3 - Ready for Review (Ready for review by team) label and removed the 0 - Blocked (Cannot progress due to external reasons) label Jul 12, 2022

jlowe reviewed Jul 12, 2022

java/src/test/java/ai/rapids/cudf/ReductionTest.java

java/src/test/java/ai/rapids/cudf/ReductionTest.java

jlowe approved these changes Jul 12, 2022


mythrocks requested changes Jul 12, 2022

cpp/src/groupby/sort/aggregate.cpp

cpp/src/reductions/collect_ops.cu

cpp/src/reductions/collect_ops.cu

cpp/tests/groupby/collect_set_tests.cpp

ttnghia added 5 commits July 13, 2022 10:46

Misc

37d60cf

Optimize collect_set in reduction

1f10a16

Rewrite collect_set_tests

eee0bbb

Merge branch 'branch-22.08' into refactor_collect_set

23c44e4

Misc

b55ba30

ttnghia requested a review from mythrocks July 13, 2022 18:24

davidwendt reviewed Jul 13, 2022

cpp/src/reductions/collect_ops.cu

Add extra blank line

73287ec

davidwendt reviewed Jul 13, 2022

cpp/tests/groupby/merge_sets_tests.cpp

Use sort_by_key

47d3298

ttnghia requested a review from davidwendt July 13, 2022 19:48

Add/remove comments

a23356c

davidwendt approved these changes Jul 14, 2022


mythrocks approved these changes Jul 14, 2022


Merge branch 'branch-22.08' into refactor_collect_set

b1f1890

rapids-bot bot merged commit b654597 into rapidsai:branch-22.08 Jul 15, 2022

NVnavkumar mentioned this pull request Jul 25, 2022

Add support for nested types to collect_set(...) on the GPU [databricks] NVIDIA/spark-rapids#6079

Merged

ttnghia deleted the refactor_collect_set branch July 28, 2022 16:47
