[FEA] Add Python bindings for drop_list_duplicates #7414
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
Comments
I think this would make sense to expose as something like |
This was referenced Mar 3, 2021
We have a new PR for |
rapids-bot bot
pushed a commit
that referenced
this issue
Mar 12, 2021
Closes #7494 and partially addresses #7414.

This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from a lists column. The result is a new lists column in which each list row contains only unique entries. In the current implementation, the entries of each output list are sorted in ascending order (nulls last).

Example with null_equality=EQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
```

Example with null_equality=UNEQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }
```

Authors:
- Nghia Truong (@ttnghia)

Approvers:
- AJ Schmidt (@ajschmidt8)
- @nvdbaranec
- David (@davidwendt)
- Keith Kraus (@kkraus14)

URL: #7528
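To make the two null-equality modes concrete, here is a minimal pure-Python model of the behavior described in the commit message. It is only a sketch of the semantics, not the libcudf implementation: the function name `drop_list_duplicates_model` and the `nulls_equal` flag are hypothetical, and `None` stands in for NULL.

```python
def drop_list_duplicates_model(rows, nulls_equal=True):
    """Pure-Python model of the described semantics (hypothetical helper)."""
    out = []
    for row in rows:
        if row is None:
            # A null list row stays null.
            out.append(None)
            continue
        # Unique non-null entries, sorted in ascending order.
        values = sorted({x for x in row if x is not None})
        null_count = sum(1 for x in row if x is None)
        if nulls_equal:
            # With EQUAL semantics, all nulls collapse into a single null.
            null_count = min(null_count, 1)
        # Nulls are placed last.
        out.append(values + [None] * null_count)
    return out


rows = [[1, 1, 2, 1, 3], [4], None, [], [None, None, None, 5, 6, 6, 6, 5]]
equal = drop_list_duplicates_model(rows, nulls_equal=True)
unequal = drop_list_duplicates_model(rows, nulls_equal=False)
```

With the input from the examples above, `equal` and `unequal` reproduce the two documented outputs: UNEQUAL keeps every null because no null compares equal to another.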
This was referenced Mar 12, 2021
Merged
rapids-bot bot
pushed a commit
that referenced
this issue
Mar 31, 2021
Closes #7414.

This PR adds the `list.unique` API. Following `Series.unique` behavior, this API treats null values as equal and treats all NaNs as equal. It does not guarantee the order of list elements.

Example:
```python
>>> import cudf
>>> import numpy as np
>>> s = cudf.Series([[1, 1, 2, None, None], None, [np.nan, np.nan], []])
>>> s.list.unique() # Order of list elements is not guaranteed
0 [1.0, 2.0, nan]
1 None
2 [nan]
3 []
dtype: list
```

Authors:
- Michael Wang (@isVoid)

Approvers:
- Keith Kraus (@kkraus14)
- Nghia Truong (@ttnghia)

URL: #7664
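The equality rules stated above can be illustrated without a GPU. The following is a rough pure-Python model, assuming `None` represents a null; `list_unique_model` is a hypothetical helper, not part of cuDF. Unlike the real API, this model happens to keep first-occurrence order and does not convert values to floats.

```python
import math

# Shared key so that every NaN is treated as the same value.
_NAN_KEY = object()


def list_unique_model(rows):
    """Model of the described rules: nulls equal nulls, NaNs equal NaNs."""
    out = []
    for row in rows:
        if row is None:
            # A null row stays null.
            out.append(None)
            continue
        seen = set()
        unique = []
        for x in row:
            is_nan = isinstance(x, float) and math.isnan(x)
            key = _NAN_KEY if is_nan else x
            if key not in seen:
                seen.add(key)
                unique.append(x)
        out.append(unique)
    return out


rows = [[1, 1, 2, None, None], None, [float("nan"), float("nan")], []]
result = list_unique_model(rows)
```

The explicit NaN key is needed because `float("nan") == float("nan")` is `False` in Python, so two separately created NaN objects would otherwise both survive.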
Is your feature request related to a problem? Please describe.
I wish I could deduplicate the elements of each list in a list column. For example, applying such a function to this column:
0 [optimus, a, optimus]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list
should output
0 [optimus, a]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list
Describe the solution you'd like
In pandas you could use something like `df.apply(lambda x: set(x['column_list']), axis=1)`. I'm not sure what the best approach would be in cuDF.
Describe alternatives you've considered
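One caveat with the `set`-based pandas idiom is that a `set` does not preserve element order, whereas the desired output above keeps each element at its first occurrence. A minimal CPU-side sketch of an order-preserving variant, shown on plain Python lists as an assumption rather than a cuDF solution:

```python
rows = [["optimus", "a", "optimus"], ["optimus", "b"], ["optimus", "c"]]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so converting back to a list yields an order-preserving deduplication.
deduped = [list(dict.fromkeys(row)) for row in rows]
```

In pandas, the same expression could be applied per row, for example `df['column_list'].apply(lambda x: list(dict.fromkeys(x)))`, where `column_list` is the column name used in the snippet above.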