[FEA] Add Python bindings for drop_list_duplicates #7414

argenisleon · 2021-02-19T01:46:08Z

Is your feature request related to a problem? Please describe.
I wish I could deduplicate the elements within each list of a list column. For example, applying the function to this Series:
0 [optimus, a, optimus]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list

should output
0 [optimus, a]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list

Describe the solution you'd like
In pandas you could use something like df.apply(lambda x: set(x['column_list']), axis=1). Not sure what the best approach would be in cuDF.
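One caveat with the `set()` idea is that a set does not keep element order, while the expected output above keeps each element at its first position. As a minimal sketch of the intended per-row behavior, here is a pure-Python version on plain lists (no pandas or cuDF involved; `dedupe_keep_order` is a hypothetical helper name used only for illustration):

```python
def dedupe_keep_order(row):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this keeps the first occurrence of every element.
    return list(dict.fromkeys(row))


rows = [["optimus", "a", "optimus"], ["optimus", "b"], ["optimus", "c"]]
deduped = [dedupe_keep_order(row) for row in rows]
print(deduped)
# [['optimus', 'a'], ['optimus', 'b'], ['optimus', 'c']]
```

This only illustrates the desired result for one row at a time; it says nothing about how a GPU implementation would do it.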

Describe alternatives you've considered

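import cudf

text = ["Optimus a",
        "Optimus b Optimus",
        "Optimus c"]
df = cudf.DataFrame({'text': text, 'doc_id': [0, 1, 2]})

# Flatten every document into one token per row and record which document each token came from.
tokens = df.text.str.tokenize(' ').reset_index(drop=True)
tk_cnts = df.text.str.token_count(' ')
global_pos = df.doc_id.repeat(tk_cnts).reset_index(drop=True)

tokenized_df = cudf.DataFrame({'token': tokens, 'doc_id': global_pos})
# Running counter that records the original position of each token.
tokenized_df['token_order'] = 1
tokenized_df['token_order'] = tokenized_df['token_order'].cumsum()

tokenized_df.columns = ['token_ser', 'output_column_id_ser', 'output_token_order_ser']
# Keep one row per (token, document) pair.
tokenized_df = tokenized_df.drop_duplicates(["token_ser", "output_column_id_ser"])
# detokenize expects a Series of row indices, not a column name, and returns one string per document.
deduped_text = tokenized_df["token_ser"].str.detokenize(tokenized_df["output_column_id_ser"])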

kkraus14 · 2021-02-19T01:49:48Z

I think this would make sense to expose as something like cudf.Series.list.unique() in cuDF Python

ttnghia · 2021-03-07T22:08:53Z

We have a new PR for drop_list_duplicates in libcudf (C++): #7528
Please work on the Python side if necessary. Thanks.

@ttnghia

Closes #7494 and partially addresses #7414.

This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from lists column. The result is a new lists column in which each list row contains only unique entries. By current implementation, the output lists will have entries sorted by ascending order (null(s) last).

Example with null_equality=EQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
```

Example with null_equality=UNEQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }
```

Authors:
- Nghia Truong (@ttnghia)

Approvers:
- AJ Schmidt (@ajschmidt8)
- @nvdbaranec
- David (@davidwendt)
- Keith Kraus (@kkraus14)

URL: #7528
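The behavior described in this commit message (unique entries per row, sorted ascending, nulls last, with null equality switchable) can be modeled in a few lines of plain Python. This is only a reference sketch of the described semantics, not the libcudf implementation; the function name `drop_list_duplicates_model` and the `nulls_equal` flag are hypothetical, and `None` stands in for NULL.

```python
def drop_list_duplicates_model(rows, nulls_equal=True):
    """Model the described semantics on a list of Python lists (None = NULL)."""
    result = []
    for row in rows:
        if row is None:
            # A null row stays null.
            result.append(None)
            continue
        # Unique non-null entries, sorted ascending.
        values = sorted({v for v in row if v is not None})
        null_count = sum(v is None for v in row)
        if nulls_equal:
            # Equal nulls collapse into at most one entry.
            null_count = min(null_count, 1)
        # Nulls are placed last.
        result.append(values + [None] * null_count)
    return result


rows = [[1, 1, 2, 1, 3], [4], None, [], [None, None, None, 5, 6, 6, 6, 5]]
print(drop_list_duplicates_model(rows, nulls_equal=True))
# [[1, 2, 3], [4], None, [], [5, 6, None]]
print(drop_list_duplicates_model(rows, nulls_equal=False))
# [[1, 2, 3], [4], None, [], [5, 6, None, None, None]]
```

Both printed results match the two examples given in the commit message.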

@isVoid

Closes #7414

This PR adds `list.unique` API. Following `Series.unique` behavior, this API treats null values as equal, and treats all nans as equal. This API does not guarantee the order of list elements.

Example:
```python
>>> s = cudf.Series([[1, 1, 2, None, None], None, [np.nan, np.nan], []])
>>> s.list.unique() # Order of list elements is not guaranteed
0 [1.0, 2.0, nan]
1 None
2 [nan]
3 []
dtype: list
```

Authors:
- Michael Wang (@isVoid)

Approvers:
- Keith Kraus (@kkraus14)
- Nghia Truong (@ttnghia)

URL: #7664
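Treating all NaNs as equal needs explicit handling, because under IEEE 754 comparison a NaN is not equal to itself, so a naive equality-based deduplication would keep every NaN. The following plain-Python sketch models the stated rules (nulls equal, NaNs equal) for a single row. It is an illustration under those assumptions, not the cuDF code path; `unique_model` is a hypothetical name, and the sketch keeps nulls as `None` rather than reproducing how cuDF prints them.

```python
import math


def unique_model(row):
    """Deduplicate one row, treating all None as equal and all NaN as equal."""
    if row is None:
        return None
    seen = set()
    out = []
    for value in row:
        if value is None:
            key = ("null",)
        elif isinstance(value, float) and math.isnan(value):
            # nan != nan, so map every NaN to one shared key.
            key = ("nan",)
        else:
            key = ("value", value)
        if key not in seen:
            seen.add(key)
            out.append(value)
    return out


rows = [[1.0, 1.0, 2.0, None, None], None, [float("nan"), float("nan")], []]
result = [unique_model(row) for row in rows]
```

The sketch happens to keep first-occurrence order, but the API described above makes no ordering promise, so callers should not rely on one.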

argenisleon added the Needs Triage and feature request labels Feb 19, 2021

kkraus14 added the Python and libcudf labels and removed the Needs Triage label Feb 19, 2021

ttnghia mentioned this issue Feb 23, 2021

Implement groupby collect_set #7420

Merged

This was referenced Mar 3, 2021

[FEA] Implement drop_list_duplicates #7494

Closed

Implement drop_list_duplicates #7528

Merged

randerzander changed the title ~~[FEA] Deduplicate element in column lists~~ [FEA] Add Python bindings for drop_list_duplicates Mar 12, 2021

This was referenced Mar 12, 2021

[FEA] Python bindings for lists::sort #7467

Closed

[FEA] Support drop_duplicates on Series containing list objects #6784

Closed

kkraus14 assigned isVoid Mar 18, 2021

isVoid mentioned this issue Mar 22, 2021

Adds list.unique API #7664

Merged

kkraus14 mentioned this issue Mar 23, 2021

[FEA] Python bindings for lists::drop_list_duplicates #7582

Closed

rapids-bot bot closed this as completed in #7664 Mar 31, 2021
