[FEA] Add Python bindings for drop_list_duplicates #7414

Closed
argenisleon opened this issue Feb 19, 2021 · 2 comments · Fixed by #7664
Labels: feature request, libcudf, Python

Comments

@argenisleon

Is your feature request related to a problem? Please describe.
I wish I could deduplicate the elements within each list of a list column. For example, applying the function to this column:
0 [optimus, a, optimus]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list

should output
0 [optimus, a]
1 [optimus, b]
2 [optimus, c]
Name: A, dtype: list

Describe the solution you'd like
In pandas you could use something like df.apply(lambda x: set(x['column_list']), axis=1). I am not sure what the best approach would be in cuDF.
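To make the requested behaviour concrete, here is a minimal pure-Python sketch (no pandas or cuDF involved). The helper name `dedupe_rows` is hypothetical. Note that `set()` does not preserve element order, so the sketch uses `dict.fromkeys`, which keeps the first occurrence of each element in its original position.

```python
def dedupe_rows(rows):
    """Drop repeated elements within each list, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), unlike set()
    return [list(dict.fromkeys(row)) for row in rows]


rows = [["optimus", "a", "optimus"], ["optimus", "b"], ["optimus", "c"]]
result = dedupe_rows(rows)
print(result)
```

This only illustrates the expected result for the example above; it runs on the CPU and says nothing about how a GPU implementation would work.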

Describe alternatives you've considered

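import cudf

text = ["Optimus a",
        "Optimus b Optimus",
        "Optims c"]
df = cudf.DataFrame({'text': text, 'doc_id': [0, 1, 2]})
# Split every string into tokens and count the tokens per row
tokens = df.text.str.tokenize(' ')
tk_cnts = df.text.str.token_count(' ')
# Repeat each doc_id once per token so that every token carries its source row
global_pos = df.doc_id.repeat(tk_cnts).reset_index(drop=True)

tokenized_df = cudf.DataFrame({'token': tokens, 'doc_id': global_pos})
# Running counter that records the original position of each token
tokenized_df['token_order'] = 1
tokenized_df['token_order'] = tokenized_df['token_order'].cumsum()

tokenized_df.columns = ['token_ser', 'output_column_id_ser', 'output_token_order_ser']
# Keep only the first occurrence of each token within a row
tokenized_df = tokenized_df.drop_duplicates(["token_ser", "output_column_id_ser"])
# Join the surviving tokens back into one string per row; detokenize takes the row ids as a Series
deduped = tokenized_df["token_ser"].str.detokenize(tokenized_df["output_column_id_ser"])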
@argenisleon argenisleon added Needs Triage Need team to review and classify feature request New feature or request labels Feb 19, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Feb 19, 2021
@kkraus14
Collaborator

I think this would make sense to expose as something like cudf.Series.list.unique() in cuDF Python

@ttnghia
Contributor

ttnghia commented Mar 7, 2021

We have a new PR for drop_list_duplicates in cudf C++: #7528
Please work on the python side if necessary. Thanks.

rapids-bot bot pushed a commit that referenced this issue Mar 12, 2021
Closes #7494 and partially addresses #7414.

This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from each list in a lists column. The result is a new lists column in which each list row contains only unique entries. In the current implementation, the entries of each output list are sorted in ascending order, with nulls last.

Example with null_equality=EQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }

```

Example with null_equality=UNEQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }

```
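The two examples above can be reproduced with a small pure-Python model of the described semantics. This is only a sketch under stated assumptions: `drop_list_duplicates_model` and its `nulls_equal` flag are hypothetical names standing in for the libcudf function and its `null_equality` argument, and `None` stands in for NULL at both the row and the element level.

```python
def drop_list_duplicates_model(lists, nulls_equal=True):
    """Hypothetical CPU model of the semantics described above (not libcudf code)."""
    out = []
    for row in lists:
        if row is None:
            # A NULL row stays NULL
            out.append(None)
            continue
        # Unique non-null entries, sorted in ascending order
        values = sorted({x for x in row if x is not None})
        null_count = sum(x is None for x in row)
        if nulls_equal:
            # EQUAL: all nulls compare equal, so at most one survives
            null_count = min(null_count, 1)
        # Nulls are placed last
        out.append(values + [None] * null_count)
    return out


data = [[1, 1, 2, 1, 3], [4], None, [], [None, None, None, 5, 6, 6, 6, 5]]
equal = drop_list_duplicates_model(data, nulls_equal=True)
unequal = drop_list_duplicates_model(data, nulls_equal=False)
print(equal)
print(unequal)
```

With `nulls_equal=False` every null is treated as distinct from every other null, so none of them is removed.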

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - @nvdbaranec
  - David (@davidwendt)
  - Keith Kraus (@kkraus14)

URL: #7528
@randerzander randerzander changed the title from "[FEA] Deduplicate element in column lists" to "[FEA] Add Python bindings for drop_list_duplicates" Mar 12, 2021
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 31, 2021
Closes #7414 

This PR adds the `list.unique` API. Following the behavior of `Series.unique`, this API treats null values as equal and treats all NaNs as equal. It does not guarantee the order of list elements. Example:

```python
>>> s = cudf.Series([[1, 1, 2, None, None], None, [np.nan, np.nan], []])
>>> s.list.unique() # Order of list elements is not guaranteed
0              [1.0, 2.0, nan]
1                         None
2                        [nan]
3                           []
dtype: list
```
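The equality rules stated above (nulls equal to each other, NaNs equal to each other) can be modelled in plain Python. This is a hedged sketch, not the cuDF implementation: `list_unique_model` is a hypothetical name, `None` stands in for null, and the model keeps null and NaN as separate values, which may differ from how cuDF stores or prints them. NaN needs explicit handling because it compares unequal to itself.

```python
import math


def list_unique_model(rows):
    """Hypothetical CPU model: per-row unique with nulls equal and NaNs equal."""
    out = []
    for row in rows:
        if row is None:
            # A null row stays null
            out.append(None)
            continue
        seen = set()
        seen_null = seen_nan = False
        uniq = []
        for x in row:
            if x is None:
                # All nulls are treated as one value
                if not seen_null:
                    seen_null = True
                    uniq.append(None)
            elif isinstance(x, float) and math.isnan(x):
                # NaN != NaN in Python, so track it with a flag instead of the set
                if not seen_nan:
                    seen_nan = True
                    uniq.append(x)
            elif x not in seen:
                seen.add(x)
                uniq.append(x)
        out.append(uniq)
    return out


result = list_unique_model([[1, 1, 2, None, None], None, [float("nan"), float("nan")], []])
print(result)
```

The model happens to return elements in first-seen order, but, as noted above, callers of `list.unique` should not rely on any particular order.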

Authors:
  - Michael Wang (@isVoid)

Approvers:
  - Keith Kraus (@kkraus14)
  - Nghia Truong (@ttnghia)

URL: #7664