Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add symmetric_difference #6947

Open
mcrumiller opened this issue Feb 16, 2023 · 4 comments
Open

Add symmetric_difference #6947

mcrumiller opened this issue Feb 16, 2023 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mcrumiller
Copy link
Contributor

mcrumiller commented Feb 16, 2023

Problem description

It's currently clunky to find the difference or symmetric difference between two dataframes. The one-way (asymmetric) difference is pretty easy using an anti-join:

import polars as pl

df1 = pl.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, 2, 3, 4]
})
df2 = pl.DataFrame({
    'a': [3, 4, 6, 7],
    'b': [3, 4, 6, 7]
})

# returns items in A that aren't in B
df1.join(df2, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ ab   │
│ ------ │
│ i64i64 │
╞═════╪═════╡
│ 11   │
│ 22   │
└─────┴─────┘

# returns items in B that aren't in A
df2.join(df1, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ ab   │
│ ------ │
│ i64i64 │
╞═════╪═════╡
│ 66   │
│ 77   │
└─────┴─────┘

so to the get the symmetric difference, we can combine these two, but it's not too pretty, and with a lot of duplicates I imagine it's not as performant as it could be:

# use dual anti-joins
df1.join(df2, on=['a', 'b'], how="anti").vstack(df2.join(df1, on=['a', 'b'], how="anti")).unique()
shape: (4, 2)
┌─────┬─────┐
│ ab   │
│ ------ │
│ i64i64 │
╞═════╪═════╡
│ 11   │
│ 22   │
│ 66   │
│ 77   │
└─────┴─────┘

A simple df1.symmetric_difference(df2) would be nice. I also think a simple df1.difference(df2) as a shorthand for an anti-join would be nice as well.

But regardless, any suggestions for a better solution would be appreciated.

@mcrumiller mcrumiller added the enhancement New feature or an improvement of an existing feature label Feb 16, 2023
@mcrumiller
Copy link
Contributor Author

I note another solution from this stackoverflow question using my above dataframes, modified a bit because that answer would fail if there are any duplicates in either frame, and the row count is unnecessary:

pl.concat((df1.unique(), df2.unique())).filter(pl.count().over(['a', 'b']) == 1)
shape: (4, 2)
┌─────┬─────┐
│ ab   │
│ ------ │
│ i64i64 │
╞═════╪═════╡
│ 11   │
│ 22   │
│ 66   │
│ 77   │
└─────┴─────┘

which may be more performant.

@ghuls
Copy link
Collaborator

ghuls commented Feb 16, 2023

Do an outer join followed by a filter.

df1.with_columns(
    pl.lit(True).alias("df1")
)
.join(
    df2.with_columns(
        pl.lit(True).alias("df2")
    ),
    on=['a', 'b'], 
    how="outer"
).filter(
    pl.col("df1") != pl.col("df2")
)


shape: (4, 4)
┌─────┬─────┬──────┬──────┐
│ abdf1df2  │
│ ------------  │
│ i64i64boolbool │
╞═════╪═════╪══════╪══════╡
│ 66nulltrue │
│ 77nulltrue │
│ 11truenull │
│ 22truenull │
└─────┴─────┴──────┴──────┘

@mcrumiller
Copy link
Contributor Author

@ghuls that sounds like a memory disaster waiting to happen.

@mcrumiller mcrumiller changed the title Add df1.symmetric difference(df2) Add symmetric_difference Feb 16, 2023
@ghuls
Copy link
Collaborator

ghuls commented Feb 17, 2023

@ghuls that sounds like a memory disaster waiting to happen.

If you do it in lazy mode with streaming, it shouldn't be a problem:

#5339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants