Add symmetric_difference #6947

mcrumiller · 2023-02-16T20:30:18Z

Problem description

It's currently clunky to find the difference or symmetric difference between two dataframes. The one-way (asymmetric) difference is pretty easy using an anti-join:

import polars as pl

df1 = pl.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, 2, 3, 4]
})
df2 = pl.DataFrame({
    'a': [3, 4, 6, 7],
    'b': [3, 4, 6, 7]
})

# returns items in A that aren't in B
df1.join(df2, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
└─────┴─────┘

# returns items in B that aren't in A
df2.join(df1, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

So to get the symmetric difference, we can combine these two, but it's not too pretty, and with a lot of duplicates I imagine it's not as performant as it could be:

# use dual anti-joins
df1.join(df2, on=['a', 'b'], how="anti").vstack(df2.join(df1, on=['a', 'b'], how="anti")).unique()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

A simple df1.symmetric_difference(df2) would be nice. I also think a simple df1.difference(df2) as a shorthand for an anti-join would be nice as well.

But regardless, any suggestions for a better solution would be appreciated.


mcrumiller · 2023-02-16T20:38:47Z

I note another solution from this Stack Overflow question, applied to my dataframes above. I modified it a bit because that answer would fail if there are any duplicates in either frame, and the row count is unnecessary:

pl.concat((df1.unique(), df2.unique())).filter(pl.count().over(['a', 'b']) == 1)
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

which may be more performant.
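To show why the `unique()` calls matter here, a plain-Python sketch of the same counting idea follows. The row lists are made-up illustration data (with `(1, 1)` deliberately duplicated in the first list), and `rows_seen_once` is a hypothetical helper, not anything from polars.

```python
from collections import Counter

# Illustration data: (1, 1) appears twice in the first "frame" on purpose.
rows1 = [(1, 1), (2, 2), (3, 3), (4, 4), (1, 1)]
rows2 = [(3, 3), (4, 4), (6, 6), (7, 7)]


def rows_seen_once(left, right, dedupe):
    """Keep rows whose count over the concatenated inputs is exactly 1."""
    if dedupe:
        # Mirrors calling .unique() on each frame before concatenating.
        left, right = set(left), set(right)
    counts = Counter([*left, *right])
    return sorted(row for row, n in counts.items() if n == 1)


# Without de-duplication, (1, 1) is counted twice and wrongly dropped.
without_dedupe = rows_seen_once(rows1, rows2, dedupe=False)
# With de-duplication, a count of 1 means "present in exactly one input".
with_dedupe = rows_seen_once(rows1, rows2, dedupe=True)
```

Only after each input is de-duplicated does "count == 1" coincide with membership in the symmetric difference.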

ghuls · 2023-02-16T20:42:52Z

Do an outer join followed by a filter.

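# Wrapped in parentheses so the chained calls can span several lines.
(
    df1.with_columns(
        pl.lit(True).alias("df1")  # marks rows that came from df1
    )
    .join(
        df2.with_columns(
            pl.lit(True).alias("df2")  # marks rows that came from df2
        ),
        on=['a', 'b'],
        how="outer"
    )
    .filter(
        pl.col("df1") != pl.col("df2")
    )
)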


shape: (4, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ df1  ┆ df2  │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ 6   ┆ 6   ┆ null ┆ true │
│ 7   ┆ 7   ┆ null ┆ true │
│ 1   ┆ 1   ┆ true ┆ null │
│ 2   ┆ 2   ┆ true ┆ null │
└─────┴─────┴──────┴──────┘

mcrumiller · 2023-02-16T21:43:12Z

@ghuls that sounds like a memory disaster waiting to happen.

ghuls · 2023-02-17T14:17:28Z

> @ghuls that sounds like a memory disaster waiting to happen.

If you do it in lazy mode with streaming, it shouldn't be a problem:

#5339

mcrumiller added the enhancement label (New feature or an improvement of an existing feature) Feb 16, 2023

mcrumiller changed the title ~~Add df1.symmetric difference(df2)~~ Add symmetric_difference Feb 16, 2023

mcrumiller mentioned this issue Jul 16, 2023

Request for more set operations (is_disjoint, is_subset,is_superset) #9908

Open

mcrumiller mentioned this issue Dec 1, 2023

Add set operations (intersection, difference, union, ...) to Series and Expr #12806

Open
