Add symmetric_difference #6947
Labels: enhancement (New feature or an improvement of an existing feature)
Comments
mcrumiller added the enhancement label on Feb 16, 2023
I note another solution from this Stack Overflow question, applied to my dataframes above. I modified it a bit, because that answer would fail if there are any duplicates in either frame, and the row count is unnecessary:
pl.concat((df1.unique(), df2.unique())).filter(pl.count().over(['a', 'b']) == 1)
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘
This may be more performant.
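Do an outer join followed by a filter.
(
    # Tag each frame with a marker column so the origin of a row is visible after the join.
    df1.with_columns(pl.lit(True).alias("df1"))
    .join(
        df2.with_columns(pl.lit(True).alias("df2")),
        on=['a', 'b'],
        how="outer",
    )
    # Keep only rows where the two markers differ, i.e. rows present in one frame only.
    .filter(pl.col("df1") != pl.col("df2"))
)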
shape: (4, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ df1  ┆ df2  │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ 6   ┆ 6   ┆ null ┆ true │
│ 7   ┆ 7   ┆ null ┆ true │
│ 1   ┆ 1   ┆ true ┆ null │
│ 2   ┆ 2   ┆ true ┆ null │
└─────┴─────┴──────┴──────┘
@ghuls that sounds like a memory disaster waiting to happen.
mcrumiller changed the title from "Add df1.symmetric difference(df2)" to "Add symmetric_difference" on Feb 16, 2023
Problem description
It's currently clunky to find the difference or symmetric difference between two dataframes. The one-way (asymmetric) difference is pretty easy using an anti-join:
So, to get the symmetric difference, we can combine these two, but it's not too pretty, and with a lot of duplicates I imagine it's not as performant as it could be:
A simple df1.symmetric_difference(df2) would be nice. I also think a simple df1.difference(df2) as a shorthand for an anti-join would be nice as well. But regardless, any suggestions for a better solution would be appreciated.