Add set operations (intersection, difference, union, ...) to Series and Expr #12806

Open
Julian-J-S opened this issue Nov 30, 2023 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@Julian-J-S
Contributor

Description

Currently, the set operations are only available in the list namespace, which is usually used with columns of a list[...] type.
However, these operations are also very useful directly at the Series and Expr level.
See examples below:

import polars as pl

s1 = pl.Series([1, 2, 3])
s2 = pl.Series([1, 2, 7, 8])
s1.implode().list.set_intersection(s2.implode())
#shape: (1,)
#Series: '' [list[i64]]
#[
#	[1, 2]
#]


pl.DataFrame(
    {
        "x1": [0, 1, 2, 3],
        "x2": [9, 6, 3, 0],
    }
).with_columns(
    intersection=pl.col("x1").implode().list.set_intersection(pl.col("x2").implode())
)
#shape: (4, 3)
#┌─────┬─────┬──────────────┐
#│ x1  ┆ x2  ┆ intersection │
#│ --- ┆ --- ┆ ---          │
#│ i64 ┆ i64 ┆ list[i64]    │
#╞═════╪═════╪══════════════╡
#│ 0   ┆ 9   ┆ [0, 3]       │
#│ 1   ┆ 6   ┆ [0, 3]       │
#│ 2   ┆ 3   ┆ [0, 3]       │
#│ 3   ┆ 0   ┆ [0, 3]       │
#└─────┴─────┴──────────────┘

Having the set operations directly available would clean up the code a lot and make it much more readable and intuitive.

Series

  • OLD: s1.implode().list.set_intersection(s2.implode())
  • NEW: s1.set_intersection(s2)

Expr (DataFrame)

  • OLD: pl.col("x1").implode().list.set_intersection(pl.col("x2").implode())
  • NEW: pl.col("x1").set_intersection("x2")
@Julian-J-S Julian-J-S added the enhancement New feature or an improvement of an existing feature label Nov 30, 2023
@cmdlineluser
Contributor

There was a recent SO question where the user wanted the "group intersection". https://stackoverflow.com/questions/77544923/

It did make me wonder if they could exist as "top-level" functions e.g.

pl.set_intersection(s1, s2, s3, s4)

(more than 2 inputs for intersection + union seems useful)

@deanm0000
Collaborator

deanm0000 commented Nov 30, 2023

@cmdlineluser regarding that question I made this issue. Your pl.set_intersection(s1, s2, s3, s4) can be handled now with pl.reduce(function=lambda acc, x: acc.list.set_intersection(x), exprs=[s1,s2,s3,s4])

I think if your syntax existed as a shortcut to the reduction, that would be more intuitive for sure, but the reduce does save you from manually nesting s1.list.set_intersection(s2).list.set_intersection(s3).list.set_intersection(s4)

@orlp
Collaborator

orlp commented Nov 30, 2023

These are... just joins.

Intersection is an inner join, difference is an anti join, union is an outer join + coalesce.

import polars as pl

s1 = pl.Series([1, 2, 3])
s2 = pl.Series([1, 2, 7, 8])

intersection = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="inner")
union = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="outer").select(pl.coalesce(pl.all()))
diff = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="anti")

print(intersection) # 1, 2
print(union) # 1, 2, 7, 8, 3
print(diff) # 3

@Julian-J-S
Contributor Author

> These are... just joins.

Well, this is not a real argument. One could also say there is no need for polars because you can just use python/rust/c++ or do it in machine code.
Libraries like polars are great because they provide an intuitive and performant API around common data tasks.

Using s1.union(s2) instead of s1.to_frame("x").join(s2.to_frame("x"), on="x", how="outer").select(pl.coalesce(pl.all())) makes a HUUUUUGEEE difference.

Imagine working in a cross-functional team of 10-20 people where many different people will have a look at the code.
It makes a gigantic difference if the code is easy to understand for almost everyone without being an expert in polars 😉

@orlp
Collaborator

orlp commented Nov 30, 2023

@JulianCologne My point is a reply to the original poster, who claims "these set operations are only available in the list namespace". That's not true: these operations are efficiently available on DataFrames without having to go through lists, because these operations are in fact joins.

We could consider adding syntactic sugar for them.

@deanm0000
Collaborator

My 2 cents is that s1.set_union(s2) might make sense if neither s1 nor s2 is itself a list dtype. So if it's exposed, then maybe it's just syntactic sugar for s1.implode().list.set_union(s2.implode()).explode()

For Exprs in a dataframe, I'm not so convinced that it makes sense to infer the implode. Surely if someone sees

intersection=pl.col("x1").implode().list.set_intersection(pl.col("x2").implode())

they'll see the set_intersection and just infer that the implodes are something that needs to be there.

IMO, it makes more sense for people to infer things than for the code base to do it.

If someone wants to do something with nested lists then the built in inference might mess that up.

@mcrumiller
Contributor

This request has been around for a long time. See #9908, #7647, #6947, #9908, etc. No need to argue about its importance: it's a super requested feature, it would definitely be nice, and I myself have had use for it. You can currently get around it if you have to, and it's good to keep "how to"s near the top. In fact, I wonder if we should have a "how-to" section in the documentation, or in issue pages somehow, for features that aren't yet implemented but have a lot of demand, like this one or the format() that keeps popping up.

I would love to help, but I have a full-time job and I just haven't been able to learn Rust or the code base well enough to help out in any significant manner. We have only a few Rust powerhouses in here (@orlp is one of them) and it's tough to have to rely on them for everything.

5 participants