Add set operations (intersection, difference, union, ...) to Series and Expr #12806
Comments
There was a recent SO question where the user wanted the "group intersection": https://stackoverflow.com/questions/77544923/ It did make me wonder if they could exist as "top-level" functions, e.g. pl.set_intersection(s1, s2, s3, s4) (more than 2 inputs for intersection + union seems useful).
@cmdlineluser regarding that question I made this issue. Your I think if your syntax existed as a shortcut to the reduction, that would be more intuitive for sure, but the reduce does save you from manually doing nested
These are... just joins. Intersection is an inner join, difference is an anti-join, union is an outer join + coalesce.
import polars as pl

s1 = pl.Series([1, 2, 3])
s2 = pl.Series([1, 2, 7, 8])
intersection = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="inner")
union = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="outer").select(pl.coalesce(pl.all()))
diff = s1.to_frame("x").join(s2.to_frame("x"), on="x", how="anti")
print(intersection) # 1, 2
print(union) # 1, 2, 7, 8, 3
print(diff) # 3
Well, this is no real argument. One could also say there is no need for polars because you can just use Python/Rust/C++ or do it in machine code. Using Imagine working in a cross-functional team of 10-20 people where many different people will have a look at the code.
@JulianCologne My point is a reply to the original poster, who claims "these set operations are only available in the list namespace". That's not true: these operations are efficiently available on DataFrames without having to go through lists, because these operations are in fact joins. We could consider adding syntactic sugar for them.
my 2 cents is that For Exprs in a dataframe, I'm not so convinced that it makes sense to infer the implode. Surely if someone sees
they'll see the IMO, it makes more sense for people to infer things than for the code base to do it. If someone wants to do something with nested lists, then the built-in inference might mess that up.
This request has been around for a long time. See #9908, #7647, #6947, #9908, etc. No need to argue about its importance: it's a super requested feature, it would definitely be nice, and I myself have had use for it. You can currently get around it if you have to, and it's good to keep "how to"s near the top. In fact, I wonder if we should have a "how-to" section in the documentation, or in issue pages somehow, for features that aren't yet implemented but have a lot of demand, like this one or the I would love to help, but I have a full-time job and I just haven't been able to learn Rust or the code base well enough to help out in any significant manner. We have only a few Rust powerhouses in here (@orlp is one of them) and it's tough to have to rely on them for everything.
Description
Currently the set operations are only available in the list namespace, which is usually used with columns of a list[...] type. However, these operations are also very useful directly on the Series and Expr level.
See examples below:
Having the set operations directly available would clean up the code a lot and make it much more readable and intuitive.
Series
s1.implode().list.set_intersection(s2.implode())
s1.set_intersection(s2)
Expr (DataFrame)
pl.col("x1").implode().list.set_intersection(pl.col("x2").implode())
pl.col("x1").set_intersection("x2")