Add stable `Expr.top_k` #16596

MarcoGorelli · 2024-05-30T12:51:04Z

Description

In #10054, there was a request for a way to answer the query:

For each group 'd', find the rows corresponding to the top k values from column 'b'

One possible API could have been: df.top_k(k=k, by='b', group_by='d'), or, as the OP suggested, df.group_by('d').top_k(k=k, by='b').

The response was that

df.group_by('d').agg(pl.all().top_k(k=1, by='b'))

is enough, and that's what the top_k docs currently suggest.

However, because there are no ordering guarantees, ties in the by column mean this solution risks producing a result with rows which never appeared in the original dataframe: #10054 (comment)
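To make the failure mode concrete, here is a plain-Python simulation (not Polars code). It assumes, for illustration only, that each column's top-1 is picked independently and that ties may break differently from one column to the next; the helpers `pick_first_max` and `pick_last_max` are hypothetical stand-ins for those two tie-breaks.

```python
# Toy rows for a single group d='x'; the first two rows tie on 'b'.
rows = [
    {"d": "x", "a": 1, "b": 5, "c": "p"},
    {"d": "x", "a": 2, "b": 5, "c": "q"},
    {"d": "x", "a": 3, "b": 4, "c": "r"},
]


def pick_first_max(rows, by):
    """Index of the largest `by` value, breaking ties towards the earliest row."""
    return max(range(len(rows)), key=lambda i: rows[i][by])


def pick_last_max(rows, by):
    """Index of the largest `by` value, breaking ties towards the latest row."""
    return max(reversed(range(len(rows))), key=lambda i: rows[i][by])


# Assumption: column 'a' happens to use one tie-break and column 'c' the other.
i_a = pick_first_max(rows, "b")
i_c = pick_last_max(rows, "b")

stitched = {"d": "x", "a": rows[i_a]["a"], "b": rows[i_a]["b"], "c": rows[i_c]["c"]}

# The stitched row mixes values from two different source rows.
print(stitched, "in original:", stitched in rows)
```

The combined row takes `a` from one tied row and `c` from the other, so it does not correspond to any row of the input.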

This was discussed in #15238, and the suggestion is now to introduce a stable Expr.top_k. This would solve the original issue.
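As a rough sketch of the semantics a stable top-k would give for the original query, here is a plain-Python version. It is not the proposed Polars implementation; the function name `stable_top_k_by` and its parameters are hypothetical. It relies on Python's `sorted` being stable (also with `reverse=True`), so tied rows keep their original relative order and every returned row is a whole row from the input.

```python
from collections import defaultdict


def stable_top_k_by(rows, k, by, group_by):
    """Return, per group, the k whole rows with the largest `by` values.

    Ties are broken by original row order, because sorted() is stable.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)

    result = []
    for members in groups.values():  # groups appear in first-seen order
        ranked = sorted(members, key=lambda r: r[by], reverse=True)
        result.extend(ranked[:k])
    return result


rows = [
    {"d": "x", "a": 1, "b": 5, "c": "p"},
    {"d": "x", "a": 2, "b": 5, "c": "q"},  # ties with the row above on 'b'
    {"d": "y", "a": 3, "b": 4, "c": "r"},
    {"d": "y", "a": 4, "b": 7, "c": "s"},
]

top = stable_top_k_by(rows, k=1, by="b", group_by="d")
for row in top:
    print(row)
```

Because whole rows are selected rather than each column separately, the result cannot contain a combination of values that was absent from the input.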


MarcoGorelli added the enhancement label (New feature or an improvement of an existing feature) May 30, 2024

stinodego added the accepted (Ready for implementation) and A-ops (Area: operations) labels May 30, 2024

github-project-automation bot added this to Backlog May 30, 2024

github-project-automation bot moved this to Ready in Backlog May 30, 2024
