Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Correct wildcard and input expansion for some more functions #19588

Merged
merged 1 commit into from
Nov 4, 2024

Conversation

siddharth-vi
Copy link
Contributor

Fixes #18795

Continuing from the previous PR (#19449), have fixed wildcard/input expansion for a few more functions by removing INPUT_WILDCARD_EXPANSION -

pl.datetime

Before -

df = pl.DataFrame({"a": [1], "b": [2]})
print(df.select(pl.datetime(year=pl.all(),month=pl.all(),day=pl.all()).name.keep() ))
shape: (1, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 0001-02-01 02:01:02 │
└─────────────────────┘

After -

df = pl.DataFrame({"a": [1], "b": [2]})
print(df.select(pl.datetime(year=pl.all(),month=pl.all(),day=pl.all()).name.keep() ))
shape: (1, 2)
┌─────────────────────┬─────────────────────┐
│ a                   ┆ b                   │
│ ---                 ┆ ---                 │
│ datetime[μs]        ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╡
│ 0001-01-01 00:00:00 ┆ 0002-02-02 00:00:00 │
└─────────────────────┴─────────────────────┘

arr.count_matches

Before -

df = pl.DataFrame({"a": [[1, 2]], "b": [[3, 4]]},schema={"a": pl.Array(pl.Int64, 2),"b": pl.Array(pl.Int64, 2)})
print(df.select(pl.all().arr.count_matches(3)))
PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `Int64`, got `array[i64, 2]`"))

After -

df = pl.DataFrame({"a": [[1, 2]], "b": [[3, 4]]},schema={"a": pl.Array(pl.Int64, 2),"b": pl.Array(pl.Int64, 2)})
print(df.select(pl.all().arr.count_matches(3)))
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 1   │
└─────┴─────┘

list.set_*

Before -

df = pl.DataFrame(
    {"a": [[1, 2, 3], [1, 1, 1], [4]], "b": [[4, 2, 1], [2, 1, 12], [4]],"c":[[1,2],[2,1,76],[8,9]]}
)
print(df.select(pl.col("a", "b").list.set_intersection("c")) )
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [1]       │
│ [4]       │
└───────────┘

After -

df = pl.DataFrame(
    {"a": [[1, 2, 3], [1, 1, 1], [4]], "b": [[4, 2, 1], [2, 1, 12], [4]],"c":[[1,2],[2,1,76],[8,9]]}
)
print(df.select(pl.col("a", "b").list.set_intersection("c")) )
shape: (3, 2)
┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1, 2]    ┆ [2, 1]    │
│ [1]       ┆ [2, 1]    │
│ []        ┆ []        │
└───────────┴───────────┘

Similar changes for other .list.set_* functions

Notes

  • One thing I noticed in the code was that the INPUT_WILDCARD_EXPANSION flag is used not only for specifying wildcard expansion but also for specifying how to deal with multiple input columns. Perhaps the name of the flag needs to be changed( INPUT_EXPANSION ? ) or atleast we can document it.
  • There are still a couple of places where INPUT_WILDCARD_EXPANSION might be wrongly set , but I was not able to figure out how to deal with those -
    1. .list.concat - here the code is shared with pl.concat_list, for the former we should remove the flag but we need it for the latter
    1. .struct.with_fields - Not sure how to deal with cases like pl.col(a,b).struct.with_fields(c,d) where there are multiple input columns on both sides.

@siddharth-vi siddharth-vi changed the title fix: Correct wildcard/input expansion for some more functions fix: Correct wildcard and input expansion for some more functions Nov 1, 2024
@siddharth-vi siddharth-vi changed the title fix: Correct wildcard and input expansion for some more functions fix: Correct wildcard and input expansion for some more functions Nov 1, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Nov 1, 2024
@siddharth-vi
Copy link
Contributor Author

@cmdlineluser FYI

@ritchie46
Copy link
Member

One thing I noticed in the code was that the INPUT_WILDCARD_EXPANSION flag is used not only for specifying wildcard expansion but also for specifying how to deal with multiple input columns. Perhaps the name of the flag needs to be changed( INPUT_EXPANSION ? ) or atleast we can document it.

Sure, we could rename it. Originally we only expanded wildcards, but it is quite a bit more nowadays. :)

.list.concat - here the code is shared with pl.concat_list, for the former we should remove the flag but we need it for the latter

Yes, they probably should be split at a high level (they should still dispatch to the same implementation when expanded)

.struct.with_fields - Not sure how to deal with cases like pl.col(a,b).struct.with_fields(c,d) where there are multiple input columns on both sides.

We don't support that yet. It adds a lot of complexity. In that case we only want to expand the pl.col case.

In any case, thanks a lot for the fixes! 🙌

@ritchie46 ritchie46 merged commit e52a598 into pola-rs:main Nov 4, 2024
25 of 28 checks passed
@cmdlineluser
Copy link
Contributor

Thanks again @siddharth-vi

Coincidentally, I just ran into similar behaviour with pl.format which I'm not too sure about:

df = pl.DataFrame({"a": ["foo", "bar"], "b": [1, 2], "c": ["py", "pl"]})

df.with_columns(
    pl.format("[{}]", pl.col(pl.String)).name.keep()
)

# shape: (2, 3)
# ┌─────────┬─────┬─────┐
# │ a       ┆ b   ┆ c   │
# │ ---     ┆ --- ┆ --- │
# │ str     ┆ i64 ┆ str │
# ╞═════════╪═════╪═════╡
# │ [foopy] ┆ 1   ┆ py  │ # should this be [foo] ┆ 1 ┆ [py] ?
# │ [barpl] ┆ 2   ┆ pl  │ # should this be [bar] ┆ 2 ┆ [pl] ?
# └─────────┴─────┴─────┘

Makes me wonder if potentially a pl.args() (pl.kwargs()) could exist to allow the user to control when to do this.

tylerriccio33 pushed a commit to tylerriccio33/polars that referenced this pull request Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

list.set_intersection operates on the wrong columns when multiple columns are selected with pl.col
3 participants