Using Slice in Polars Rolling Agg #17198

Open
2 tasks done
jackaixin opened this issue Jun 25, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

jackaixin commented Jun 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame({
    't_idx': [1, 2, 3, 4, 5, 6],
}).rolling('t_idx', period='5i').agg(
    -pl.col('t_idx').len().cast(pl.Int64).alias('start'),
    pl.col('t_idx').count().alias('end'),
    # slice(offset, length): a negative offset counts from the end of the Series
    pl.lit(pl.Series([.1, .2, .3, .4, .5])).slice(
        -pl.col('t_idx').len().cast(pl.Int64),
        pl.col('t_idx').count(),
    ).alias('new'),
)

Log output

thread 'polars-4' panicked at crates/polars-core/src/frame/group_by/aggregations/agg_list.rs:109:58:
range end index 6 out of range for slice of length 5
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[828], line 3
      1 pl.DataFrame({
      2     't_idx': [1 ,2 ,3, 4, 5, 6],
----> 3 }).rolling('t_idx', period='5i').agg(
      4     -pl.col('t_idx').len().cast(pl.Int64).alias('start'),
      5     pl.col('t_idx').count().alias('end'),
      6     pl.lit(pl.Series([.1, .2, .3, .4, .5])).slice(-pl.col('t_idx').len().cast(pl.Int64), pl.col('t_idx').count()).alias('new')
      7 )

File ~/virtual_environments/vve_3_11_6/lib/python3.11/site-packages/polars/dataframe/group_by.py:896, in RollingGroupBy.agg(self, *aggs, **named_aggs)
    868 def agg(
    869     self,
    870     *aggs: IntoExpr | Iterable[IntoExpr],
    871     **named_aggs: IntoExpr,
    872 ) -> DataFrame:
    873     """
    874     Compute aggregations for each group of a group by operation.
    875 
   (...)
    884         The resulting columns will be renamed to the keyword used.
    885     """
    886     return (
    887         self.df.lazy()
    888         .rolling(
    889             index_column=self.time_column,
    890             period=self.period,
    891             offset=self.offset,
    892             closed=self.closed,
    893             group_by=self.group_by,
    894         )
    895         .agg(*aggs, **named_aggs)
--> 896         .collect(no_optimization=True)
    897     )

File ~/virtual_environments/vve_3_11_6/lib/python3.11/site-packages/polars/lazyframe/frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1964 # Only for testing purposes atm.
   1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))

PanicException: range end index 6 out of range for slice of length 5

Issue description

I was trying to slice a weight Series based on the length of the rolling window. One use case is applying custom weights to a rolling average. (I understand there's already a rolling_mean implemented.)

Let's say my rolling window has a length of 5. The windows for the first 4 rows have lengths 1, 2, 3, and 4, so for those rows I only want the trailing weights that fit the window's length. From the fifth row on, I expect the weights to always be [.1, .2, .3, .4, .5].
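
To make the expectation concrete, here is a minimal pure-Python sketch of the slice(offset, length) semantics I have in mind, assuming a negative offset counts from the end of the Series. The helper name slice_model is hypothetical and is not part of Polars; it only models in-range offsets.

```python
def slice_model(values, offset, length):
    # Hypothetical helper, not a Polars API: models slice(offset, length)
    # for in-range offsets only. A negative offset counts from the end.
    start = len(values) + offset if offset < 0 else offset
    return values[start:start + length]


weights = [0.1, 0.2, 0.3, 0.4, 0.5]

# One entry per window length n = 1..5, mirroring slice(-n, n) in the example.
expected = {n: slice_model(weights, -n, n) for n in range(1, 6)}

for n, w in expected.items():
    print(n, w)
```

Each window of length n should therefore receive the last n weights, and a full window should receive all five.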

Expected behavior

I ran the following code:

pl.DataFrame({
    't_idx': [1, 2, 3, 4, 5],
}).rolling('t_idx', period='5i').agg(
    -pl.col('t_idx').len().cast(pl.Int64).alias('start'),
    pl.col('t_idx').count().alias('end'),
    pl.lit(pl.Series([.1, .2, .3, .4, .5])).slice(
        -pl.col('t_idx').len().cast(pl.Int64),
        pl.col('t_idx').count(),
    ).alias('new'),
)

I expect the following:

shape: (5, 4)
┌───────┬───────┬─────┬───────────────────┐
│ t_idx ┆ start ┆ end ┆ new               │
│ ---   ┆ ---   ┆ --- ┆ ---               │
│ i64   ┆ i64   ┆ u32 ┆ list[f64]         │
╞═══════╪═══════╪═════╪═══════════════════╡
│ 1     ┆ -1    ┆ 1   ┆ [0.5]             │
│ 2     ┆ -2    ┆ 2   ┆ [0.4, 0.5]        │
│ 3     ┆ -3    ┆ 3   ┆ [0.3, 0.4, 0.5]   │
│ 4     ┆ -4    ┆ 4   ┆ [0.2, 0.3, … 0.5] │
│ 5     ┆ -5    ┆ 5   ┆ [0.1, 0.2, … 0.5] │
└───────┴───────┴─────┴───────────────────┘

However what I got was:

shape: (5, 4)
┌───────┬───────┬─────┬───────────────────┐
│ t_idx ┆ start ┆ end ┆ new               │
│ ---   ┆ ---   ┆ --- ┆ ---               │
│ i64   ┆ i64   ┆ u32 ┆ list[f64]         │
╞═══════╪═══════╪═════╪═══════════════════╡
│ 1     ┆ -1    ┆ 1   ┆ [0.1]             │
│ 2     ┆ -2    ┆ 2   ┆ [0.1, 0.2]        │
│ 3     ┆ -3    ┆ 3   ┆ [0.1, 0.2, 0.3]   │
│ 4     ┆ -4    ┆ 4   ┆ [0.1, 0.2, … 0.4] │
│ 5     ┆ -5    ┆ 5   ┆ [0.1, 0.2, … 0.5] │
└───────┴───────┴─────┴───────────────────┘

When I extended the original DataFrame to 6 rows, I got the PanicException (range end index 6 out of range for slice of length 5) shown in the log output above.

pl.DataFrame({
    't_idx': [1, 2, 3, 4, 5, 6],
}).rolling('t_idx', period='5i').agg(
    -pl.col('t_idx').len().cast(pl.Int64).alias('start'),
    pl.col('t_idx').count().alias('end'),
    pl.lit(pl.Series([.1, .2, .3, .4, .5])).slice(
        -pl.col('t_idx').len().cast(pl.Int64),
        pl.col('t_idx').count(),
    ).alias('new'),
)
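
For reference, the slices requested in the 6-row case can be checked in plain Python. This is only a sketch: it assumes each rolling window covers the interval (t - 5, t], which is what the default closed='right' setting should give, and window_lengths is a hypothetical helper rather than anything from Polars.

```python
def window_lengths(index, period):
    # Hypothetical helper: counts the rows whose index falls in (t - period, t]
    # for each t, assuming a right-closed window.
    return [sum(1 for s in index if t - period < s <= t) for t in index]


weights = [0.1, 0.2, 0.3, 0.4, 0.5]
t_idx = [1, 2, 3, 4, 5, 6]

lengths = window_lengths(t_idx, 5)

# Each row asks for slice(-n, n), where n is that row's window length.
requests = [(-n, n) for n in lengths]

# True when no request asks for more elements than the weights hold.
all_in_range = all(n <= len(weights) for n in lengths)

print(lengths)
print(all_in_range)
```

Under that assumption no window is longer than the 5-element weights Series, so none of the requested slices should reach index 6 on its own.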

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.6.0
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

cmdlineluser (Contributor) commented

Can reproduce on 1.0.0-rc.2

pl.DataFrame({
    'idx': [1, 2]
}).rolling('idx', period='2i').agg(
    pl.lit(pl.Series([1])).slice(-1, 1)
)

# PanicException: range end index 2 out of range for slice of length 1

I'm not sure whether it is the same underlying issue, but if I remove the pl.Series the call never returns and I have to kill the process. Memory usage seems to keep growing until then.

pl.DataFrame({
    'idx': [1, 2]
}).rolling('idx', period='2i').agg(
    pl.lit([1]).slice(-1, 1)
)
# never returns

cmdlineluser (Contributor) commented

Your example now runs on main.
