sum in window function deals with NULLs differently after upgrade to latest polars version #10387

Closed
2 tasks done
MatthiasRoels opened this issue Aug 9, 2023 · 5 comments
Labels
bug Something isn't working python Related to Python Polars

MatthiasRoels commented Aug 9, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This is not so much a bug as a change in behaviour, which shows up when running the code below:

import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 2, 3, 3],
        "val": [None, 1, None, 2, 2]
    }
)

result = df.with_columns(pl.col("val").sum().over("id").alias("windowed_sum"))
result
>> shape: (5, 3)
┌─────┬──────┬──────────────┐
│ id  ┆ val  ┆ windowed_sum │
│ --- ┆ ---  ┆ ---          │
│ i64 ┆ i64  ┆ i64          │
╞═════╪══════╪══════════════╡
│ 1   ┆ null ┆ 0            │
│ 2   ┆ 1    ┆ 1            │
│ 2   ┆ null ┆ 1            │
│ 3   ┆ 2    ┆ 4            │
│ 3   ┆ 2    ┆ 4            │
└─────┴──────┴──────────────┘

Issue description

After upgrading from polars 0.18.4 to the latest version (0.18.13), I noticed a change in how NULL values are handled in a sum over a partition: a partition containing only NULLs now sums to 0 instead of NULL. The new behaviour differs from, e.g., pyspark, which behaves the way the old version did. Is this change intentional, or is it a regression introduced in one of the versions after 0.18.4?

Expected behavior

When running the snippet above, I expected to see the following result (note the difference in the last column of the first row):

import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 2, 3, 3],
        "val": [None, 1, None, 2, 2]
    }
)

result = df.with_columns(pl.col("val").sum().over("id").alias("windowed_sum"))
result
>> shape: (5, 3)
┌─────┬──────┬──────────────┐
│ id  ┆ val  ┆ windowed_sum │
│ --- ┆ ---  ┆ ---          │
│ i64 ┆ i64  ┆ i64          │
╞═════╪══════╪══════════════╡
│ 1   ┆ null ┆ null         │
│ 2   ┆ 1    ┆ 1            │
│ 2   ┆ null ┆ 1            │
│ 3   ┆ 2    ┆ 4            │
│ 3   ┆ 2    ┆ 4            │
└─────┴──────┴──────────────┘

Installed versions

New version:

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.13
Index type:          UInt32
Platform:            Linux-5.10.184-175.749.amzn2.x86_64-x86_64-with-glibc2.31
Python:              3.10.12 (main, Jul 28 2023, 05:51:21) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.2.1
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          3.7.1
numpy:               1.23.5
pandas:              1.5.3
pyarrow:             11.0.0
pydantic:            1.10.12
sqlalchemy:          2.0.19
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

Old version:

>>> pl.show_versions()
--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    Linux-5.10.184-175.749.amzn2.x86_64-x86_64-with-glibc2.31
Python:      3.10.12 (main, Jul  4 2023, 06:15:20) [GCC 10.2.1 20210110]

----Optional dependencies----
numpy:       1.23.5
pandas:      1.5.3
pyarrow:     11.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      2023.6.0
matplotlib:  3.7.1
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>

MatthiasRoels added the bug and python labels on Aug 9, 2023

jpteb commented Aug 9, 2023

This seems to be intended behavior, see: #5773 and #5604.

I spoke too soon and dug a little deeper, but it still seems to be intended behavior: #9558 and #9576.
It got released with 0.18.5.

Collaborator

orlp commented Aug 9, 2023

I think this is intended behavior. Regardless of what other systems do, sum() over a column of numbers returning null is nonsense. null indicates missing values, but a sum of a collection can never be 'missing': if the collection is empty or consists only of missing values, the sum is simply zero.
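
To make the two conventions discussed in this thread concrete, here is a minimal pure-Python sketch. It is not Polars code: the helper names `sum_nulls_as_zero` and `sum_all_null_as_null` are made up for illustration, and `None` stands in for null. The first helper mirrors the behaviour described here for the newer versions, the second mirrors what the issue reports for 0.18.4.

```python
from typing import Optional, Sequence


def sum_nulls_as_zero(values: Sequence[Optional[int]]) -> int:
    """Skip None entries; an empty or all-None input sums to 0."""
    return sum(v for v in values if v is not None)


def sum_all_null_as_null(values: Sequence[Optional[int]]) -> Optional[int]:
    """Skip None entries, but return None when no real value is present."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# The three partitions from the example: id=1, id=2, id=3.
groups = {1: [None], 2: [1, None], 3: [2, 2]}
zero_style = {k: sum_nulls_as_zero(v) for k, v in groups.items()}
null_style = {k: sum_all_null_as_null(v) for k, v in groups.items()}
```

The two conventions only disagree on the partition that contains no non-null value (id 1 here).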

orlp closed this as not planned on Aug 9, 2023
@MatthiasRoels
Author

Thanks for the quick reply and confirming it is indeed intentional! Do you then perhaps have a suggestion on how to modify the snippet above so that I get the same result as with v0.18.4?

You can close the issue btw 😉

Collaborator

orlp commented Aug 9, 2023

@MatthiasRoels

Do you then perhaps have a suggestion on how to modify the snippet above so that I get the same result as with v0.18.4?

def sum_empty_null(col):
    return pl.when((~col.is_null()).any()).then(col.sum())

result = df.with_columns(sum_empty_null(pl.col("val")).over("id"))

@MatthiasRoels
Author

Thanks a lot for the help!
