Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.list.to_struct() has non-deterministic behavior #16450

Open
2 tasks done
kevinli1993 opened this issue May 23, 2024 · 5 comments
Open
2 tasks done

.list.to_struct() has non-deterministic behavior #16450

kevinli1993 opened this issue May 23, 2024 · 5 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@kevinli1993
Copy link

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame(dict(value=[1,98,2,3,99,4], a=[1,1,1,2,2,1], b=['A','B','A','B','A','B']))

def repeat_me(ds):
    ## Bug: The operation below is non-deterministic
    result = (
        ds
        .group_by("a", "b")
        .agg(pl.col("value").bottom_k(3))
        .with_columns(pl.col("value").list.to_struct(upper_bound=3))
        .unnest('value')
    )
    return result.shape


# Run the function 500 times to see that two behaviors are possible.
possible_shapes = set()
for _ in range(500):
    possible_shapes.add(repeat_me(df))


print(possible_shapes)
## {(4, 3), (4, 4)}

Log output

No response

Issue description

The bug occurs either with upper_bound=3 specified or not.

That is, replacing

        .with_columns(pl.col("value").list.to_struct(upper_bound=3))

with

        .with_columns(pl.col("value").list.to_struct())

will also reproduce the bug.

Expected behavior

The bug is that both of the following outputs are possible.

In my opinion, the shape: (4, 4) ... result is correct; but it is difficult to say what's expected without knowing why the non-determinism occurs in the first place.

df.group_by("a", "b").agg(pl.col("value").bottom_k(3)).with_columns(pl.col("value").list.to_struct(upper_bound=3)).unnest('value')

shape: (4, 3)
┌─────┬─────┬─────────┐
│ a   ┆ b   ┆ field_0 │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ str ┆ i64     │
╞═════╪═════╪═════════╡
│ 2   ┆ A   ┆ 99      │
│ 1   ┆ B   ┆ 4       │
│ 1   ┆ A   ┆ 1       │
│ 2   ┆ B   ┆ 3       │
└─────┴─────┴─────────┘

shape: (4, 4)
┌─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ field_0 ┆ field_1 │
│ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ str ┆ i64     ┆ i64     │
╞═════╪═════╪═════════╪═════════╡
│ 1   ┆ A   ┆ 1       ┆ 2       │
│ 2   ┆ A   ┆ 99      ┆ null    │
│ 1   ┆ B   ┆ 4       ┆ 98      │
│ 2   ┆ B   ┆ 3       ┆ null    │
└─────┴─────┴─────────┴─────────┘

Installed versions

--------Version info---------
Polars:               0.20.28
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr 12 2024, 17:16:04) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


@kevinli1993 kevinli1993 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 23, 2024
@cmdlineluser
Copy link
Contributor

It's happening due to n_field_strategy

Because group_by is returning random order, the length of the first list changes.

df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ value     │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 2   ┆ A   ┆ [99]      │
# │ 2   ┆ B   ┆ [3]       │
# │ 1   ┆ A   ┆ [1, 2]    │
# │ 1   ┆ B   ┆ [4, 98]   │
# └─────┴─────┴───────────┘

df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ value     │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ A   ┆ [1, 2]    │
# │ 2   ┆ B   ┆ [3]       │
# │ 1   ┆ B   ┆ [4, 98]   │
# │ 2   ┆ A   ┆ [99]      │
# └─────┴─────┴───────────┘

You would need n_field_strategy="max_width"

.list.to_struct("max_width", upper_bound=3))

@kevinli1993
Copy link
Author

Makes sense now! I guess this means n_field_strategy=first_non_null is only useful when the number of elements is known to be the same.

Maybe (?) not a bug, but we might consider adding a warning to the documentation since it's a surprising consequence.

@cmdlineluser
Copy link
Contributor

Yeah - it is a bit of a footgun.

There was a Notes section added with an explanation - but perhaps a Warning should also be added pointing to that section.

@kevinli1993
Copy link
Author

kevinli1993 commented May 23, 2024

Agreed.

Would it make sense for .to_struct to accept an argument exact=n, so that the resulting final struct has exactly n fields (filling with null when necessary)?

It would behave similar to str.split and str.split_exact. (Maybe we could create a new function .to_struct_exact.)

@cmdlineluser
Copy link
Contributor

Yeah, something similar came up in #15742 recently.

When n is known, it's effectively:

pl.struct(
    field_0 = pl.col("value").list.get(0),
    field_1 = pl.col("value").list.get(1),
    field_2 = pl.col("value").list.get(2)
)

But not having to type all that out seems like it would be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

2 participants