list.len() reports len of a list that no longer exists #19987

Closed
2 tasks done
aofarrel opened this issue Nov 25, 2024 · 2 comments · Fixed by #19999
Labels: accepted, bug, P-high, python


@aofarrel

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame({
	"a": [[1], None],
	"b": [[2, 4], [4, 5, None]]
})
df = df.with_columns(a_len=pl.col("a").list.len())
df = df.with_columns(b_len=pl.col("b").list.len())
print("This is as expected, since list.len() counts nulls inside of existing lists")
print(df.select(['a', 'a_len', 'b', 'b_len']))

df = df.with_columns([
	pl.when(pl.col("a").list.len() <= 1).then(pl.col("a")).otherwise(None).alias("nulled_a"),
	pl.when(pl.col("b").list.len() <= 1).then(pl.col("b")).otherwise(None).alias("nulled_b"),
])
df = df.with_columns(nulled_a_len=pl.col("nulled_a").list.len())
df = df.with_columns(nulled_b_len=pl.col("nulled_b").list.len())
print("\n\nThis is weird, since it seems that list.len() is counting values that no longer exist")
print(df.select(['nulled_a', 'nulled_a_len', 'nulled_b', 'nulled_b_len']))

Log output

This is as expected, since list.len() counts nulls
shape: (2, 4)
┌───────────┬───────┬──────────────┬───────┐
│ a         ┆ a_len ┆ b            ┆ b_len │
│ ---       ┆ ---   ┆ ---          ┆ ---   │
│ list[i64] ┆ u32   ┆ list[i64]    ┆ u32   │
╞═══════════╪═══════╪══════════════╪═══════╡
│ [1]       ┆ 1     ┆ [2, 4]       ┆ 2     │
│ null      ┆ 0     ┆ [4, 5, null] ┆ 3     │
└───────────┴───────┴──────────────┴───────┘


This is weird, since it seems that list.len() is counting values that no longer exist
shape: (2, 4)
┌───────────┬──────────────┬───────────┬──────────────┐
│ nulled_a  ┆ nulled_a_len ┆ nulled_b  ┆ nulled_b_len │
│ ---       ┆ ---          ┆ ---       ┆ ---          │
│ list[i64] ┆ u32          ┆ list[i64] ┆ u32          │
╞═══════════╪══════════════╪═══════════╪══════════════╡
│ [1]       ┆ 1            ┆ null      ┆ 2            │
│ null      ┆ 0            ┆ null      ┆ 3            │
└───────────┴──────────────┴───────────┴──────────────┘

Issue description

Completely overwriting a list as pl.Null results in the list's length being reported as what it was prior to the overwrite.
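
The thread does not state the root cause, so the following is only a plain-Python sketch of one plausible mechanism, assuming an Arrow-style list layout (flattened values, row offsets, and a separate validity mask). The `ListColumn` class and its method names are hypothetical and are not Polars internals. If nulling a row only clears its validity flag while the offsets stay in place, a length computed from offsets alone would still report the old sizes:

```python
from dataclasses import dataclass


@dataclass
class ListColumn:
    """Hypothetical model of an Arrow-style list column (not Polars code)."""

    values: list    # flattened child values of all rows
    offsets: list   # row i spans values[offsets[i]:offsets[i + 1]]
    validity: list  # False marks a row that is null

    def len_from_offsets(self):
        # Length derived from offsets only; ignores the validity mask.
        return [
            self.offsets[i + 1] - self.offsets[i]
            for i in range(len(self.validity))
        ]

    def len_respecting_validity(self):
        # Length that treats a null row as having no elements.
        return [
            n if ok else 0
            for n, ok in zip(self.len_from_offsets(), self.validity)
        ]


# Column "b" from the example after both rows were overwritten with null:
# the old values and offsets are assumed to still be present.
col = ListColumn(
    values=[2, 4, 4, 5, None],
    offsets=[0, 2, 5],
    validity=[False, False],
)

naive = col.len_from_offsets()
masked = col.len_respecting_validity()
print(naive)   # [2, 3]
print(masked)  # [0, 0]
```

The unmasked result matches the 2 and 3 in the log output above, which is why this layout is a reasonable guess, but it remains an assumption.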

Related but not quite the same: #18522

My use case: I am compiling genomic metadata across about a dozen studies. A lot of studies have internally conflicting metadata due to types, e.g. saying sample "SAMN0001" is both resistant and not-resistant to some antibiotic. If an antibiotic's column is a list of length two or more, I want to throw out that row's metadata by overwriting the list at that column with pl.Null. Right now, it seems that can't be done (but there might be a workaround by chaining a few more expressions?)

Expected behavior

A list's length should only be the length of what's actually in it, including null values. When a list is overwritten to be a single pl.Null value, the length of that "list" should be 0, not what it was prior. In other words:

┌───────────┬──────────────┬───────────┬──────────────┐
│ nulled_a  ┆ nulled_a_len ┆ nulled_b  ┆ nulled_b_len │
│ ---       ┆ ---          ┆ ---       ┆ ---          │
│ list[i64] ┆ u32          ┆ list[i64] ┆ u32          │
╞═══════════╪══════════════╪═══════════╪══════════════╡
│ [1]       ┆ 1            ┆ null      ┆ 0            │
│ null      ┆ 0            ┆ null      ┆ 0            │
└───────────┴──────────────┴───────────┴──────────────┘

If pl.when(pl.col("b").list.len() <= 1).then(pl.col("b")).otherwise(None).alias("nulled_b") is actually setting the value to something like [pl.Null, pl.Null] instead of pl.Null, that raises some additional issues:

  • [pl.Null, pl.Null] is being printed as null instead of [null, null], which is neither clear nor consistent with how [1, 2, null] gets printed
  • It seems to imply there isn't a way to set a list into something with length 0, unless it was defined that way during dataframe creation (like the second row of "a" in the example)

Installed versions

--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            macOS-13.6.7-x86_64-i386-64bit
Python:              3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          1.35.0
great_tables         <not installed>
matplotlib           3.8.2
nest_asyncio         <not installed>
numpy                1.25.2
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@aofarrel added the bug, needs triage, and python labels on Nov 25, 2024
@MarcoGorelli
Collaborator

thanks @aofarrel for the report

In [3]: df['nulled_b']
Out[3]:
shape: (2,)
Series: 'nulled_b' [list[i64]]
[
        null
        null
]

In [4]: df['nulled_b'].list.len()
Out[4]:
shape: (2,)
Series: 'nulled_b' [u32]
[
        2
        3
]

🤔 this does look very odd, gonna mark as high prio

@MarcoGorelli added the P-high label and removed the needs triage label on Nov 25, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Nov 25, 2024
@nameexhaustion nameexhaustion self-assigned this Nov 26, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Nov 26, 2024
@aofarrel
Author

Thanks for the quick fix! Much appreciated.

@c-peters added the accepted label on Dec 2, 2024