list.len() reports len of a list that no longer exists #19987

aofarrel · 2024-11-25T21:20:55Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame({
	"a": [[1], None],
	"b": [[2, 4], [4, 5, None]]
})
df = df.with_columns(a_len=pl.col("a").list.len())
df = df.with_columns(b_len=pl.col("b").list.len())
print("This is as expected, since list.len() counts nulls inside of existing lists")
print(df.select(['a', 'a_len', 'b', 'b_len']))

df = df.with_columns([
	pl.when(pl.col("a").list.len() <= 1).then(pl.col("a")).otherwise(None).alias("nulled_a"),
	pl.when(pl.col("b").list.len() <= 1).then(pl.col("b")).otherwise(None).alias("nulled_b"),
])
df = df.with_columns(nulled_a_len=pl.col("nulled_a").list.len())
df = df.with_columns(nulled_b_len=pl.col("nulled_b").list.len())
print("\n\nThis is weird, since it seems that list.len() is counting values that no longer exist")
print(df.select(['nulled_a', 'nulled_a_len', 'nulled_b', 'nulled_b_len']))

Log output

This is as expected, since list.len() counts nulls
shape: (2, 4)
┌───────────┬───────┬──────────────┬───────┐
│ a         ┆ a_len ┆ b            ┆ b_len │
│ ---       ┆ ---   ┆ ---          ┆ ---   │
│ list[i64] ┆ u32   ┆ list[i64]    ┆ u32   │
╞═══════════╪═══════╪══════════════╪═══════╡
│ [1]       ┆ 1     ┆ [2, 4]       ┆ 2     │
│ null      ┆ 0     ┆ [4, 5, null] ┆ 3     │
└───────────┴───────┴──────────────┴───────┘


This is weird, since it seems that list.len() is counting values that no longer exist
shape: (2, 4)
┌───────────┬──────────────┬───────────┬──────────────┐
│ nulled_a  ┆ nulled_a_len ┆ nulled_b  ┆ nulled_b_len │
│ ---       ┆ ---          ┆ ---       ┆ ---          │
│ list[i64] ┆ u32          ┆ list[i64] ┆ u32          │
╞═══════════╪══════════════╪═══════════╪══════════════╡
│ [1]       ┆ 1            ┆ null      ┆ 2            │
│ null      ┆ 0            ┆ null      ┆ 3            │
└───────────┴──────────────┴───────────┴──────────────┘

Issue description

Completely overwriting a list with pl.Null results in list.len() still reporting the length the list had prior to the overwrite.

Related but not quite the same: #18522

My use case: I am compiling genomic metadata across about a dozen studies. A lot of studies have internally conflicting metadata due to types, e.g. saying sample "SAMN0001" is both resistant and not-resistant to some antibiotic. If an antibiotic's column is a list of length two or more, I want to throw out that row's metadata by overwriting the list at that column with pl.Null. Right now, it seems that can't be done (but there might be a workaround by chaining a few more expressions?)

Expected behavior

A list's length should only be the length of what's actually in it, including null values. When a list is overwritten to be a single pl.Null value, the length of that "list" should be 0, not what it was prior. In other words:

┌───────────┬──────────────┬───────────┬──────────────┐
│ nulled_a  ┆ nulled_a_len ┆ nulled_b  ┆ nulled_b_len │
│ ---       ┆ ---          ┆ ---       ┆ ---          │
│ list[i64] ┆ u32          ┆ list[i64] ┆ u32          │
╞═══════════╪══════════════╪═══════════╪══════════════╡
│ [1]       ┆ 1            ┆ null      ┆ 0            │
│ null      ┆ 0            ┆ null      ┆ 0            │
└───────────┴──────────────┴───────────┴──────────────┘

If pl.when(pl.col("b").list.len() <= 1).then(pl.col("b")).otherwise(None).alias("nulled_b") is actually setting the value to something like [pl.Null, pl.Null] instead of pl.Null, that raises some additional issues:

[pl.Null, pl.Null] is being printed as null instead of [null, null], which isn't clear, nor is it consistent with how [1, 2, null] gets printed
It seems to imply there isn't a way to set a list into something with length 0, unless it was defined that way during dataframe creation (like the second row of "a" in the example)
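A possible explanation for the two points above, offered as an assumption and not as confirmed Polars internals: in an Arrow-style list column, each row's extent comes from an offsets buffer, while null-ness is tracked separately in a validity mask. If a length is computed from offsets alone, a masked-out row can still report its old extent even though it prints as null. The toy ListArray class below is purely illustrative and is not a Polars or pyarrow type.

```python
from dataclasses import dataclass


@dataclass
class ListArray:
    """Toy model of an Arrow-style list column (illustrative only)."""
    values: list    # flat child values for all rows
    offsets: list   # row i spans values[offsets[i]:offsets[i + 1]]
    validity: list  # one bool per row; False means the row is null

    def len_from_offsets(self) -> list:
        # Ignores validity, so masked-out rows keep their old extent.
        return [self.offsets[i + 1] - self.offsets[i] for i in range(len(self.validity))]

    def len_respecting_validity(self) -> list:
        # Treats null rows as length 0, matching the expected behavior in this issue.
        return [n if ok else 0 for n, ok in zip(self.len_from_offsets(), self.validity)]


# Column "b" from the example: [[2, 4], [4, 5, None]]
b = ListArray(values=[2, 4, 4, 5, None], offsets=[0, 2, 5], validity=[True, True])

# Masking out both rows only flips validity; values and offsets are untouched.
nulled_b = ListArray(values=b.values, offsets=b.offsets, validity=[False, False])

stale = nulled_b.len_from_offsets()
masked = nulled_b.len_respecting_validity()
print(stale)   # [2, 3]
print(masked)  # [0, 0]
```

Under this model the rows really are null, not [null, null]; only the length calculation is looking at leftover offsets.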

Installed versions

--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            macOS-13.6.7-x86_64-i386-64bit
Python:              3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          1.35.0
great_tables         <not installed>
matplotlib           3.8.2
nest_asyncio         <not installed>
numpy                1.25.2
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


MarcoGorelli · 2024-11-25T21:53:14Z

thanks @aofarrel for the report

In [3]: df['nulled_b']
Out[3]:
shape: (2,)
Series: 'nulled_b' [list[i64]]
[
        null
        null
]

In [4]: df['nulled_b'].list.len()
Out[4]:
shape: (2,)
Series: 'nulled_b' [u32]
[
        2
        3
]

🤔 this does look very odd, gonna mark as high prio

aofarrel · 2024-11-26T18:52:36Z

Thanks for the quick fix! Much appreciated.

aofarrel added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels Nov 25, 2024

MarcoGorelli added the P-high (Priority: high) label and removed the needs triage (Awaiting prioritization by a maintainer) label Nov 25, 2024

github-project-automation bot added this to Backlog Nov 25, 2024

github-project-automation bot moved this to Ready in Backlog Nov 25, 2024

nameexhaustion self-assigned this Nov 26, 2024

nameexhaustion mentioned this issue Nov 26, 2024

fix: Incorrectly gave list.len() for masked-out rows #19999

Merged

ritchie46 closed this as completed in #19999 Nov 26, 2024

github-project-automation bot moved this from Ready to Done in Backlog Nov 26, 2024

c-peters added the accepted (Ready for implementation) label Dec 2, 2024
