Incorrect result when filtering a Parquet file on categorical columns #17744

agossard · 2024-07-19T19:24:27Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np
import pandas as pd
import random
import string

Helper functions to generate random data

def random_string(length=5):
letters = string.ascii_uppercase
return ''.join(random.choice(letters) for i in range(length))

def random_category(categories):
return random.choice(categories)

def random_float():
return random.uniform(1e6, 1e11)

Define the categories for categorical columns

gics_industry_group = ["Materials", "Capital Goods"]
pricing_currency = ["USD", "CAD", "EUR", "AUD", "INR", "KRW"]
country = ["AU", "US", "CA", "JP", "IN", "KR"]
gics_sector_name = ["Materials", "Industrials"]
gics_industry_name = ["Metals & Mining", "Machinery", "Banks", "Retail"]

Number of rows

num_rows = 42265

Create the DataFrame

df = pl.DataFrame({
"dt": np.random.randint(20240101, 20240630, size=num_rows),
"global_id": np.random.randint(1, 11260000, size=num_rows),
"gics_industry_group": [random_category(gics_industry_group) for _ in range(num_rows)],
"pricing_currency": [random_category(pricing_currency) for _ in range(num_rows)],
"ticker": [random_string() for _ in range(num_rows)],
"country": [random_category(country) for _ in range(num_rows)],
"gics_sector_name": [random_category(gics_sector_name) for _ in range(num_rows)],
"market_cap_usd": [random_float() for _ in range(num_rows)],
"name": ["Company " + random_string(10) for _ in range(num_rows)],
"gics_industry_name": [random_category(gics_industry_name) for _ in range(num_rows)]
})

df = df.with_columns(pl.col('gics_industry_name').cast(pl.Categorical))

df.write_parquet('test_2.parquet')

Filtering before collect -> Returns 0 rows (incorrect)

print(len(pl.scan_parquet('test_2.parquet').filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery'])).collect()))

Filtering after collect -> Returns >0 rows (correct)

print(len(pl.scan_parquet('test_2.parquet').collect().filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery']))))

Log output

No response

Issue description

This seems to be a regression, as this behavior was working for me in 1.1 and is no longer working in 1.2. (Note, I observed this same bug many versions back, also a regression, which was subsequently corrected).

I have a set of parquet files on s3, which contain a column which is a categorical string. If I do this, I get the expected outcome:

pl.scan_parquet(....).collect().filter(pl.col('my_categorical_column').is_in(['a category']))

However, if I do this (attempting to get the filter applied during the reading process)

pl.scan_parquet(....).filter(pl.col('my_categorical_column').is_in(['a category'])).collect()

Then I get no rows returned in the data frame.

Expected behavior

The results should be the same regardless of where collect is called. The filter when run with lazy frame should retain rows that pass the filter.

Installed versions

--------Version info---------

Polars:               1.1.0

Index type:           UInt32

Platform:             Linux-4.14.281-212.502.amzn2.x86_64-x86_64-with-glibc2.26

Python:               3.11.5 (main, Sep 11 2023, 13:23:44) [GCC 11.2.0]

 

----Optional dependencies----

adbc_driver_manager:  <not installed>

cloudpickle:          2.2.1

connectorx:           <not installed>

deltalake:            <not installed>

fastexcel:            <not installed>

fsspec:               2023.10.0

gevent:               <not installed>

great_tables:         <not installed>

hvplot:               0.9.1

matplotlib:           3.8.4

nest_asyncio:         1.5.6

numpy:                1.26.3

openpyxl:             3.1.2

pandas:               2.0.3

pyarrow:              15.0.0

pydantic:             2.5.3

pyiceberg:            <not installed>

sqlalchemy:           1.4.52

torch:                2.3.0

xlsx2csv:             <not installed>

xlsxwriter:           3.1.1

The text was updated successfully, but these errors were encountered:

agossard · 2024-07-20T01:35:09Z

Sorry... just a correction on the pasted in versions. The bug is present in polars 1.2 and 1.2.1. In polars 1.1, the behavior is correct.

alonme · 2024-07-20T18:10:08Z

This issue is of course harder to solve with no example,
Does this still happen if you filter the parquet to a single line?
Does it happen if you remove the other columns?

ritchie46 · 2024-07-21T12:51:06Z

Have you got a repro? We cannot do anything if we cant reproduce it.

agossard · 2024-07-22T11:27:45Z

Ok… I was able to create a reproducible example actually… apologies for not trying in the first place!

might consider looking at #17475

alexander-beedie · 2024-07-25T10:47:31Z

Have you got a repro? We cannot do anything if we cant reproduce it.

@ritchie46, @coastalwhite: looks like strings aren't being adapted to match categoricals with is_in filters any more. Here's a simple/streamlined repro 👌

import polars as pl

pl.DataFrame(
    data={
        "n": [1,2,3],
        "ccy": ["USD", "JPY", "EUR"],
    },
    schema_overrides={"ccy": pl.Categorical("lexical")},
).write_parquet(pq := "test.parquet")

This still works (==):

pl.scan_parquet(pq).filter(pl.col("ccy") == "USD").collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# │ 1   ┆ USD │
# └─────┴─────┘

This now fails (is_in):

pl.scan_parquet(pq).filter(pl.col("ccy").is_in(["USD"])).collect()
# shape: (0, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# └─────┴─────┘

coastalwhite · 2024-07-25T13:26:06Z

I will take a look at this. I am currently throwing quite a bit refactor over the parquet code that effects Categoricals as well. So probably after that.

nameexhaustion · 2024-07-30T22:45:49Z

Thanks @alexander-beedie for the repro 👍

agossard added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 19, 2024

ritchie46 added the needs repro Bug does not yet have a reproducible example label Jul 21, 2024

alexander-beedie added A-io-parquet Area: reading/writing Parquet files and removed needs repro Bug does not yet have a reproducible example labels Jul 25, 2024

agossard mentioned this issue Jul 30, 2024

Dataframe wrong filtering with scan_parquet and Categorical data #17944

Closed

2 tasks

nameexhaustion added accepted Ready for implementation P-high Priority: high and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jul 30, 2024

github-project-automation bot added this to Backlog Jul 30, 2024

github-project-automation bot moved this to Ready in Backlog Jul 30, 2024

nameexhaustion added the A-optimizer Area: plan optimization label Jul 30, 2024

nameexhaustion changed the title ~~Problem filtering categorical string columns with lazy frame and scan_parquet~~ Incorrect result when filtering a Parquet file on categorical columns Jul 30, 2024

nameexhaustion mentioned this issue Jul 30, 2024

fix: Incorrect filter on categorical columns from parquet files #17950

Merged

ritchie46 closed this as completed in #17950 Jul 31, 2024

github-project-automation bot moved this from Ready to Done in Backlog Jul 31, 2024

c-peters assigned nameexhaustion Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect result when filtering a Parquet file on categorical columns #17744

Incorrect result when filtering a Parquet file on categorical columns #17744

agossard commented Jul 19, 2024 •

edited

Loading

agossard commented Jul 20, 2024

alonme commented Jul 20, 2024

ritchie46 commented Jul 21, 2024

agossard commented Jul 22, 2024

alexander-beedie commented Jul 25, 2024 •

edited

Loading

coastalwhite commented Jul 25, 2024

nameexhaustion commented Jul 30, 2024

Incorrect result when filtering a Parquet file on categorical columns #17744

Incorrect result when filtering a Parquet file on categorical columns #17744

Comments

agossard commented Jul 19, 2024 • edited Loading

Checks

Reproducible example

Helper functions to generate random data

Define the categories for categorical columns

Number of rows

Create the DataFrame

Filtering before collect -> Returns 0 rows (incorrect)

Filtering after collect -> Returns >0 rows (correct)

Log output

Issue description

Expected behavior

Installed versions

agossard commented Jul 20, 2024

alonme commented Jul 20, 2024

ritchie46 commented Jul 21, 2024

agossard commented Jul 22, 2024

alexander-beedie commented Jul 25, 2024 • edited Loading

coastalwhite commented Jul 25, 2024

nameexhaustion commented Jul 30, 2024

agossard commented Jul 19, 2024 •

edited

Loading

alexander-beedie commented Jul 25, 2024 •

edited

Loading