Incorrect result when filtering a Parquet file on categorical columns #17744
Comments
Sorry... just a correction on the pasted-in versions: the bug is present in polars 1.2 and 1.2.1. In polars 1.1, the behavior is correct.
This issue is of course harder to solve with no example.
Have you got a repro? We cannot do anything if we can't reproduce it.
OK… I was able to create a reproducible example after all. Apologies for not trying in the first place! You might consider looking at #17475.
@ritchie46, @coastalwhite: looks like strings aren't being adapted to match categoricals:

import polars as pl

pl.DataFrame(
    data={
        "n": [1, 2, 3],
        "ccy": ["USD", "JPY", "EUR"],
    },
    schema_overrides={"ccy": pl.Categorical("lexical")},
).write_parquet(pq := "test.parquet")

This still works:

pl.scan_parquet(pq).filter(pl.col("ccy") == "USD").collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# │ 1   ┆ USD │
# └─────┴─────┘

This now fails:

pl.scan_parquet(pq).filter(pl.col("ccy").is_in(["USD"])).collect()
# shape: (0, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# └─────┴─────┘
I will take a look at this. I am currently pushing quite a big refactor through the Parquet code that affects this.
Thanks @alexander-beedie for the repro 👍
Checks
Reproducible example
import polars as pl
import numpy as np
import pandas as pd
import random
import string
# Helper functions to generate random data
def random_string(length=5):
    letters = string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(length))

def random_category(categories):
    return random.choice(categories)

def random_float():
    return random.uniform(1e6, 1e11)

# Define the categories for categorical columns
gics_industry_group = ["Materials", "Capital Goods"]
pricing_currency = ["USD", "CAD", "EUR", "AUD", "INR", "KRW"]
country = ["AU", "US", "CA", "JP", "IN", "KR"]
gics_sector_name = ["Materials", "Industrials"]
gics_industry_name = ["Metals & Mining", "Machinery", "Banks", "Retail"]
# Number of rows
num_rows = 42265

# Create the DataFrame
df = pl.DataFrame({
    "dt": np.random.randint(20240101, 20240630, size=num_rows),
    "global_id": np.random.randint(1, 11260000, size=num_rows),
    "gics_industry_group": [random_category(gics_industry_group) for _ in range(num_rows)],
    "pricing_currency": [random_category(pricing_currency) for _ in range(num_rows)],
    "ticker": [random_string() for _ in range(num_rows)],
    "country": [random_category(country) for _ in range(num_rows)],
    "gics_sector_name": [random_category(gics_sector_name) for _ in range(num_rows)],
    "market_cap_usd": [random_float() for _ in range(num_rows)],
    "name": ["Company " + random_string(10) for _ in range(num_rows)],
    "gics_industry_name": [random_category(gics_industry_name) for _ in range(num_rows)]
})
df = df.with_columns(pl.col('gics_industry_name').cast(pl.Categorical))
df.write_parquet('test_2.parquet')

# Filtering before collect -> Returns 0 rows (incorrect)
print(len(pl.scan_parquet('test_2.parquet').filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery'])).collect()))

# Filtering after collect -> Returns >0 rows (correct)
print(len(pl.scan_parquet('test_2.parquet').collect().filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery']))))
Log output
No response
Issue description
This seems to be a regression: this behavior worked for me in 1.1 and no longer works in 1.2. (Note: I observed the same bug many versions back, also as a regression, and it was subsequently corrected.)
I have a set of Parquet files on S3 that contain a categorical string column. If I do this, I get the expected outcome:
pl.scan_parquet(....).collect().filter(pl.col('my_categorical_column').is_in(['a category']))
However, if I do this (attempting to have the filter applied while the files are read):
pl.scan_parquet(....).filter(pl.col('my_categorical_column').is_in(['a category'])).collect()
then no rows are returned in the DataFrame.
Expected behavior
The results should be the same regardless of where collect is called. When run on a LazyFrame, the filter should retain the rows that match the predicate.
Installed versions