fix: Fix incorrect parquet statistics written for UInt64 values > Int64::MAX #16766

Merged (1 commit) on Jun 6, 2024



@nameexhaustion nameexhaustion commented Jun 6, 2024

Fixes #16683, #15323
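
For illustration only: the PR description does not spell out the root cause, so the following is a sketch under the assumption that the min/max statistics were derived by ordering the 64-bit patterns as signed integers. The helper `as_signed_i64` is hypothetical and is not Polars code. It shows how a UInt64 value above Int64::MAX would then sort below every small value, so the recorded minimum ends up greater than the recorded maximum.

```python
import struct


def as_signed_i64(value: int) -> int:
    """Reinterpret the bits of an unsigned 64-bit integer as a signed one."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# The second value exceeds Int64::MAX (2**63 - 1).
values = [1, 2**63 + 5]

# Correct statistics: compare the values as unsigned integers.
correct_min, correct_max = min(values), max(values)

# Assumed buggy path: compare the same bits as signed integers.
# 2**63 + 5 wraps to a large negative number, so it becomes the "min".
signed_min = min(values, key=as_signed_i64)
signed_max = max(values, key=as_signed_i64)

print("correct:", correct_min, correct_max)
print("signed ordering:", signed_min, signed_max)
```

Under this assumption the signed ordering reports the large value as the minimum and `1` as the maximum, which is exactly the kind of inverted range a reader could detect.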

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jun 6, 2024
@nameexhaustion
Collaborator Author

The test failure is caused by pip.


codspeed-hq bot commented Jun 6, 2024

CodSpeed Performance Report

Merging #16766 will improve performance by 17.4%

Comparing nameexhaustion:pq-statistics (7efd2fd) with main (2398b47)

Summary

⚡ 1 improvements
✅ 36 untouched benchmarks

Benchmarks breakdown

Benchmark              main    nameexhaustion:pq-statistics  Change
test_groupby_h2oai_q1  2.7 ms  2.3 ms                        +17.4%


codecov bot commented Jun 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.31%. Comparing base (a7f9c8d) to head (7efd2fd).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16766      +/-   ##
==========================================
- Coverage   81.32%   81.31%   -0.01%     
==========================================
  Files        1423     1424       +1     
  Lines      187177   187209      +32     
  Branches     2721     2726       +5     
==========================================
+ Hits       152214   152228      +14     
- Misses      34468    34484      +16     
- Partials      495      497       +2     


@ritchie46 ritchie46 merged commit b329894 into pola-rs:main Jun 6, 2024
28 of 29 checks passed
@stinodego stinodego changed the title fix: Incorrect parquet statistics written for UInt64 values > Int64::MAX fix: Fix incorrect parquet statistics written for UInt64 values > Int64::MAX Jun 6, 2024
@deanm0000
Collaborator

I don't mean this as criticism, but this doesn't really fix the issue, even though it avoids writing incorrect statistics in the future. The fix, I think, would be to add a check in the reader: if min_value is greater than max_value, ignore the stats and possibly issue a warning that they are invalid.
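
A minimal sketch of the reader-side check described above, assuming the statistics are available as a plain min/max pair. `ColumnStats` and `usable_stats` are hypothetical names used only for illustration; they are not part of the Polars or Parquet APIs.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical container for a column chunk's min/max statistics."""

    min_value: Optional[int]
    max_value: Optional[int]


def usable_stats(stats: ColumnStats) -> Optional[ColumnStats]:
    """Return the stats if they look valid, or None if they should be ignored."""
    # Missing bounds cannot be cross-checked, so pass them through unchanged.
    if stats.min_value is None or stats.max_value is None:
        return stats
    # An inverted range can never be correct: drop the stats and warn.
    if stats.min_value > stats.max_value:
        warnings.warn(
            "parquet statistics are invalid (min_value > max_value); ignoring them",
            stacklevel=2,
        )
        return None
    return stats
```

Returning `None` here means "behave as if the file had no statistics", so a predicate would be evaluated against the data itself rather than being used to skip the row group.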

@deanm0000
Collaborator

This is what I had in mind in my last comment: #16776

@nameexhaustion
Collaborator Author

That's a good point. I imagine there are parquet files in the wild with incorrect statistics (perhaps written by us 😝) that some basic validation could help with.


Successfully merging this pull request may close these issues.

Handle parquet files with incorrect statistics in scan_parquet
3 participants