Parquet min_max statistics cannot be read by pyarrow
#799
Comments
@tfiasco I think this is due to the C++ implementation only using min_value and max_value if a column order is set: https://github.com/apache/arrow/blob/54460d96ba1d613e472d8d9a96c072147e736b4d/cpp/src/parquet/metadata.cc#L82
This prints
So it looks like the C++ implementation is just ignoring the current stats values. While digging through the code to figure this out, I found a comment in parquet_format saying that without column_orders the meaning of min_value and max_value is undefined. If the comment is accurate, it seems like a bug in the current implementation that min_value and max_value are being used the way they are. The comment in question is:
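To make the fallback described in this comment concrete, here is a minimal Python sketch of the reader-side decision. It is a simplified model based only on the description above, not a transcription of the Arrow C++ code. The names `Stats`, `readable_min_max`, and `column_order_set` are hypothetical. The assumption is that a reader falls back to the deprecated `min`/`max` fields when no column order is set.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Stats:
    """Hypothetical stand-in for the statistics stored in column metadata."""
    # Deprecated fields.
    min: Optional[bytes] = None
    max: Optional[bytes] = None
    # Newer fields; only meaningful when a column order is written.
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None


def readable_min_max(stats: Stats, column_order_set: bool) -> Optional[Tuple[bytes, bytes]]:
    """Return the (min, max) pair a reader would trust, or None if unavailable."""
    if column_order_set:
        # With a column order, min_value/max_value are well defined.
        lo, hi = stats.min_value, stats.max_value
    else:
        # Assumed fallback: only the deprecated fields are consulted.
        lo, hi = stats.min, stats.max
    if lo is None or hi is None:
        return None
    return lo, hi


# A file that writes only min_value/max_value and no column order.
written = Stats(min_value=b"\x01", max_value=b"\x09")
print(readable_min_max(written, column_order_set=False))  # None
print(readable_min_max(written, column_order_set=True))
```

Under this model, a file that fills in only `min_value` and `max_value` without a column order appears to have no statistics at all, which matches the symptom reported in this issue.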
Describe the bug
A parquet file created by arrow-rs has no min_max statistics when read by pyarrow.
To Reproduce
Expected behavior
pyarrow should get statistics like
Additional context
rust lib version:
python lib version: