-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code #10924
Conversation
ParquetStatistics::Int64(s) if DataType::Int64 == *fields[i].data_type() => { | ||
if let Some(max_value) = &mut max_values[i] { | ||
max_value | ||
.update_batch(&[Arc::new(Int64Array::from_value(*s.max(), 1))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, even though update_batch is called, it first creates a single row array
} | ||
} | ||
let file_field = file_schema.field(file_idx); | ||
let Some(converter) = StatisticsConverter::try_new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code now uses the well tested StatisticsConverter to extract statistics from the parquet file with the correct type of array in a single call
// Note in ASCII lower case is greater than upper case | ||
assert_eq!( | ||
c1_stats.max_value, | ||
Precision::Exact(ScalarValue::from("bar")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strings were previously not handled, but are now properly handled by StatisticsConverter
90a5ea4
to
3e6a537
Compare
@@ -344,6 +344,24 @@ impl ColumnStatistics { | |||
distinct_count: Precision::Absent, | |||
} | |||
} | |||
|
|||
/// Set the null count for the column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions make creating ColumnStatitistic
more ergonomic
Thank you for the review @xinlifoobar
Update: let's work with #11068 (I didn't realize there was another PR) Thanks @xinlifoobar |
Which issue does this PR close?
Closes #10923
Rationale for this change
The code that extracted statistics for ListingTable with parquet files:
I also personally also found the previous code somewhat hard to follow
See #10923 for more details
What changes are included in this PR?
Update Parquet listing table to use the new
StatisticsConverter
API to create DataFusion statisticsAre these changes tested?
Yes, they are covered by existing tests
Are there any user-facing changes?
ListingFiles will now have more accurate statistics