GH-45339: [Parquet][C++] Fix statistics load logic for no row group and multiple row groups #45350
Conversation
cpp/src/parquet/arrow/reader.cc (outdated):

```cpp
if (batch_size == 0) {
  // We can return end immediately for 0 batch size
  return ::arrow::IterationTraits<RecordBatchIterator>::End();
}
```
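The early return above can be pictured with a small self-contained sketch (a toy iterator, not Arrow's `RecordBatchIterator`): signalling end-of-iteration immediately for `batch_size == 0` means no row group or column chunk metadata is ever consulted.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Toy stand-in for a record batch iterator: yields batches of at most
// batch_size elements from a vector, and std::nullopt at end of iteration.
class BatchIterator {
 public:
  BatchIterator(std::vector<int> data, std::size_t batch_size)
      : data_(std::move(data)), batch_size_(batch_size) {}

  std::optional<std::vector<int>> Next() {
    // Mirror of the reviewed change: with batch_size == 0 we can report
    // end-of-iteration right away, before touching any data or metadata.
    if (batch_size_ == 0) return std::nullopt;
    if (pos_ >= data_.size()) return std::nullopt;
    std::size_t n = std::min(batch_size_, data_.size() - pos_);
    std::vector<int> batch(data_.begin() + pos_, data_.begin() + pos_ + n);
    pos_ += n;
    return batch;
  }

 private:
  std::vector<int> data_;
  std::size_t batch_size_;
  std::size_t pos_ = 0;
};
```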
I'm not familiar with this reader implementation but can we do this optimization?
If we can do this, we can assume that row group/column chunk metadata are available in data load logic.
Yes, this fixes the issue when calling FileReader::GetRecordBatchReader. However, would it be cleaner to fix it in LeafReader::LoadBatch? There might be use cases that use the ColumnReader interface directly by calling FileReader::GetColumn instead of FileReader::GetRecordBatchReader.
Agree. `Status LoadBatch(int64_t records_to_read) final` already calls `AttachStatistics`, which assumes statistics exist, and loading the metadata also requires the row group's `input_->column_chunk_metadata()`.
It makes sense.
I'll change the implementation.
Done.
```cpp
TEST(StatisticsTest, RequestNoRowGroup) {
  // Build input
  auto schema = ::arrow::schema({::arrow::field("column", ::arrow::int32())});
  auto built_record_batch = RecordBatchFromJSON(schema, R"([[1], [null], [-1]])");
```
Could we also write an empty Parquet file to test this?
Done.
I also found another problem here. In `Status LoadBatch(int64_t records_to_read) final` (arrow/cpp/src/parquet/arrow/reader.cc, line 472 in 2c90daf), `input_->column_chunk_metadata()` might only contain the statistics for part of the loaded data.
A possible fix might be:

```cpp
bool should_load_statistics;  // config for whether we need to load the statistics

Status LoadBatch(int64_t records_to_read) final {
  BEGIN_PARQUET_CATCH_EXCEPTIONS
  out_ = nullptr;
  record_reader_->Reset();
  // Pre-allocation gives much better performance for flat columns
  record_reader_->Reserve(records_to_read);
  while (records_to_read > 0) {
    if (!record_reader_->HasMoreData()) {
      break;
    }
    int64_t records_read = record_reader_->ReadRecords(records_to_read);
    records_to_read -= records_read;
    if (records_read == 0) {
      NextRowGroup();
    } else if (records_read > 0 && should_load_statistics) {
      break;
    }
  }
```
Good catch! I didn't notice that case... Sorry...
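The suggested loop can be exercised with a self-contained toy model (the names mirror the snippet above, but none of this is the real Arrow reader): with statistics loading enabled, a batch stops at the row-group boundary, so the loaded data always maps to a single column chunk whose statistics can be attached as-is.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the proposed control flow: each row group is just a count of
// remaining records, and ReadRecords() never crosses a group boundary.
class ToyLoader {
 public:
  ToyLoader(std::vector<int64_t> row_group_sizes, bool should_load_statistics)
      : sizes_(std::move(row_group_sizes)),
        should_load_statistics_(should_load_statistics) {
    if (!sizes_.empty()) remaining_ = sizes_[0];
  }

  // Analogue of LoadBatch(): returns the number of records this call loads.
  int64_t LoadBatch(int64_t records_to_read) {
    int64_t loaded = 0;
    while (records_to_read > 0) {
      if (!HasMoreData()) break;
      int64_t records_read = ReadRecords(records_to_read);
      loaded += records_read;
      records_to_read -= records_read;
      if (records_read == 0) {
        NextRowGroup();
      } else if (should_load_statistics_) {
        // Stop at the row-group boundary: the loaded data then corresponds
        // to exactly one column chunk, so its statistics stay valid.
        break;
      }
    }
    return loaded;
  }

 private:
  bool HasMoreData() const { return group_ < sizes_.size(); }
  int64_t ReadRecords(int64_t n) {
    int64_t read = std::min(n, remaining_);
    remaining_ -= read;
    return read;
  }
  void NextRowGroup() {
    ++group_;
    remaining_ = group_ < sizes_.size() ? sizes_[group_] : 0;
  }

  std::vector<int64_t> sizes_;
  bool should_load_statistics_;
  std::size_t group_ = 0;
  int64_t remaining_ = 0;
};
```

Without the flag, one call happily concatenates both row groups; with it, each call returns at most one row group's worth of records.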
```cpp
if (values.size() > 0) {
  RETURN_NOT_OK(builder.AppendValues(values.data(), values.size(), valid_bytes.data()));
}
```
This is to avoid passing nullptr when num_rows == 0.
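As a self-contained illustration of why the guard matters (the helper below is hypothetical, not Arrow's builder API): an empty vector's `data()` may legally return `nullptr`, and `memcpy`-style routines have undefined behavior when handed a null pointer, even with a length of zero.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical low-level append standing in for builder.AppendValues().
// memcpy with a null source pointer is undefined behavior per the C
// standard even when n == 0, so the caller must not reach it with nullptr.
void AppendRaw(std::vector<int32_t>* dest, const int32_t* values,
               std::size_t n) {
  std::size_t old_size = dest->size();
  dest->resize(old_size + n);
  std::memcpy(dest->data() + old_size, values, n * sizeof(int32_t));
}

// Guarded caller mirroring the reviewed change: skip the call entirely
// when there is nothing to append, since values.data() may be nullptr.
void AppendValuesGuarded(std::vector<int32_t>* dest,
                         const std::vector<int32_t>& values) {
  if (values.size() > 0) {
    AppendRaw(dest, values.data(), values.size());
  }
}
```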
```cpp
  ASSERT_FALSE(record_batch);
}

TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {
```
This test is exactly the same as RecordBatchReaderEmptyRowGroups. I manually did a diff to see if I was missing anything:

```
$ diff 1.cpp 2.cpp
1c1
< TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {
---
> TEST(TestArrowFileReader, RecordBatchReaderEmptyRowGroups) {
```
Oh, sorry. It's a copy & paste mistake... I should have used ColumnReader->NextBatch()...
```diff
@@ -913,7 +913,8 @@ class PARQUET_EXPORT ArrowReaderProperties {
        pre_buffer_(true),
        cache_options_(::arrow::io::CacheOptions::LazyDefaults()),
        coerce_int96_timestamp_unit_(::arrow::TimeUnit::NANO),
-       arrow_extensions_enabled_(false) {}
+       arrow_extensions_enabled_(false),
+       should_load_statistics_(false) {}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding: why do we default to not loading statistics?
It's for performance.
The current implementation may concatenate a Parquet column's data from multiple row groups into one Arrow array. If we always load statistics, we can't do that, because we don't have a statistics merge implementation. If we had one, we could use true here, because we could still concatenate the data from multiple row groups into one Arrow array even when loading statistics.
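To make the "statistics merge implementation" idea concrete, here is a minimal sketch of what merging two column chunks' statistics could look like. The struct and function are hypothetical, not Arrow's API; the point is that min/max/null_count merge cleanly, which is what would let statistics survive concatenation across row groups.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-column-chunk statistics. The real Parquet/Arrow classes
// are richer, but the merge idea is the same: the merged values are valid
// statistics for the concatenation of the two chunks.
struct ChunkStats {
  int32_t min;
  int32_t max;
  int64_t null_count;
};

ChunkStats MergeStats(const ChunkStats& a, const ChunkStats& b) {
  return ChunkStats{std::min(a.min, b.min), std::max(a.max, b.max),
                    a.null_count + b.null_count};
}
```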
LGTM!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3e6e8f3. There were 8 benchmark results with an error.
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
…roup and multiple row groups (apache#45350)

### Rationale for this change

Loading `arrow::ArrayStatistics` logic depends on `parquet::ColumnChunkMetaData`. We can't get `parquet::ColumnChunkMetaData` when requested row groups are empty because no associated row group and column chunk exist. We can't use multiple `parquet::ColumnChunkMetaData`s for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups.

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to enforce loading statistics by loading row groups one by one

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>