GH-45339: [Parquet][C++] Fix statistics load logic for no row group and multiple row groups #45350

Merged
merged 7 commits into from
Jan 30, 2025


@kou kou (Member) commented Jan 25, 2025

Rationale for this change

The logic that loads arrow::ArrayStatistics depends on parquet::ColumnChunkMetaData.

We can't get parquet::ColumnChunkMetaData when requested row groups are empty because no associated row group and column chunk exist.

We can't use multiple parquet::ColumnChunkMetaDatas for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups.

What changes are included in this PR?

  • Don't load statistics when no row groups are used
  • Don't load statistics when multiple row groups are used
  • Add parquet::ArrowReaderProperties::{set_,}should_load_statistics() to force statistics to be loaded by reading row groups one at a time
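
The first two bullets amount to one rule: statistics are attached only when the data for an array comes from exactly one row group. A minimal stand-alone sketch of that rule follows. `ShouldAttachStatistics` and its parameter are hypothetical names chosen for illustration; they are not functions from the Arrow or Parquet C++ API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Arrow/Parquet): decides whether column
// chunk statistics can be attached to an array that was assembled from
// `num_row_groups_read` row groups.
inline bool ShouldAttachStatistics(std::size_t num_row_groups_read) {
  // 0 row groups: no column chunk metadata exists to read statistics from.
  // 2+ row groups: the per-chunk statistics would have to be merged, and
  // no merge logic exists yet.
  return num_row_groups_read == 1;
}
```

Under this rule, the new reader property makes the single-row-group case reachable on demand by never letting one array span more than one row group.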

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou requested a review from wgtmac as a code owner January 25, 2025 10:53
⚠️ GitHub issue #45339 has been automatically assigned in GitHub to PR creator.

Comment on lines 1030 to 1033
if (batch_size == 0) {
  // We can return end immediately for 0 batch size
  return ::arrow::IterationTraits<RecordBatchIterator>::End();
}
Member Author

I'm not familiar with this reader implementation but can we do this optimization?
If we can do this, we can assume that row group/column chunk metadata are available in data load logic.

Member

Yes, this fixes the issue when calling FileReader::GetRecordBatchReader. However, would it be cleaner to fix it in LeafReader::LoadBatch? There might be use cases that use the ColumnReader interface directly by calling FileReader::GetColumn instead of FileReader::GetRecordBatchReader.

Member

Agreed. Status LoadBatch(int64_t records_to_read) final calls AttachStatistics, which assumes statistics exist, and fetching the metadata also requires a row group in order to call input_->column_chunk_metadata().

Member Author

It makes sense.

I'll change the implementation.

Member Author

Done.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Jan 25, 2025
TEST(StatisticsTest, RequestNoRowGroup) {
  // Build input
  auto schema = ::arrow::schema({::arrow::field("column", ::arrow::int32())});
  auto built_record_batch = RecordBatchFromJSON(schema, R"([[1], [null], [-1]])");
Member

Could we also write an empty Parquet file to test this?

Member Author

Done.

Member

@mapleFU mapleFU left a comment

I also found another problem here. Consider

Status LoadBatch(int64_t records_to_read) final {
Here, the single returned Array might come from multiple row groups, so input_->column_chunk_metadata() might contain only part of the statistics.

A possible solution:

  bool should_load_statistics;  // config: whether statistics need to be loaded
  Status LoadBatch(int64_t records_to_read) final {
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    out_ = nullptr;
    record_reader_->Reset();
    // Pre-allocation gives much better performance for flat columns
    record_reader_->Reserve(records_to_read);
    while (records_to_read > 0) {
      if (!record_reader_->HasMoreData()) {
        break;
      }
      int64_t records_read = record_reader_->ReadRecords(records_to_read);
      records_to_read -= records_read;
      if (records_read == 0) {
        NextRowGroup();
      } else if (records_read > 0 && should_load_statistics) {
        // Stop after reading from one row group so that the returned array
        // maps to a single column chunk.
        break;
      }
    }
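
To make the effect of that early `break` concrete, here is a small stand-alone simulation of the loop. It is only a sketch: `FakeLeafReader` and `BatchResult` are hypothetical stand-ins that model row groups as plain record counts, and none of this is the real Parquet reader code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Result of one simulated LoadBatch call.
struct BatchResult {
  int64_t records_read = 0;    // total records placed in the batch
  int row_groups_touched = 0;  // row groups that contributed records
};

// Hypothetical stand-in for the leaf reader: each element of
// `row_group_sizes` is the number of records in one row group.
class FakeLeafReader {
 public:
  FakeLeafReader(std::vector<int64_t> row_group_sizes, bool should_load_statistics)
      : remaining_(std::move(row_group_sizes)),
        should_load_statistics_(should_load_statistics) {}

  BatchResult LoadBatch(int64_t records_to_read) {
    BatchResult result;
    while (records_to_read > 0 && current_ < remaining_.size()) {
      // Read as much as the current row group can still provide.
      int64_t records_read = std::min(records_to_read, remaining_[current_]);
      remaining_[current_] -= records_read;
      records_to_read -= records_read;
      if (records_read == 0) {
        ++current_;  // current row group is exhausted: move to the next one
      } else {
        result.records_read += records_read;
        ++result.row_groups_touched;
        // Stop here so the batch maps to exactly one row group.
        if (should_load_statistics_) break;
      }
    }
    return result;
  }

 private:
  std::vector<int64_t> remaining_;
  std::size_t current_ = 0;
  bool should_load_statistics_;
};
```

With row groups of 3 and 2 records and a request for 5 records, the simulated reader returns one 5-record batch spanning two row groups when the flag is off, and a 3-record batch followed by a 2-record batch when the flag is on.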

@kou kou (Member Author) commented Jan 28, 2025

I also found another problem here.

Good catch! I didn't notice the case... Sorry...

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 28, 2025
@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 28, 2025
@kou kou changed the title from "GH-45339: [Parquet][C++] Read nothing when requested row groups are empty" to "GH-45339: [Parquet][C++] Fix statistics load logic for no row group and multiple row groups" Jan 28, 2025
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 28, 2025
Comment on lines +232 to +234
if (values.size() > 0) {
  RETURN_NOT_OK(builder.AppendValues(values.data(), values.size(), valid_bytes.data()));
}
Member Author

This is to avoid passing nullptr when num_rows == 0.
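
For context, `std::vector::data()` is allowed to return a null pointer when the vector is empty, which is why the guard is needed. The sketch below shows the same pattern in isolation. `FakeBuilder` and `AppendAll` are hypothetical names used only for illustration, and the null check inside `FakeBuilder` is an assumption of this sketch, not a statement about how the Arrow builders behave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink that, like many pointer-plus-length APIs, requires a
// non-null pointer.
struct FakeBuilder {
  std::size_t length = 0;
  bool AppendValues(const int32_t* values, std::size_t n) {
    if (values == nullptr) return false;  // reject null input
    length += n;
    return true;
  }
};

// Skip the append entirely for empty input, because data() on an empty
// vector may be nullptr.
inline bool AppendAll(FakeBuilder& builder, const std::vector<int32_t>& values) {
  if (values.empty()) return true;  // nothing to append
  return builder.AppendValues(values.data(), values.size());
}
```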

@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 28, 2025
ASSERT_FALSE(record_batch);
}

TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {
Member

This test is exactly the same as RecordBatchReaderEmptyRowGroups:
I manually did a diff to see if I was missing anything:

$ diff 1.cpp 2.cpp 
1c1
< TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {
---
> TEST(TestArrowFileReader, RecordBatchReaderEmptyRowGroups) {

Member Author

Oh, sorry. It's a copy & paste mistake...
I should have used ColumnReader->NextBatch()...

@@ -913,7 +913,8 @@ class PARQUET_EXPORT ArrowReaderProperties {
         pre_buffer_(true),
         cache_options_(::arrow::io::CacheOptions::LazyDefaults()),
         coerce_int96_timestamp_unit_(::arrow::TimeUnit::NANO),
-        arrow_extensions_enabled_(false) {}
+        arrow_extensions_enabled_(false),
+        should_load_statistics_(false) {}
Member

Just for my understanding: why do we default to not loading statistics?

Member Author

It's for performance.
The current implementation may concatenate column data from multiple Parquet row groups into one Arrow array. If we always loaded statistics, we couldn't do that, because we don't have a statistics merge implementation. Once we have one, we can default to true here, because we will still be able to concatenate column data from multiple row groups into one Arrow array even when we load statistics.
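
As a rough illustration of what such a merge could look like, here is a stand-alone sketch. It does not exist in Arrow: `ChunkStats` and `MergeStats` are hypothetical names, and the struct covers only a null count, a minimum and a maximum for an int32 column. It also takes the conservative view that a field is known after the merge only if both inputs know it; a real implementation would need more care, for example with chunks that contain only nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified statistics for one int32 column chunk. Each field
// is optional because a writer may omit it.
struct ChunkStats {
  std::optional<int64_t> null_count;
  std::optional<int32_t> min;
  std::optional<int32_t> max;
};

// Sketch of a merge: a field stays known only if it is known in both inputs.
inline ChunkStats MergeStats(const ChunkStats& a, const ChunkStats& b) {
  ChunkStats merged;
  if (a.null_count && b.null_count) {
    merged.null_count = *a.null_count + *b.null_count;
  }
  if (a.min && b.min) merged.min = std::min(*a.min, *b.min);
  if (a.max && b.max) merged.max = std::max(*a.max, *b.max);
  return merged;
}
```

With something along these lines, an array built from several row groups could still carry statistics, which is the condition mentioned above for changing the default to true.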

@github-actions github-actions bot removed the awaiting changes label Jan 28, 2025
@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting change review label Jan 28, 2025
Member

@mapleFU mapleFU left a comment

LGTM!

@kou kou merged commit 3e6e8f3 into apache:main Jan 30, 2025
35 checks passed
@kou kou removed the awaiting changes label Jan 30, 2025
@kou kou deleted the cpp-parquet-statistics branch January 30, 2025 04:27

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3e6e8f3.

There were 8 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

lriggs pushed a commit to lriggs/arrow that referenced this pull request Jan 30, 2025
…roup and multiple row groups (apache#45350)

### Rationale for this change

Loading `arrow::ArrayStatistics` logic depends on `parquet::ColumnChunkMetaData`.

We can't get `parquet::ColumnChunkMetaData` when requested row groups are empty because no associated row group and column chunk exist.

We can't use multiple `parquet::ColumnChunkMetaData`s for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups. 

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to enforce loading statistics by loading row group one by one

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
amoeba pushed a commit that referenced this pull request Jan 31, 2025
4 participants