GH-45339: [Parquet][C++] Fix statistics load logic for no row group and multiple row groups #45350

kou · 2025-01-25T10:53:48Z

Rationale for this change

The logic that loads arrow::ArrayStatistics depends on parquet::ColumnChunkMetaData.

We can't get parquet::ColumnChunkMetaData when the requested row groups are empty, because there is no associated row group or column chunk.

We can't use multiple parquet::ColumnChunkMetaData objects for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups.

What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add parquet::ArrowReaderProperties::{set_,}should_load_statistics() to enforce loading statistics by loading row groups one by one
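For illustration, the rule behind the first two items can be sketched with a toy model. ChunkStatistics and StatisticsForArray are hypothetical names introduced here, not part of Arrow or Parquet; the sketch only assumes that an array is built from zero, one, or several column chunks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the statistics carried by one Parquet column chunk.
// (Hypothetical type; this is not parquet::ColumnChunkMetaData.)
struct ChunkStatistics {
  int64_t null_count;
  int32_t min;
  int32_t max;
};

// Returns the statistics to attach to an array built from `chunks`,
// or nullptr when no statistics should be attached:
// - no chunk: there is nothing to read statistics from
// - one chunk: its statistics describe the whole array
// - several chunks: a merge step would be needed, and none exists yet
const ChunkStatistics* StatisticsForArray(
    const std::vector<ChunkStatistics>& chunks) {
  if (chunks.size() != 1) {
    return nullptr;
  }
  return &chunks.front();
}
```

With the third item, a reader that is asked to load statistics would always end up in the one-chunk case, because it stops each array at a row group boundary.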

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

GitHub Issue: [Parquet][C++] Reading parquet with an empty list of row group indices fails #45339

github-actions · 2025-01-25T10:54:14Z

⚠️ GitHub issue #45339 has been automatically assigned in GitHub to PR creator.

kou · 2025-01-25T10:55:36Z

cpp/src/parquet/arrow/reader.cc

+        if (batch_size == 0) {
+          // We can return end immediately for 0 batch size
+          return ::arrow::IterationTraits<RecordBatchIterator>::End();
+        }


I'm not familiar with this reader implementation but can we do this optimization?
If we can do this, we can assume that row group/column chunk metadata are available in data load logic.

Yes, this fixes the issue when calling FileReader::GetRecordBatchReader. However, would it be cleaner to fix it in LeafReader::LoadBatch? There might be use cases that use the ColumnReader interface directly by calling FileReader::GetColumn instead of FileReader::GetRecordBatchReader.

Agreed. Status LoadBatch(int64_t records_to_read) final already calls AttachStatistics, which assumes that statistics exist and reaches the row group metadata through input_->column_chunk_metadata().

It makes sense.

I'll change the implementation.

mapleFU · 2025-01-25T12:58:09Z

cpp/src/parquet/arrow/arrow_statistics_test.cc

+TEST(StatisticsTest, RequestNoRowGroup) {
+  // Build input
+  auto schema = ::arrow::schema({::arrow::field("column", ::arrow::int32())});
+  auto built_record_batch = RecordBatchFromJSON(schema, R"([[1], [null], [-1]])");


Could we also write an empty Parquet file to test this?

mapleFU

I also found another problem here. Consider

arrow/cpp/src/parquet/arrow/reader.cc

Line 472 in 2c90daf

Status LoadBatch(int64_t records_to_read) final {

Here, the single returned Array might come from multiple row groups, so input_->column_chunk_metadata() might contain only part of the statistics.

A possible solution:

  bool should_load_statistics; // config for whether we need to load the statistics
  Status LoadBatch(int64_t records_to_read) final {
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    out_ = nullptr;
    record_reader_->Reset();
    // Pre-allocation gives much better performance for flat columns
    record_reader_->Reserve(records_to_read);
    while (records_to_read > 0) {
      if (!record_reader_->HasMoreData()) {
        break;
      }
      int64_t records_read = record_reader_->ReadRecords(records_to_read);
      records_to_read -= records_read;
      if (records_read == 0) {
        NextRowGroup();
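      } else if (records_read > 0 && should_load_statistics) {
        // Stop after the first non-empty read so that the returned array
        // comes from a single row group (and a single column chunk)
        break;
      }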
    }
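
To make the effect of the extra break concrete, here is a toy simulation of the loop. ToyLeafReader and its members are hypothetical and only mirror the names in the snippet above; it is a sketch of the control flow, not the Arrow implementation, and it assumes ReadRecords never crosses a row group boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a leaf reader walking over row groups (hypothetical type).
struct ToyLeafReader {
  std::vector<int64_t> row_group_sizes;  // records per row group
  bool should_load_statistics = false;
  std::size_t current_row_group = 0;
  int64_t consumed_in_current = 0;  // records already read from this row group
  int64_t last_batch_records = 0;   // records returned by the last LoadBatch
  int last_batch_row_groups = 0;    // row groups that fed the last LoadBatch

  bool HasMoreData() const {
    return current_row_group < row_group_sizes.size();
  }

  // Reads up to n records from the current row group only.
  int64_t ReadRecords(int64_t n) {
    int64_t left = row_group_sizes[current_row_group] - consumed_in_current;
    int64_t read = std::min(n, left);
    consumed_in_current += read;
    return read;
  }

  void NextRowGroup() {
    ++current_row_group;
    consumed_in_current = 0;
  }

  void LoadBatch(int64_t records_to_read) {
    last_batch_records = 0;
    last_batch_row_groups = 0;
    while (records_to_read > 0) {
      if (!HasMoreData()) {
        break;
      }
      int64_t records_read = ReadRecords(records_to_read);
      records_to_read -= records_read;
      if (records_read == 0) {
        NextRowGroup();
      } else {
        last_batch_records += records_read;
        ++last_batch_row_groups;
        if (should_load_statistics) {
          break;  // keep the batch inside one row group
        }
      }
    }
  }
};
```

In this model, two row groups of 3 records and a request for 5 records give one batch of 5 records spanning both row groups by default, and a batch of 3 records from a single row group when should_load_statistics is set. The cost is that batches can be shorter than requested.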

kou · 2025-01-28T02:23:03Z

I also found another problem here.

Good catch! I didn't notice the case... Sorry...

kou · 2025-01-28T08:06:51Z

cpp/src/parquet/arrow/test_util.h

+  if (values.size() > 0) {
+    RETURN_NOT_OK(builder.AppendValues(values.data(), values.size(), valid_bytes.data()));
+  }


This is to avoid passing nullptr when num_rows == 0.
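
As background, std::vector::data() is allowed to return a null pointer for an empty vector, so the size check keeps that pointer away from the builder. The sketch below shows the same guard with a hypothetical ToyBuilder; it only mimics the shape of an AppendValues(values, length, valid_bytes) call and is not arrow::ArrayBuilder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal builder used only for this illustration.
struct ToyBuilder {
  std::vector<int32_t> values;
  std::vector<uint8_t> validity;

  // Precondition modelled here: callers must not pass null pointers.
  bool AppendValues(const int32_t* data, std::size_t length,
                    const uint8_t* valid_bytes) {
    if (data == nullptr || valid_bytes == nullptr) {
      return false;  // stands in for whatever a real builder would do
    }
    values.insert(values.end(), data, data + length);
    validity.insert(validity.end(), valid_bytes, valid_bytes + length);
    return true;
  }
};

// Appends only when there is something to append, so the data() pointer of
// an empty vector is never handed to the builder.
bool AppendAll(ToyBuilder* builder, const std::vector<int32_t>& values,
               const std::vector<uint8_t>& valid_bytes) {
  if (values.size() != valid_bytes.size()) {
    return false;
  }
  if (values.empty()) {
    return true;
  }
  return builder->AppendValues(values.data(), values.size(),
                               valid_bytes.data());
}
```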

raulcd · 2025-01-28T10:12:44Z

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

+  ASSERT_FALSE(record_batch);
+}
+
+TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {


This test is exactly the same as RecordBatchReaderEmptyRowGroups:
I manually did a diff to see if I was missing anything:

$ diff 1.cpp 2.cpp
1c1
< TEST(TestArrowColumnReader, NextBatchZeroBatchSize) {
---
> TEST(TestArrowFileReader, RecordBatchReaderEmptyRowGroups) {

Oh, sorry. It's a copy & paste mistake...
I should have used ColumnReader->NextBatch()...

raulcd · 2025-01-28T10:18:52Z

cpp/src/parquet/properties.h

@@ -913,7 +913,8 @@ class PARQUET_EXPORT ArrowReaderProperties {
        pre_buffer_(true),
        cache_options_(::arrow::io::CacheOptions::LazyDefaults()),
        coerce_int96_timestamp_unit_(::arrow::TimeUnit::NANO),
-        arrow_extensions_enabled_(false) {}
+        arrow_extensions_enabled_(false),
+        should_load_statistics_(false) {}


Just for my understanding: why do we default to not loading statistics?

It's for performance.
The current implementation may concatenate Parquet column data from multiple row groups into one Arrow array. If we always loaded statistics, we couldn't do that, because we don't have a statistics merge implementation. Once we have one, we can use true here, because we could still concatenate column data from multiple row groups into one Arrow array while loading statistics.
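
To give a rough idea of what such a merge could involve, here is a hypothetical sketch for an int32 column. ToyStatistics and MergeStatistics are made-up names; nothing like this existed in Arrow at the time of this PR, and a real implementation would have to cover every type and more statistics fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-chunk statistics for an int32 column.
struct ToyStatistics {
  int64_t null_count;
  bool has_min_max;  // false when the chunk holds only nulls (or no rows)
  int32_t min;
  int32_t max;
};

// Combines the statistics of two chunks that are concatenated into one array.
// null_count adds up; min/max take the extremes of the sides that have them.
// A distinct count is left out on purpose: it cannot be derived exactly from
// two per-chunk counts.
ToyStatistics MergeStatistics(const ToyStatistics& a, const ToyStatistics& b) {
  ToyStatistics merged;
  merged.null_count = a.null_count + b.null_count;
  merged.has_min_max = a.has_min_max || b.has_min_max;
  if (a.has_min_max && b.has_min_max) {
    merged.min = std::min(a.min, b.min);
    merged.max = std::max(a.max, b.max);
  } else if (a.has_min_max) {
    merged.min = a.min;
    merged.max = a.max;
  } else if (b.has_min_max) {
    merged.min = b.min;
    merged.max = b.max;
  } else {
    merged.min = 0;  // unused placeholder values
    merged.max = 0;
  }
  return merged;
}
```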

mapleFU

LGTM!

conbench-apache-arrow · 2025-01-30T10:53:44Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3e6e8f3.

There were 8 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2025-01-30 05:46:35Z
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-03, scale_factor=1
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-03, scale_factor=1
and 6 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

…roup and multiple row groups (apache#45350)

### Rationale for this change

Loading `arrow::ArrayStatistics` logic depends on `parquet::ColumnChunkMetaData`.

We can't get `parquet::ColumnChunkMetaData` when requested row groups are empty because no associated row group and column chunk exist.

We can't use multiple `parquet::ColumnChunkMetaData`s for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups.

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to enforce loading statistics by loading row group one by one

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

GH-45339: [Parquet][C++] Read nothing when requested row groups are empty

471ea67

kou requested a review from wgtmac as a code owner January 25, 2025 10:53

mapleFU reviewed Jan 25, 2025

Check in LoadBatch()

7fe9453

kou added 3 commits January 28, 2025 16:23

Update comments

4742826

Revert a needless optimization

028aafc

Revert a needless change

a5baba4

kou changed the title ~~GH-45339: [Parquet][C++] Read nothing when requested row groups are empty~~ GH-45339: [Parquet][C++] Fix statistics load logic for no row group and multiple row groups Jan 28, 2025

Don't pass nullptr data

8b4ba1f

raulcd reviewed Jan 28, 2025

Fix test content

482a50f

mapleFU approved these changes Jan 28, 2025

kou merged commit 3e6e8f3 into apache:main Jan 30, 2025
35 checks passed

kou deleted the cpp-parquet-statistics branch January 30, 2025 04:27

kou mentioned this pull request Jan 30, 2025

[Parquet][C++] Reading parquet with an empty list of row group indices fails #45339

Closed
