[Parquet][C++] Reading parquet with an empty list of row group indices fails #45339

Closed
romankarlstetter opened this issue Jan 23, 2025 · 2 comments

Comments

@romankarlstetter
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

The pull request #43945 changes TransferColumnData() to require a parquet::ColumnChunkMetaData parameter. LeafReader::LoadBatch calls input_->column_chunk_metadata(), but this fails if the list of row group indices is empty.

See https://github.com/apache/arrow/pull/43945/files/dcd3bdfd1ae313733c670567cd692016cb5523a9#r1926722505

This affects releases >= v18.0.0; Arrow releases up to 17.x work as expected.
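
To illustrate the failure mode without depending on Arrow, here is a minimal sketch using hypothetical stand-in types (`ChunkMeta` and `RowGroupCursor` are invented for this example and are not the actual `LeafReader` code). It assumes a reader that looks up metadata for the first requested row group: with an empty request there is no such row group, so the lookup has to report "nothing" instead of returning metadata unconditionally.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for parquet::ColumnChunkMetaData.
struct ChunkMeta {
  int row_group;
  long long num_values;
};

// Hypothetical cursor over the row group indices a caller asked for.
class RowGroupCursor {
 public:
  RowGroupCursor(std::vector<ChunkMeta> file_chunks, std::vector<int> requested)
      : file_chunks_(std::move(file_chunks)), requested_(std::move(requested)) {}

  // Returns the metadata of the first requested row group, or std::nullopt
  // when no row group was requested or the index is out of range.
  std::optional<ChunkMeta> CurrentChunkMeta() const {
    if (requested_.empty()) return std::nullopt;  // the case reported in this issue
    const int index = requested_.front();
    if (index < 0 || static_cast<std::size_t>(index) >= file_chunks_.size()) {
      return std::nullopt;
    }
    return file_chunks_[static_cast<std::size_t>(index)];
  }

 private:
  std::vector<ChunkMeta> file_chunks_;
  std::vector<int> requested_;
};
```

A caller that needs the metadata only for optional work (such as attaching statistics) can then skip that work when `CurrentChunkMeta()` is empty, instead of failing the whole read.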

Component(s)

C++, Parquet

@kou
Member

kou commented Jan 24, 2025

FYI: @amoeba 19.0.1 should include a fix for this.

@raulcd raulcd marked this as a duplicate of #45343 Jan 24, 2025
kou added a commit to kou/arrow that referenced this issue Jan 25, 2025
kou added a commit that referenced this issue Jan 30, 2025
…nd multiple row groups (#45350)

### Rationale for this change

The logic for loading `arrow::ArrayStatistics` depends on `parquet::ColumnChunkMetaData`.

We can't get `parquet::ColumnChunkMetaData` when requested row groups are empty because no associated row group and column chunk exist.

We can't use multiple `parquet::ColumnChunkMetaData`s for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups. 

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to force loading statistics by loading row groups one by one
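
The three changes above amount to a small decision rule. The following is a sketch of that rule as read from this commit message, not the code from the PR: `StatsPlan`, `PlanStatistics`, and `SplitOneByOne` are hypothetical names, and the assumption that a single requested row group always loads statistics is an inference from the list above.

```cpp
#include <vector>

// Hypothetical outcome of the decision; not a type from Arrow.
enum class StatsPlan {
  kSkip,             // do not attach statistics to the resulting arrays
  kLoadDirectly,     // one row group: its column chunk metadata is unambiguous
  kLoadPerRowGroup,  // several row groups, read one at a time
};

// Decides how statistics are handled for a requested set of row groups.
// `should_load_statistics` models the new reader property (assumed semantics).
StatsPlan PlanStatistics(const std::vector<int>& row_groups,
                         bool should_load_statistics) {
  if (row_groups.empty()) return StatsPlan::kSkip;  // no metadata exists at all
  if (row_groups.size() == 1) return StatsPlan::kLoadDirectly;
  // Several row groups: there is no statistics merge logic, so either skip
  // or fall back to reading each row group separately.
  return should_load_statistics ? StatsPlan::kLoadPerRowGroup : StatsPlan::kSkip;
}

// Splits a request into single-row-group reads, so that each read is backed
// by exactly one column chunk metadata object.
std::vector<std::vector<int>> SplitOneByOne(const std::vector<int>& row_groups) {
  std::vector<std::vector<int>> batches;
  batches.reserve(row_groups.size());
  for (int row_group : row_groups) {
    batches.push_back(std::vector<int>{row_group});
  }
  return batches;
}
```

Under these assumptions, an empty request never loads statistics regardless of the property, which is the case this issue reports, and a multi-row-group request loads them only when the caller opts in and accepts one read per row group.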

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou
Member

kou commented Jan 30, 2025

Issue resolved by pull request 45350
#45350

@kou kou closed this as completed Jan 30, 2025
lriggs pushed a commit to lriggs/arrow that referenced this issue Jan 30, 2025
amoeba pushed a commit that referenced this issue Jan 31, 2025
amoeba pushed a commit that referenced this issue Jan 31, 2025