
GH-41317: [C++] Fix crash on invalid Parquet file #41366

Merged
2 commits merged into apache:main on Apr 30, 2024

Conversation

@rouault (Contributor) commented Apr 24, 2024

Rationale for this change

Fixes the crash in TableBatchReader::ReadNext() detailed in #41317, triggered by a corrupted Parquet file.

What changes are included in this PR?

Adds a validation that all columns read have the same size.

Are these changes tested?

I've tested with the reproducer I provided in #41317 and confirmed it now triggers a clean error:

```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```

I'm not sure if/how unit tests for corrupted datasets should be added.

Are there any user-facing changes?

No

This PR contains a "Critical Fix".

@mapleFU (Member) commented Apr 24, 2024

Will merge in 2 days if there are no negative comments.

@felipecrv (Contributor) left a comment

Can you add the DCHECK in the TableBatchReader as well?

And note in the class docstring that the Table is expected to be valid prior to using it with the batch reader?

@rouault (Contributor, Author) commented Apr 24, 2024

> is expected to be valid prior to using it with the batch reader

Done.

@rouault (Contributor, Author) commented Apr 24, 2024

It seems the added ValidateFull() breaks one test:

```
[----------] 1 test from TestTableSortIndicesForTemporal/1, where TypeParam = class arrow::Date64Type
[ RUN      ] TestTableSortIndicesForTemporal/1.NoNull
D:/a/arrow/arrow/cpp/src/arrow/table.cc:622:  Check failed: _s.ok() Operation failed: table_.ValidateFull()
Bad status: Invalid: Column 0: In chunk 0: Invalid: date64[ms] 1 does not represent a whole number of days
```

I will have to let people more knowledgeable about the code base than me investigate that.

@felipecrv (Contributor) left a comment

If you use Validate() instead of ValidateFull() this won't be a problem.

Validate() contains the structural checks that ensure memory safety, whereas ValidateFull() goes deep into the validity of individual array values and can be very slow.

```cpp
constexpr c_type kFullDayMillis = 1000 * 60 * 60 * 24;
if (date % kFullDayMillis != 0) {
  return Status::Invalid(type, " ", date,
                         " does not represent a whole number of days");
}
```

@rouault (Contributor, Author) commented Apr 28, 2024

This PR should be ready to merge (as far as I can see, the remaining test failures also occur in other pull requests).

@mapleFU (Member) left a comment

Will wait for @felipecrv's comment here.

@mapleFU requested a review from @felipecrv on April 29, 2024
@mapleFU (Member) commented Apr 29, 2024

Will merge in one day if there are no negative comments.

@mapleFU merged commit e4f3146 into apache:main on Apr 30, 2024 (34 of 37 checks passed)

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e4f3146.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024