GH-41317: [C++] Fix crash on invalid Parquet file #41366
Conversation
Lead-authored-by: mwish <[email protected]>
Will merge in 2 days if no negative comments.
Can you add the DCHECK in the TableBatchReader as well? And note in the class docstring that the Table is expected to be valid prior to using it with the batch reader?
Done.
It seems the added ValidateFull() breaks one test. I will have to let people more knowledgeable about the code base than me investigate that.
If you use Validate() instead of ValidateFull(), this won't be a problem. Validate() contains the structural checks that ensure memory safety, whereas ValidateFull() can get very deep into the validity of individual values of an array and is very slow.
arrow/cpp/src/arrow/array/validate.cc, lines 189 to 193 in 192de02:

```cpp
constexpr c_type kFullDayMillis = 1000 * 60 * 60 * 24;
if (date % kFullDayMillis != 0) {
  return Status::Invalid(type, " ", date,
                         " does not represent a whole number of days");
}
```
Force-pushed from d17d3ca to 90f2c07.
This PR should be ready to merge (as far as I can see, the test failures also appear in other pull requests).
Will wait for @felipecrv's comment here.
Force-pushed from 90f2c07 to 9002fac.
Will merge if no negative comments in one day.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e4f3146. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

Fixes the crash detailed in apache#41317 in `TableBatchReader::ReadNext()` on a corrupted Parquet file.

### What changes are included in this PR?

Add a validation that all read columns have the same size.

### Are these changes tested?

I've tested on the reproducer I provided in apache#41317 that it now triggers a clean error:

```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```

I'm not sure if/how unit tests for corrupted datasets should be added.

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**

* GitHub Issue: apache#41317

Authored-by: Even Rouault <[email protected]>
Signed-off-by: mwish <[email protected]>