
GH-41317: [C++] Fix crash on invalid Parquet file #41366

Merged
2 commits merged into apache:main on Apr 30, 2024

Conversation

@rouault (Contributor) commented Apr 24, 2024

Rationale for this change

Fixes the crash in TableBatchReader::ReadNext() detailed in #41317, triggered by a corrupted Parquet file.

What changes are included in this PR?

Adds a validation that all columns read have the same size.

Are these changes tested?

I've tested with the reproducer I provided in #41317 and confirmed it now triggers a clean error:

```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```

I'm not sure if/how unit tests for corrupted datasets should be added.

Are there any user-facing changes?

No

This PR contains a "Critical Fix".

@mapleFU (Member) commented Apr 24, 2024

Will merge in 2 days if there are no negative comments.

@felipecrv (Contributor) left a comment

Can you add the DCHECK in the TableBatchReader as well?

And note in the class docstring that the Table is expected to be valid prior to using it with the batch reader?

@rouault (Contributor, Author) commented Apr 24, 2024

> is expected to be valid prior to using it with the batch reader

Done.

@rouault (Contributor, Author) commented Apr 24, 2024

It seems the added ValidateFull() breaks one test:

```
[----------] 1 test from TestTableSortIndicesForTemporal/1, where TypeParam = class arrow::Date64Type
[ RUN      ] TestTableSortIndicesForTemporal/1.NoNull
D:/a/arrow/arrow/cpp/src/arrow/table.cc:622:  Check failed: _s.ok() Operation failed: table_.ValidateFull()
Bad status: Invalid: Column 0: In chunk 0: Invalid: date64[ms] 1 does not represent a whole number of days
```

I will have to let people more knowledgeable about the code base than me investigate that.

@felipecrv (Contributor) left a comment

If you use Validate() instead of ValidateFull() this won't be a problem.

Validate() contains the structural checks that ensure memory safety, whereas ValidateFull() goes deep into the validity of individual array values and can be very slow.

```cpp
constexpr c_type kFullDayMillis = 1000 * 60 * 60 * 24;
if (date % kFullDayMillis != 0) {
  return Status::Invalid(type, " ", date,
                         " does not represent a whole number of days");
}
```

@rouault (Contributor, Author) commented Apr 28, 2024

This PR should be ready to merge (as far as I can see, the remaining test failures also occur in other pull requests).

@mapleFU (Member) left a comment

Will wait for @felipecrv's comment here.

@mapleFU requested a review from @felipecrv on April 29, 2024
@mapleFU (Member) commented Apr 29, 2024

Will merge in one day if there are no negative comments.

@mapleFU merged commit e4f3146 into apache:main on Apr 30, 2024 (34 of 37 checks passed)

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e4f3146.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024