
ARROW-8062: [C++][Dataset] Implement ParquetDatasetFactory #7180

Conversation

@fsaintjacques (Contributor) commented May 14, 2020

This patch adds the option to create a dataset of parquet files via ParquetDatasetFactory. It reads a single _metadata parquet file created by systems like Dask and Spark, extracts the metadata of all fragments from that file, and populates each fragment with extra statistics for each column. The _metadata file can be created via pyarrow.parquet.write_metadata.
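For reference, a minimal sketch of the intended round trip, assuming the `write_metadata(..., metadata_collector=...)` capability discussed later in this thread; the paths and the example table are illustrative, not from this PR:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({"total_amount": [10.0, 2000.0]})

# Write the dataset pieces, collecting each written file's parquet metadata.
metadata_collector = []
pq.write_to_dataset(table, root_path="dataset_root",
                    metadata_collector=metadata_collector)

# Summarise the collected metadata (including row-group statistics) into
# the single _metadata file that ParquetDatasetFactory consumes.
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)

# Build a dataset from the _metadata file alone.
dataset = ds.parquet_dataset("dataset_root/_metadata")
```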

When the Scan operation is materialised, the row groups of the ParquetFileFragment are pruned against the statistics before the original file metadata is read. If no RowGroups from a file match the predicate of the Scan, the file is not read at all (including the metadata footer), thus avoiding expensive IO calls. The optimisation benefits are inversely proportional to the predicate's selectivity.

```python
# With the plain FileSystemDataset
%timeit t = nyc_tlc_fs_dataset.to_table(filter=da.field('total_amount') > 1000.0, ...)
1.55 s ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With ParquetDatasetFactory
%timeit t = nyc_tlc_parquet_dataset.to_table(filter=da.field('total_amount') > 1000.0, ...)
336 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
  • Implement ParquetDatasetFactory
  • Replace ParquetFileFormat::GetRowGroupFragments with
    ParquetFileFragment::SplitByRowGroup (and the corresponding bindings; see the sketch after this list).
  • Add various optimizations, notably in ColumnChunkStatisticsAsExpression.
  • Consolidate RowGroupSkipper logic in ParquetFileFragment::ScanFile
  • Ensure FileMetaData::AppendRowGroups checks for schema equality.
  • Implement dataset._parquet_dataset
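As a rough illustration of the new SplitByRowGroup binding, here is a sketch of splitting metadata-backed fragments per row group under a predicate; the exact Python signature is an assumption, and the field name is reused from the benchmark above:

```python
import pyarrow.dataset as ds

dataset = ds.parquet_dataset("dataset_root/_metadata")  # illustrative path
predicate = ds.field("total_amount") > 1000.0

# Each file-level fragment splits into one fragment per row group; row
# groups whose statistics cannot match the predicate are pruned.
for fragment in dataset.get_fragments():
    for row_group_fragment in fragment.split_by_row_group(predicate):
        print(row_group_fragment)
```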

@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from ec0a3c2 to 99c51ca on May 14, 2020 19:01
@fsaintjacques requested a review from xhochy May 18, 2020 18:06
@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch 6 times, most recently from aea5db3 to 2bb5686 on May 18, 2020 20:33
@jorisvandenbossche (Member) left a comment

Looking great! I tested on some dummy datasets created by dask, and that seems to work nicely (although it is of course hard to be sure there is actually no IO going on during the factory).

I will still check further against the existing parquet tests that use _metadata files.

Some non-inline comments:

  • Do we still need a partitioning keyword for parquet_dataset?

  • We might want to detect, when given a directory name, whether that directory includes a _metadata file (in some API; maybe this could live in the pyarrow.parquet code)

  • For a follow-up: handling of _common_metadata (just for inspecting the common schema)

  • Regarding the statistics stored as an expression:

    • Do we want to expose this in Python as well?
    • Currently, the statistics are only available when the fragments were constructed from a _metadata file (or after querying once), I think? Do we want to allow populating them on demand?
    • Statistics are only attached to a RowGroupInfo, and not to a fragment? Not needed for this PR, to be clear, but thinking more generally: we might want to enable creating a Fragment with custom statistics (e.g. in Kartothek? @xhochy)
  • Some tests are needed for the new write_metadata capabilities (I can write/push some if you want), and likewise for parquet_dataset/ParquetDatasetFactory.

I ran into a segfault if one of the files is not present (writing a reproducer / test case right now)

```cpp
    const std::shared_ptr<Expression>& predicate) {
  ARROW_ASSIGN_OR_RAISE(auto reader, parquet_format_.GetReader(source_));
  ARROW_ASSIGN_OR_RAISE(auto row_groups,
                        AugmentAndFilter(row_groups_, *predicate, reader.get()));
```
@jorisvandenbossche (Member):
Does this mean that using SplitByRowGroup currently always invokes IO? (In principle it could be filtered/split based on the already-read RowGroupInfo objects?)

@fsaintjacques (Author) commented May 19, 2020:

Correct. It could be filtered, but only if the dataset was generated via the _metadata file (or any explicit RowGroupInfo).

@jorisvandenbossche (Member):

I think it would be good to at least try to avoid IO if the statistics are already available in the RowGroupInfos (at least, from my understanding of how this is used in RAPIDS; I can check with them), but it is certainly not critical to do in this PR.

```python
    ParquetFileFormat,
    ParquetFileFragment,
    ParquetReadOptions,
    Partitioning,
    PartitioningFactory,
    RowGroupInfo,
```
@jorisvandenbossche (Member):

Maybe we should not expose this publicly? (is there a reason you would want to use this directly?)

@fsaintjacques (Author):

That's required for ParquetFileFragment.row_groups. I could change it to only return a list of integers.
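A hypothetical sketch of that accessor as described here; `id` is the only attribute implied by this thread, so anything beyond it would be an assumption, and the path is illustrative:

```python
import pyarrow.dataset as ds

dataset = ds.parquet_dataset("dataset_root/_metadata")  # illustrative path

# Inspect the RowGroupInfo objects attached to each fragment.
for fragment in dataset.get_fragments():
    row_group_ids = [row_group.id for row_group in fragment.row_groups]
    print(fragment.path, row_group_ids)
```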

@fsaintjacques
I don't get a segfault for the test you added, just the wrong exception being thrown.

```
>   raise IOError(errno, message)
E   FileNotFoundError: [Errno 2] Failed to open local file '/tmp/pytest-of-fsaintjacques/pytest-44/test_parquet_dataset_factory_i0/test_parquet_dataset/43bd0bd1002048e0b9bbc730f7614d18.parquet'. Detail: [errno 2] No such file or directory

pyarrow/error.pxi:98: FileNotFoundError
```

@jorisvandenbossche

> I don't get a segfault for the test you added, just the wrong exception being thrown.

A FileNotFoundError sounds good (the ValueError I added in the tests was just a bit random). I will rebuild locally to see if I still get this.

@fsaintjacques

I'm curious about the exception/segfault. If you can reproduce, feel free to share.

@jorisvandenbossche

It seems this failure doesn't happen all the time for me. Running it a few times, I also see the FileNotFoundError, but only in roughly 1 out of 2 cases.

Now, when running it on an actual example in the interactive terminal (from a small dataset where I actually deleted one of the files), I consistently see the segfault:

```python
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.parquet_dataset("/tmp/tmp9qt6cph5/_metadata")

In [3]: dataset.to_table()
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted (core dumped)
```

I get a different stacktrace when running the tests, something about "unlink", so it might be that the way I remove the file in the test is not very robust, or not similar to deleting a file in a file browser.

@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from 3fa0657 to fd5a4a3 on May 20, 2020 17:06
@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from fd5a4a3 to 8b301d3 on May 20, 2020 17:11
@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from 8b301d3 to 3b85adc on May 20, 2020 18:27
@bkietz (Member) left a comment

This looks great, thanks for doing this!

A few comments:

```cpp
  int num_row_groups_;
  int64_t rows_skipped_;
};

class ParquetScanTaskIterator {
 public:
  static Result<ScanTaskIterator> Make(std::shared_ptr<ScanOptions> options,
```
@bkietz:

Nothing here can fail, so we can just make the constructor public.

```cpp
  std::vector<int> column_projection_;
  RowGroupSkipper skipper_;

  FileSource source_;
```
@bkietz:
What is this used for?

@fsaintjacques (Author):
For debugging purposes; it is extremely useful for introspecting the object.

"""
Split the fragment in multiple fragments.
@bkietz:
Suggested change:
```diff
-Split the fragment in multiple fragments.
+Split the fragment into multiple fragments.
```

Could you replicate this comment in C++?

```cpp
Result<FragmentVector> ParquetFileFragment::SplitByRowGroup(
    const std::shared_ptr<Expression>& predicate) {
  std::vector<RowGroupInfo> row_groups;
  if (HasCompleteMetadata()) {
```
@fsaintjacques (Author):
@jorisvandenbossche this is now lazy.

@jorisvandenbossche (Member):
> @jorisvandenbossche this is now lazy.

Cool, thanks!

@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch 2 times, most recently from dcf2a68 to 100a7b0 on May 21, 2020 15:40
@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from 100a7b0 to cca91b3 on May 21, 2020 19:12
@fsaintjacques marked this pull request as ready for review May 21, 2020 21:20
@fsaintjacques force-pushed the ARROW-8062-parquet-dataset-metadata branch from 8df15e3 to 29f44d9 on May 25, 2020 18:42
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request May 26, 2020
Closes apache#7180 from fsaintjacques/ARROW-8062-parquet-dataset-metadata

Lead-authored-by: François Saint-Jacques <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: François Saint-Jacques <[email protected]>
fsaintjacques pushed a commit that referenced this pull request Jun 3, 2020
Follow-up on ARROW-8062 (#7180)

Closes #7345 from jorisvandenbossche/ARROW-8946-metadata-write

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: François Saint-Jacques <[email protected]>