
[REVIEW] Support for large parquet files via chunked writes. #4087

Merged: 15 commits merged into rapidsai:branch-0.13 on Feb 15, 2020

Conversation

nvdbaranec (Contributor):

Addresses #3542

Implements a chunked write API that allows a single parquet file containing one logical table to be written via multiple write_parquet_chunked() calls from the user. This also refactors the non-chunked write_parquet() path to be implemented in terms of a chunked write with 1 chunk.

I also added benchmarking code for both chunked and non-chunked writes. There are some interesting overhead issues with writing many chunks, which may be worth investigating and optimizing at a later time.

Key entry points include (a usage sketch follows this list):

functions.hpp:
write_parquet_chunked_begin
write_parquet_chunked
write_parquet_chunked_end

writer_impl.hpp:
parquet::writer::impl::write_chunked_begin
parquet::writer::impl::write_chunked
parquet::writer::impl::write_chunked_end
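
Roughly, the intended call pattern is sketched below. The namespace alias, the args struct name, and the include paths are assumptions pieced together from the entry points above and the declaration quoted later in this thread, not verbatim cudf code:

#include <cudf/io/functions.hpp>      // assumed include path, per functions.hpp above
#include <cudf/table/table_view.hpp>  // assumed include path for table_view
#include <memory>
#include <vector>

namespace io = cudf::experimental::io;  // assumed namespace for this API

// Hedged usage sketch: write one logical table as a sequence of chunks.
void write_table_in_chunks(std::vector<cudf::table_view> const& chunks,
                           io::write_parquet_chunked_args const& args)  // args type assumed
{
  // Begin: open the sink, write the file header, hand back opaque state.
  auto state = io::write_parquet_chunked_begin(args);

  // Each call appends one chunk of the same logical table to the file.
  for (auto const& chunk : chunks) {
    io::write_parquet_chunked(chunk, state);
  }

  // End: write the file footer and release the state.
  io::write_parquet_chunked_end(state);
}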

State information about the writes is returned via an opaque struct, pq_chunked_state, wrapped in a shared_ptr (unique_ptr doesn't allow you to return incomplete types in any situation where the destructor would need to be called). I'm not super in love with this name, so if anyone has suggestions, let me know.
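
For context, a minimal illustration of the unique_ptr limitation being referred to (explanatory only, not code from this PR):

#include <memory>

struct pq_chunked_state;  // forward declaration; callers never see the definition

// Fine: shared_ptr captures a type-erased deleter inside the .cpp that defines
// pq_chunked_state, so callers can hold and destroy the handle without the type.
std::shared_ptr<pq_chunked_state> begin_write();

void caller()
{
  auto state = begin_write();
}  // OK: shared_ptr runs its stored deleter; no complete type needed here

// A unique_ptr<pq_chunked_state> return type would fail to compile in caller():
// when state goes out of scope, ~unique_ptr must call delete on the incomplete
// type, which the compiler rejects.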

Performance note: the way the writer works, it needs to make assumptions about whether a given column contains nulls anywhere in its global contents, for compression purposes. In the chunked case we don't have that information up front, so the writer assumes the worst case: every column could contain nulls. There is an optimization available to callers who can guarantee ahead of time that all column chunks they will ever pass in do or do not contain nulls: they can fill in per-column information via the table_metadata_with_nullability optional parameter passed in the args field.
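
For example, something along these lines, where the member names and constructor are assumptions based on the description above rather than verbatim cudf code:

namespace io = cudf::experimental::io;  // assumed namespace, as in the earlier sketch

void configure_nullability()
{
  // Declare, for the whole logical table, which columns may ever contain nulls.
  io::table_metadata_with_nullability meta;     // struct named in the note above
  meta.column_nullable = {false, true, false};  // assumed member: only column 1 nullable

  io::write_parquet_chunked_args args{io::sink_info{"out.parquet"}};  // assumed ctor
  args.nullable_metadata = &meta;  // assumed field for the optional parameter

  auto state = io::write_parquet_chunked_begin(args);
  // ...all subsequent write_parquet_chunked() calls must honor these guarantees...
  io::write_parquet_chunked_end(state);
}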

Final note on diffs: the bulk of the non-interface changes here happened in code that got transplanted from writer::impl::write_all to writer::impl::write. If you just look at the final diff, you'll see one big green blob that obscures the detail. To ease this, I've structured the commits so that the initial commit just transplants the code with no changes. If you diff the second commit, labelled "Internal changes to write() code transplanted to write_chunked()", you'll see all the interesting changes.

Commit: …ng key diffs easier
I'm committing this change with the bulk of writer::impl::write() moved into writer::impl::write_chunked(), but without any of the internal changes. The next commit will have all the actual logical changes. Hopefully this makes PR reviews easier.

Commit: …d parquet benchmark to use it. Fixed chunked parquet writer tests to properly use the temp directory for output.
@nvdbaranec added labels 3 - Ready for Review, cuIO, Spark on Feb 6, 2020
@nvdbaranec requested review from a team as code owners on February 6, 2020 at 22:37
* @param[in] state Opaque state information about the writer process. Must be the same pointer returned
* from write_parquet_chunked_begin()
*/
void write_parquet_chunked(table_view const& table, std::shared_ptr<detail::parquet::pq_chunked_state>& state);
Contributor:

Passing a shared_ptr by non-const ref smells fishy. Are you modifying the shared_ptr?

nvdbaranec (Contributor, Author), Feb 6, 2020:

Ultimately, write_parquet_chunked_end() does a reset. The thinking is that the user is just handed a widget they don't need to do anything with except forward along, so the write_chunked_ code does the cleanup itself at the end of the logical operation.

write_parquet_chunked() could use a const, though.

nvdbaranec (Contributor, Author), Feb 14, 2020:

Re-opening. Sorry, my brain wasn't functioning when I read this. The write_parquet_chunked() case doesn't need a reference at all. I'll update it along with the changelog conflict/merge update.
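
For reference, the resolved declaration would presumably take the state by value (a sketch of the change being described, not the final committed code):

void write_parquet_chunked(table_view const& table,
                           std::shared_ptr<detail::parquet::pq_chunked_state> state);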

codecov bot commented Feb 7, 2020:

Codecov Report

Merging #4087 into branch-0.13 will decrease coverage by 0.04%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##           branch-0.13    #4087      +/-   ##
===============================================
- Coverage        86.72%   86.67%   -0.05%     
===============================================
  Files               50       50              
  Lines             9799     9818      +19     
===============================================
+ Hits              8498     8510      +12     
- Misses            1301     1308       +7
Impacted Files Coverage Δ
python/cudf/cudf/comm/serialize.py 0% <ø> (ø) ⬆️
python/cudf/cudf/core/buffer.py 87.5% <100%> (ø) ⬆️
python/cudf/cudf/core/dataframe.py 91.6% <100%> (+0.01%) ⬆️
python/cudf/cudf/core/column/string.py 93.71% <100%> (ø) ⬆️
python/cudf/cudf/core/column/column.py 86.58% <100%> (ø) ⬆️
python/cudf/cudf/core/index.py 89.13% <100%> (+0.18%) ⬆️
python/cudf/cudf/core/dtypes.py 95% <100%> (ø) ⬆️
python/cudf/cudf/core/series.py 92.51% <100%> (ø) ⬆️
python/cudf/cudf/core/groupby/groupby.py 96.23% <100%> (ø) ⬆️
python/cudf/cudf/core/multiindex.py 83.87% <100%> (+0.28%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6132d7a...110057a.

@OlivierNV (Contributor) left a review:

Looks good to me

@jlowe (Contributor) left a review:

Looks OK to me, just a cleanup on a comment.

@kkraus14 added the label 5 - Ready to Merge and removed the label 3 - Ready for Review on Feb 14, 2020
nvdbaranec (Contributor, Author):

It looks like the tests are passing here, but running locally I am seeing a failure in one of the python writer tests: test_parquet_writer_gpu_none_index

@nvdbaranec changed the title from "[REVIEW] Support for large parquet files via chunked writes." to "[WIP] Support for large parquet files via chunked writes." on Feb 14, 2020
@nvdbaranec removed the label 5 - Ready to Merge on Feb 14, 2020
@kkraus14 added the label 2 - In Progress on Feb 14, 2020
@nvdbaranec changed the title from "[WIP] Support for large parquet files via chunked writes." to "[REVIEW] Support for large parquet files via chunked writes." on Feb 14, 2020
@nvdbaranec added the label 5 - Ready to Merge and removed the label 2 - In Progress on Feb 14, 2020
nvdbaranec (Contributor, Author) commented Feb 14, 2020:

Problem identified. I messed up building/running the tests locally. PR is good to go.

@OlivierNV merged commit 6fe4c47 into rapidsai:branch-0.13 on Feb 15, 2020