
[REVIEW] Support for large parquet files via chunked writes. #4087

Merged: 15 commits merged into rapidsai:branch-0.13 on Feb 15, 2020

Conversation

nvdbaranec (Contributor):

Addresses #3542

Implements a chunked write API that allows a single parquet file containing one logical table to be written via multiple write_parquet_chunked() calls from the user. This also refactors the non-chunked write_parquet() path to be implemented in terms of a chunked write with 1 chunk.

I also added benchmarking code for both chunked and non-chunked writes. There are some interesting overhead issues with writing many chunks, which may be worth investigating and optimizing at a later time.

Key entry points include (a usage sketch follows this list):

functions.hpp:
write_parquet_chunked_begin
write_parquet_chunked
write_parquet_chunked_end

writer_impl.hpp:
parquet::writer::impl::write_chunked_begin
parquet::writer::impl::write_chunked
parquet::writer::impl::write_chunked_end
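
Roughly, the intended call pattern is sketched below. The namespace alias, the args struct name, and the include paths are assumptions pieced together from the entry points above and the declaration quoted later in this thread, not verbatim cudf code:

#include <cudf/io/functions.hpp>      // assumed include path, per functions.hpp above
#include <cudf/table/table_view.hpp>  // assumed include path for table_view
#include <memory>
#include <vector>

namespace io = cudf::experimental::io;  // assumed namespace for this API

// Hedged usage sketch: write one logical table as a sequence of chunks.
void write_table_in_chunks(std::vector<cudf::table_view> const& chunks,
                           io::write_parquet_chunked_args const& args)  // args type assumed
{
  // Begin: open the sink, write the file header, hand back opaque state.
  auto state = io::write_parquet_chunked_begin(args);

  // Each call appends one chunk of the same logical table to the file.
  for (auto const& chunk : chunks) {
    io::write_parquet_chunked(chunk, state);
  }

  // End: write the file footer and release the state.
  io::write_parquet_chunked_end(state);
}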

State information about the writes is returned via an opaque struct, pq_chunked_state, wrapped in a shared_ptr (unique_ptr doesn't allow you to return incomplete types in any situation where the destructor would need to be called). I'm not super in love with this name, so if anyone has suggestions, let me know.
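
For context, a minimal illustration of the unique_ptr limitation being referred to (explanatory only, not code from this PR):

#include <memory>

struct pq_chunked_state;  // forward declaration; callers never see the definition

// Fine: shared_ptr captures a type-erased deleter inside the .cpp that defines
// pq_chunked_state, so callers can hold and destroy the handle without the type.
std::shared_ptr<pq_chunked_state> begin_write();

void caller()
{
  auto state = begin_write();
}  // OK: shared_ptr runs its stored deleter; no complete type needed here

// A unique_ptr<pq_chunked_state> return type would fail to compile in caller():
// when state goes out of scope, ~unique_ptr must call delete on the incomplete
// type, which the compiler rejects.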

Performance note: the way the writer works, it needs to make assumptions about whether a given column contains nulls anywhere in its global contents, for compression purposes. In the chunked case we don't have that information up front, so the writer assumes the worst case: every column could contain nulls. There is an optimization available to callers who can guarantee ahead of time that all column chunks they will ever pass in do or do not contain nulls: they can fill in per-column information via the table_metadata_with_nullability optional parameter passed in the args field.
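
For example, something along these lines, where the member names and constructor are assumptions based on the description above rather than verbatim cudf code:

namespace io = cudf::experimental::io;  // assumed namespace, as in the earlier sketch

void configure_nullability()
{
  // Declare, for the whole logical table, which columns may ever contain nulls.
  io::table_metadata_with_nullability meta;     // struct named in the note above
  meta.column_nullable = {false, true, false};  // assumed member: only column 1 nullable

  io::write_parquet_chunked_args args{io::sink_info{"out.parquet"}};  // assumed ctor
  args.nullable_metadata = &meta;  // assumed field for the optional parameter

  auto state = io::write_parquet_chunked_begin(args);
  // ...all subsequent write_parquet_chunked() calls must honor these guarantees...
  io::write_parquet_chunked_end(state);
}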

Final note on diffs: the bulk of the non-interface changes here happened in code that got transplanted from writer::impl::write_all to writer::impl::write. If you just look at the final diff, you'll see one big green blob that obscures the detail. To ease this, I've structured the commits so that the initial commit just transplants the code with no changes. If you diff the second commit, labelled "Internal changes to write() code transplanted to write_chunked()", you'll see all the interesting changes.

Commit: …ng key diffs easier
I'm committing this change with the bulk of writer::impl::write() moved into writer::impl::write_chunked(), but without any of the internal changes. The next commit will have all the actual logical changes. Hopefully this makes PR reviews easier.

Commit: …d parquet benchmark to use it. Fixed chunked parquet writer tests to properly use the temp directory for output.
@nvdbaranec added labels 3 - Ready for Review, cuIO, Spark on Feb 6, 2020
@nvdbaranec requested review from a team as code owners on February 6, 2020 at 22:37
* @param[in] state Opaque state information about the writer process. Must be the same pointer returned
* from write_parquet_chunked_begin()
*/
void write_parquet_chunked(table_view const& table, std::shared_ptr<detail::parquet::pq_chunked_state>& state);
Contributor:

Passing a shared_ptr by non-const ref smells fishy. Are you modifying the shared_ptr?

nvdbaranec (Contributor, Author), Feb 6, 2020:

Ultimately, write_parquet_chunked_end() does a reset. The thinking is that the user is just handed a widget they don't need to do anything with except forward along, so the write_chunked_ code does the cleanup itself at the end of the logical operation.

write_parquet_chunked() could use a const, though.

nvdbaranec (Contributor, Author), Feb 14, 2020:

Re-opening. Sorry, my brain wasn't functioning when I read this. The write_parquet_chunked() case doesn't need a reference at all. I'll update it along with the changelog conflict/merge update.
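
For reference, the resolved declaration would presumably take the state by value (a sketch of the change being described, not the final committed code):

void write_parquet_chunked(table_view const& table,
                           std::shared_ptr<detail::parquet::pq_chunked_state> state);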

codecov bot commented Feb 7, 2020:

Codecov Report

Merging #4087 into branch-0.13 will decrease coverage by 0.04%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##           branch-0.13    #4087      +/-   ##
===============================================
- Coverage        86.72%   86.67%   -0.05%     
===============================================
  Files               50       50              
  Lines             9799     9818      +19     
===============================================
+ Hits              8498     8510      +12     
- Misses            1301     1308       +7
Impacted Files Coverage Δ
python/cudf/cudf/comm/serialize.py 0% <ø> (ø) ⬆️
python/cudf/cudf/core/buffer.py 87.5% <100%> (ø) ⬆️
python/cudf/cudf/core/dataframe.py 91.6% <100%> (+0.01%) ⬆️
python/cudf/cudf/core/column/string.py 93.71% <100%> (ø) ⬆️
python/cudf/cudf/core/column/column.py 86.58% <100%> (ø) ⬆️
python/cudf/cudf/core/index.py 89.13% <100%> (+0.18%) ⬆️
python/cudf/cudf/core/dtypes.py 95% <100%> (ø) ⬆️
python/cudf/cudf/core/series.py 92.51% <100%> (ø) ⬆️
python/cudf/cudf/core/groupby/groupby.py 96.23% <100%> (ø) ⬆️
python/cudf/cudf/core/multiindex.py 83.87% <100%> (+0.28%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6132d7a...110057a.

@OlivierNV (Contributor) left a review:

Looks good to me

@jlowe (Contributor) left a review:

Looks OK to me, just a cleanup on a comment.

@kkraus14 added the label 5 - Ready to Merge and removed the label 3 - Ready for Review on Feb 14, 2020
nvdbaranec (Contributor, Author):

It looks like the tests are passing here, but running locally I am seeing a failure in one of the python writer tests: test_parquet_writer_gpu_none_index

@nvdbaranec changed the title from "[REVIEW] Support for large parquet files via chunked writes." to "[WIP] Support for large parquet files via chunked writes." on Feb 14, 2020
@nvdbaranec removed the label 5 - Ready to Merge on Feb 14, 2020
@kkraus14 added the label 2 - In Progress on Feb 14, 2020
@nvdbaranec changed the title from "[WIP] Support for large parquet files via chunked writes." to "[REVIEW] Support for large parquet files via chunked writes." on Feb 14, 2020
@nvdbaranec added the label 5 - Ready to Merge and removed the label 2 - In Progress on Feb 14, 2020
nvdbaranec (Contributor, Author) commented Feb 14, 2020:

Problem identified. I messed up building/running the tests locally. PR is good to go.

@OlivierNV merged commit 6fe4c47 into rapidsai:branch-0.13 on Feb 15, 2020