[REVIEW] Support for large parquet files via chunked writes. #4087
Conversation
…ng key diffs easier I'm committing this change with the bulk of writer::impl::write() moved into writer::impl::write_chunked() but without any of the internal changes. The next commit will have all the actual logical changes. Hopefully makes PR reviews easier.
…d parquet benchmark to use it. Fixed chunked parquet writer tests to properly use the temp directory for output.
cpp/include/cudf/io/functions.hpp
 * @param[in] state Opaque state information about the writer process. Must be the same pointer returned
 *                  from write_parquet_chunked_begin()
 */
void write_parquet_chunked(table_view const& table, std::shared_ptr<detail::parquet::pq_chunked_state>& state);
Passing a shared_ptr by non-const ref smells fishy. Are you modifying the shared_ptr?
Ultimately, write_parquet_chunked_end() does a reset. The thinking is that the user is given a widget they don't need to do anything with except forward along, so the write_chunked_ code does the cleanup itself at the end of the logical operation.
write_parquet_chunked() could use a const though.
Re-opening. Sorry, my brain wasn't functioning when I read this. The write_parquet_chunked() case doesn't need a reference at all. I'll update with the changelog conflict/merge update.
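For reference, a minimal sketch of the two signatures being debated; this is illustrative only and elides the real header context:

```cpp
// Before: state passed by non-const reference, implying the function
// might reseat the shared_ptr (it doesn't).
void write_parquet_chunked(table_view const& table,
                           std::shared_ptr<detail::parquet::pq_chunked_state>& state);

// After: write_parquet_chunked() only uses the state, so passing the
// shared_ptr by value (or const reference) is enough; only
// write_parquet_chunked_end() needs to tear it down.
void write_parquet_chunked(table_view const& table,
                           std::shared_ptr<detail::parquet::pq_chunked_state> state);
```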
Codecov Report
@@             Coverage Diff              @@
##           branch-0.13     #4087     +/- ##
=============================================
- Coverage        86.72%    86.67%    -0.05%
=============================================
  Files               50        50
  Lines             9799      9818       +19
=============================================
+ Hits              8498      8510       +12
- Misses            1301      1308        +7
Continue to review full report at Codecov.
Looks good to me
Looks OK to me, just a cleanup on a comment.
Co-Authored-By: Mark Harris <[email protected]>
It looks like the tests are passing here, but running locally I am seeing a failure in one of the python writer tests: test_parquet_writer_gpu_none_index
Problem identified. I messed up building/running the tests locally. PR is good to go.
Addresses #3542
Implements a chunked write API that allows a single parquet file containing one logical table to be written via multiple write_parquet_chunked() calls from a user. This also refactors the non-chunked write_parquet() path to be implemented in terms of a chunked write with a single chunk.
I also added benchmarking code for both chunked and non-chunked writes. There are some interesting overhead issues with writing many chunks which may be worth investigating and optimizing at a later time.
Key entry points include (see the usage sketch after this list):
functions.hpp:
write_parquet_chunked_begin
write_parquet_chunked
write_parquet_chunked_end
writers_impl.hpp:
parquet::writer::impl::write_chunked_begin
parquet::writer::impl::write_chunked
parquet::writer::impl::write_chunked_end
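A minimal usage sketch of that sequence, assuming an args struct named write_parquet_chunked_args and a sink_info built from a file path; treat the exact argument names and construction as assumptions rather than the verified header:

```cpp
#include <cudf/io/functions.hpp>

#include <vector>

// Hypothetical helper: writes one logical table as a sequence of chunks.
void write_table_in_chunks(std::vector<cudf::table_view> const& chunks)
{
  // Assumed construction; the real args struct may take more options
  // (compression, statistics level, metadata).
  cudf::io::write_parquet_chunked_args args{
    cudf::io::sink_info{"large_table.parquet"}};

  // Opaque pq_chunked_state, returned wrapped in a shared_ptr.
  auto state = cudf::io::write_parquet_chunked_begin(args);

  // Each call appends one chunk to the same logical parquet file.
  for (auto const& chunk : chunks) {
    cudf::io::write_parquet_chunked(chunk, state);
  }

  // Finalizes the file; the state is reset internally, so the caller
  // has nothing to clean up.
  cudf::io::write_parquet_chunked_end(state);
}
```

The refactored non-chunked write_parquet() path reduces to this same sequence with a single chunk.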
State information about the writes is returned via an opaque struct, pq_chunked_state, wrapped in a shared_ptr (unique_ptr doesn't allow returning incomplete types in any situation where the destructor would need to be called). I'm not in love with this name, so if anyone has suggestions, let me know.
Performance note: the writer needs to make assumptions about whether a given column contains nulls anywhere in the file, for compression purposes. In the chunked case we don't have that information up front, so the writer assumes the worst case: every column could contain nulls. There is an optimization available to callers who can guarantee ahead of time that all of the column chunks they will ever pass in do or do not contain nulls: they can fill in per-column information via the table_metadata_with_nullability optional parameter passed in the args field.
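A hedged sketch of that optimization; the member names below (column_nullable, nullable_metadata) are assumptions about the metadata and args structs, not verified fields:

```cpp
// Caller promises, up front, which columns will ever contain nulls across
// all chunks. Member names are assumed for illustration.
cudf::io::table_metadata_with_nullability meta;
meta.column_nullable = {false, true};  // column 0 never null, column 1 may be

cudf::io::write_parquet_chunked_args args{
  cudf::io::sink_info{"large_table.parquet"}};
args.nullable_metadata = &meta;  // assumed field on the args struct

// The writer can now skip the worst-case "every column is nullable" path.
auto state = cudf::io::write_parquet_chunked_begin(args);
```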
Final note on diffs: the bulk of the non-interface changes happened in the code that was transplanted from writer::impl::write_all to writer::impl::write. If you just look at the final diff, you will see a big green blob that hides some of the detail. To ease this, I've structured the commits so that the initial commit just transplants the code with no changes; if you diff the second commit, labelled "Internal changes to write() code transplanted to write_chunked()", you'll see all the interesting changes.