Access metadata of flushed row groups on write (#1691) #1774

Merged
merged 2 commits into from
Jun 6, 2022


@tustvold tustvold commented Jun 2, 2022

Which issue does this PR close?

Closes #1691

@pacman82 please let me know if this provides the necessary functionality

Rationale for this change

Provides information on data that has already been written, to allow for adaptive writing policies

What changes are included in this PR?

Adds the ability to fetch metadata about flushed row groups

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 2, 2022
@codecov-commenter
codecov-commenter commented Jun 2, 2022

Codecov Report

Merging #1774 (a3701af) into master (c1a91dc) will decrease coverage by 0.00%.
The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master    #1774      +/-   ##
==========================================
- Coverage   83.39%   83.39%   -0.01%     
==========================================
  Files         198      198              
  Lines       56142    56212      +70     
==========================================
+ Hits        46821    46876      +55     
- Misses       9321     9336      +15     
Impacted Files                            Coverage Δ
parquet/src/arrow/arrow_writer.rs         97.53% <0.00%> (-0.24%) ⬇️
parquet/src/file/metadata.rs              95.12% <ø> (ø)
parquet/src/schema/types.rs               83.73% <ø> (ø)
parquet/src/file/writer.rs                92.92% <100.00%> (+0.06%) ⬆️
parquet/src/util/cursor.rs                62.18% <0.00%> (-1.69%) ⬇️
parquet/src/file/serialized_reader.rs     94.46% <0.00%> (-1.17%) ⬇️
parquet_derive/src/parquet_field.rs       65.98% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs         93.46% <0.00%> (-0.20%) ⬇️
arrow/src/ipc/reader.rs                   90.73% <0.00%> (-0.11%) ⬇️
... and 16 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1a91dc...a3701af. Read the comment docs.

@tustvold tustvold force-pushed the access-metadata-flushed-row-groups branch from 6c69397 to 4cad205 Compare June 2, 2022 16:34

@alamb alamb left a comment

I also recommend some sort of test that uses this API (maybe also demonstrating its use for adaptive writing)

@pacman82
pacman82 commented Jun 6, 2022

@tustvold For now, getting the compressed_size of each row group after I have written it did the trick for me. End to end, my use case is about creating files of roughly the same size while streaming data from a database. My current solution works like this:

  1. Accumulate the compressed size of each row group written.
  2. If the sum of the compressed sizes goes over a threshold, reset it to zero and start writing the next row group into a new file.
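
The two steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not code from this PR or from the parquet crate: `FileRoller`, `record_row_group`, and `rollovers` are hypothetical names, and the compressed sizes are assumed to come from whatever metadata the writer reports after each row group is flushed.

```rust
/// Hypothetical helper (not part of the parquet crate): tracks the compressed
/// bytes written to the current file and signals when to roll over.
#[derive(Debug)]
struct FileRoller {
    threshold_bytes: u64,
    bytes_in_current_file: u64,
    rollovers: usize,
}

impl FileRoller {
    fn new(threshold_bytes: u64) -> Self {
        Self { threshold_bytes, bytes_in_current_file: 0, rollovers: 0 }
    }

    /// Record the compressed size of a row group that was just written.
    /// Returns true when the running sum has gone over the threshold; the sum
    /// is then reset and the caller should write the next row group to a new file.
    fn record_row_group(&mut self, compressed_size: u64) -> bool {
        self.bytes_in_current_file += compressed_size;
        if self.bytes_in_current_file > self.threshold_bytes {
            self.bytes_in_current_file = 0;
            self.rollovers += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Assumed sizes; in practice they would be read from row group metadata.
    let mut roller = FileRoller::new(100);
    for size in [40u64, 50, 30, 20, 90] {
        let rolled = roller.record_row_group(size);
        println!("wrote row group of {size} bytes, start new file next: {rolled}");
    }
    println!("rollovers: {}", roller.rollovers);
}
```

Because the check runs after a row group is written, each file ends up somewhat larger than the threshold, by at most the size of its last row group.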

Maybe this interface would help me simplify things? Or is there a way to be more "precise" in the resulting file size? Anyhow, I do not see one at the moment.

Cheers, Markus

@tustvold
tustvold commented Jun 6, 2022

> Maybe this interface would help me simplify things

It would theoretically allow for more complex heuristics, but ultimately it is just a different way to access the data returned by row_group_writer.close().
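
As a rough illustration of "a different way to access the same data", here is a toy model using only the standard library. It is not the parquet crate's API: `ToyFileWriter`, `RowGroupMeta`, `close_row_group`, and `flushed_metadata` are hypothetical names chosen for the sketch, and the real metadata type carries far more than two fields.

```rust
/// Hypothetical stand-in for row group metadata.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    num_rows: u64,
    compressed_size: u64,
}

/// Toy file writer that remembers the metadata of every flushed row group.
#[derive(Default)]
struct ToyFileWriter {
    flushed: Vec<RowGroupMeta>,
}

impl ToyFileWriter {
    /// Route 1: closing a row group hands its metadata back to the caller.
    fn close_row_group(&mut self, num_rows: u64, compressed_size: u64) -> RowGroupMeta {
        let meta = RowGroupMeta { num_rows, compressed_size };
        self.flushed.push(meta.clone());
        meta
    }

    /// Route 2: the same metadata, read back from the writer afterwards.
    fn flushed_metadata(&self) -> &[RowGroupMeta] {
        &self.flushed
    }
}

fn main() {
    let mut writer = ToyFileWriter::default();
    let first = writer.close_row_group(1_000, 4_096);
    let _second = writer.close_row_group(500, 2_048);

    // Both routes expose identical information.
    assert_eq!(writer.flushed_metadata()[0], first);

    let total: u64 = writer
        .flushed_metadata()
        .iter()
        .map(|m| m.compressed_size)
        .sum();
    println!("compressed bytes flushed so far: {total}");
}
```

The accessor mainly helps when the caller does not close row groups itself and so never sees the return value of the close call.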

> Or is there a way to be more "precise" in the resulting file size

Not easily. It is very hard to predict the encoded size accurately, especially with block compression. Internally, pages are handled by writing in small batches and tracking the size as the writer goes, which is much the same as your approach for row groups. TL;DR: I think your approach sounds sensible.
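
The "write in small batches and track the size" idea can be sketched as below. This is a simplified model, not the crate's actual page writer: `PageBuilder` and its fields are hypothetical, and the stand-in encoding (8 little-endian bytes per value) is an assumption that makes sizes predictable here, which real encodings and block compression do not.

```rust
/// Toy page builder: encodes values in small batches and checks the encoded
/// size only after each batch, mirroring "track the size as you go".
struct PageBuilder {
    page_size_limit: usize,
    batch_size: usize,
    buffer: Vec<u8>,
    pages: Vec<Vec<u8>>,
}

impl PageBuilder {
    fn new(page_size_limit: usize, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self { page_size_limit, batch_size, buffer: Vec::new(), pages: Vec::new() }
    }

    /// Stand-in encoding (assumption): 8 little-endian bytes per value.
    fn encode(batch: &[u64], out: &mut Vec<u8>) {
        for v in batch {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }

    fn write(&mut self, values: &[u64]) {
        for batch in values.chunks(self.batch_size) {
            Self::encode(batch, &mut self.buffer);
            // The size is only known after encoding, so a page may exceed
            // the limit by up to one batch before it is cut.
            if self.buffer.len() >= self.page_size_limit {
                self.pages.push(std::mem::take(&mut self.buffer));
            }
        }
    }

    /// Flush any remaining bytes as a final page and return all pages.
    fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.buffer.is_empty() {
            self.pages.push(std::mem::take(&mut self.buffer));
        }
        self.pages
    }
}

fn main() {
    let mut builder = PageBuilder::new(64, 4);
    let values: Vec<u64> = (0..20).collect();
    builder.write(&values);
    for (i, page) in builder.finish().iter().enumerate() {
        println!("page {i}: {} bytes", page.len());
    }
}
```

Smaller batches give tighter control over the final page size at the cost of more frequent size checks, which is the same trade-off as choosing smaller row groups when rolling files by size.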

@tustvold tustvold merged commit d4df1d9 into apache:master Jun 6, 2022
@alamb
alamb commented Jun 6, 2022

🎉

Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Make current position available in FileWriter.
5 participants