Access metadata of flushed row groups on write (#1691) #1774
Codecov Report
@@            Coverage Diff             @@
##           master    #1774      +/-   ##
==========================================
- Coverage   83.39%   83.39%   -0.01%
==========================================
  Files         198      198
  Lines       56142    56212      +70
==========================================
+ Hits        46821    46876      +55
- Misses       9321     9336      +15
6c69397 to 4cad205
I also recommend some sort of test that uses this API (maybe also demonstrating its use for adaptive writing)
@tustvold For now getting the
Maybe this interface would help me simplify things? Or is there a way to be more "precise" about the resulting file size? Anyhow, I do not see it at the moment. Cheers, Markus
It would theoretically allow for more complex heuristics, but ultimately it is just a different way to access the data returned by row_group_writer.close()
Not easily; it is very hard to predict the encoded size accurately, especially with block compression. Internally, this is handled for pages by writing in small batches and tracking the size as they go, which is much the same as your approach for row groups. TL;DR: I think your approach sounds sensible.
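To make the "write in small batches and track the size" idea concrete, here is a minimal sketch of a size heuristic built on the sizes of row groups that have already been flushed. `FlushedRowGroup`, `observed_bytes_per_row`, and `remaining_row_budget` are hypothetical names invented for this illustration and are not part of the parquet crate; the only assumption is that each flushed row group reports a row count and a compressed size.

```rust
/// Hypothetical summary of one flushed row group.
/// This is an illustration type, not a type from the parquet crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FlushedRowGroup {
    pub num_rows: u64,
    pub compressed_bytes: u64,
}

/// Average encoded bytes per row observed so far.
/// Returns `None` when no rows have been flushed yet.
pub fn observed_bytes_per_row(flushed: &[FlushedRowGroup]) -> Option<f64> {
    let rows: u64 = flushed.iter().map(|g| g.num_rows).sum();
    let bytes: u64 = flushed.iter().map(|g| g.compressed_bytes).sum();
    if rows == 0 {
        None
    } else {
        Some(bytes as f64 / rows as f64)
    }
}

/// Estimate how many more rows fit before the file reaches
/// `target_file_bytes`, extrapolating from the flushed row groups.
/// This is only an estimate: the encoded size of future rows may differ.
pub fn remaining_row_budget(
    flushed: &[FlushedRowGroup],
    target_file_bytes: u64,
) -> Option<u64> {
    let per_row = observed_bytes_per_row(flushed)?;
    let written: u64 = flushed.iter().map(|g| g.compressed_bytes).sum();
    // Saturate at zero if the target has already been exceeded.
    let remaining = target_file_bytes.saturating_sub(written);
    // Float-to-integer casts saturate in Rust, so a zero per-row size
    // yields u64::MAX rather than a panic.
    Some((remaining as f64 / per_row).floor() as u64)
}
```

A caller could re-evaluate the budget after every flushed row group and start a new file once it reaches zero. Since the estimate only reflects data already written, the resulting file size stays approximate, in line with the comment above.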
Which issue does this PR close?
Closes #1691
@pacman82 please let me know if this provides the necessary functionality
Rationale for this change
Provides information on data that has already been written, to allow for adaptive writing policies
What changes are included in this PR?
Adds the ability to fetch metadata about flushed row groups
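As a rough illustration of the shape of such an accessor, the toy model below keeps the metadata of every flushed row group on the file writer, so it can be read while writing is still in progress rather than only from the value returned when a row group is closed. `ToyRowGroupMeta`, `ToyFileWriter`, and all of its methods are hypothetical stand-ins written for this sketch; they do not reproduce the actual types or signatures in the parquet crate.

```rust
/// Hypothetical stand-in for row group metadata (not a parquet crate type).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyRowGroupMeta {
    pub ordinal: usize,
    pub num_rows: u64,
    pub total_byte_size: u64,
}

/// Toy file writer that records metadata for each row group as it is flushed.
#[derive(Default)]
pub struct ToyFileWriter {
    flushed: Vec<ToyRowGroupMeta>,
}

impl ToyFileWriter {
    pub fn new() -> Self {
        Self::default()
    }

    /// Pretend to flush a row group. Returns its metadata, mirroring what
    /// closing a row group writer would hand back to the caller.
    pub fn flush_row_group(&mut self, num_rows: u64, total_byte_size: u64) -> ToyRowGroupMeta {
        let meta = ToyRowGroupMeta {
            ordinal: self.flushed.len(),
            num_rows,
            total_byte_size,
        };
        self.flushed.push(meta.clone());
        meta
    }

    /// Metadata of all row groups flushed so far, readable mid-write.
    pub fn flushed_row_groups(&self) -> &[ToyRowGroupMeta] {
        &self.flushed
    }

    /// Finish the file and hand back the metadata of every row group.
    pub fn close(self) -> Vec<ToyRowGroupMeta> {
        self.flushed
    }
}
```

With an accessor like this, the caller does not need to collect the per-row-group return values itself; it can sum `total_byte_size` over `flushed_row_groups()` whenever it needs to decide whether to keep writing to the current file.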
Are there any user-facing changes?
No