Improve ArrowWriter memory usage: Buffer Pages in ArrowWriter instead of RecordBatch (#3871) #4280
@tustvold -- I am having some trouble with this API when using it in IOx -- it always returns zero (see https://github.com/influxdata/influxdb_iox/pull/7880/files#r1207315133).
I wrote a test for these APIs targeting this PR in tustvold#63 -- perhaps you can have a look and tell me if I am doing something wrong?
Aah, this is an oversight on my part. The reported size is just the size of the flushed pages; any data buffered but not yet flushed to a page is not counted. That unflushed data should be at most one page per column, and so should add less than the max page size (1MB) per column.
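A back-of-envelope sketch of that bound (assuming the `in_progress_size` API under discussion here; the 1 MiB pad per column reflects the default page size limit mentioned above, not anything the writer reports):

```rust
use std::sync::Arc;

use arrow_array::{Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let batch = RecordBatch::try_from_iter([(
        "x",
        Arc::new(Int64Array::from(vec![1, 2, 3])) as _,
    )])?;

    let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None)?;
    writer.write(&batch)?;

    // The reported size covers flushed pages only, so pad by one
    // worst-case unflushed page per column (1 MiB, the default limit).
    let bound = writer.in_progress_size() + batch.num_columns() * 1024 * 1024;
    println!("conservative upper bound: {bound} bytes");

    writer.close()?;
    Ok(())
}
```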
Can we make this API return the upper bound? Otherwise I am not sure how useful it is 🤔
To be clear, the use case is: "I want to ensure that when writing many parquet files concurrently we don't exceed some memory limit (so that the process doing so isn't killed by k8s / the operating system)."
This doesn't need to be super accurate, just enough of an upper bound to achieve the above goal -- something like the sketch below.
If it would be OK to add a 1MB overhead for each column (e.g. the PAGE_SIZE, or wherever that buffer is defined), I can try and propose a patch to do so.
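A minimal sketch of that use case, assuming the `in_progress_size` API from this PR; the helper name and budget value are hypothetical:

```rust
use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Hypothetical helper: write a batch, then flush the in-progress row group
/// whenever the writer's reported buffered size exceeds a caller-chosen budget.
fn write_with_budget<W: std::io::Write + Send>(
    writer: &mut ArrowWriter<W>,
    batch: &RecordBatch,
    budget_bytes: usize,
) -> Result<()> {
    writer.write(batch)?;
    if writer.in_progress_size() > budget_bytes {
        // Flushing moves the buffered pages out to the underlying sink,
        // bounding resident memory at the cost of smaller row groups.
        writer.flush()?;
    }
    Ok(())
}
```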
I need to think about how best to support an upper bound; we don't currently expose this information from the encoders. I had assumed, perhaps incorrectly, that an order-of-magnitude value was sufficient. Ultimately, if you're at a point where an individual page matters, you're going to have problems...
If the motivation is testing, you could always artificially lower the maximum page size, as in the sketch below.
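For example (a hedged sketch; `set_data_page_size_limit` is the existing `WriterProperties` knob for the maximum page size, and the values here are arbitrary test choices):

```rust
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Shrink the data page size limit so pages flush (and become visible
    // to the writer's size accounting) after only a little data. 1024 bytes
    // is an arbitrary test value; the default is on the order of 1 MiB.
    let props = WriterProperties::builder()
        .set_data_page_size_limit(1024)
        .build();

    let batch = RecordBatch::try_from_iter([(
        "v",
        Arc::new(Int32Array::from_iter_values(0..10_000)) as _,
    )])?;

    let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```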
I've included tustvold#63 in 4683a30, along with a more accurate form of memory estimation - PTAL
I've left this pub(crate) as I'm not sure about the returned value; it is potentially misleading, as it doesn't include the impact of any block compression on the size of the final output.
As the encoders all buffer encoded data internally, this is a fairly accurate estimation of the memory usage.
Edit: it will still be an underestimate, as the allocations may be larger than needed, but it is fairly accurate and doesn't require any breaking changes.