Improve VarBinaryDecoder::Take performance by accumulating small batches #239
ccb1d88 to 0340337
ARROW_ASSIGN_OR_RAISE(auto positions, ReadPositions(start, length));
assert(positions->length() == length + 1);

const int64_t kMinimalBatchBytes = 128 * 1024;  // 128 KiB
Does this need to be enforced elsewhere? If so, do we need a global constant?
This is an optimization; it does not change behavior.
Long term, this could become a runtime configuration, though.
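To illustrate what a byte threshold like this can be used for, here is a minimal sketch of coalescing nearby rows into one read. It is not the code in this PR: `ReadBatch`, `PlanBatches`, and `max_span_bytes` are hypothetical names, and the sketch assumes the threshold caps the byte span of a single read and that the indices are already sorted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one coalesced read.
struct ReadBatch {
  int64_t begin_byte;            // first byte covered by the read
  int64_t end_byte;              // one past the last byte covered
  std::vector<int32_t> indices;  // row indices served by this read
};

// Hypothetical helper (assumption, not the PR's implementation):
// walks sorted row indices and extends the current batch while the
// total span from its first byte stays within max_span_bytes.
// A single value larger than the threshold still gets its own batch.
std::vector<ReadBatch> PlanBatches(const std::vector<int64_t>& offsets,
                                   const std::vector<int32_t>& sorted_indices,
                                   int64_t max_span_bytes) {
  std::vector<ReadBatch> batches;
  for (int32_t idx : sorted_indices) {
    const int64_t begin = offsets[idx];
    const int64_t end = offsets[idx + 1];
    if (!batches.empty() && end - batches.back().begin_byte <= max_span_bytes) {
      // Close enough to the current batch: serve it from the same read.
      batches.back().end_byte = end;
      batches.back().indices.push_back(idx);
    } else {
      // Too far away (or first index): start a new read.
      batches.push_back(ReadBatch{begin, end, {idx}});
    }
  }
  return batches;
}
```

With offsets `{0, 10, 20, 100, 110}`, indices `{0, 1, 3}`, and a 32-byte threshold, rows 0 and 1 share one read and row 3 gets its own. Fewer, larger reads are the point of such a threshold; the value returned for each row is unchanged.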
template <ArrowType T>
::arrow::Result<std::shared_ptr<::arrow::Int64Array>> VarBinaryDecoder<T>::ReadPositions(
    int32_t start, int32_t length) const {
  auto buf = infile_->ReadAt(position_ + start * sizeof(int64_t), (length + 1) * sizeof(int64_t));
Why do we need to pass in length + 1 here? Isn't length * sizeof(int64_t) the number of bytes that need to be read?
For an array like ["aa", "bbb", "cccc"], the offset array should look like [0, 2, 5, 9], where [5, 9] gives the start and end of the "cccc" string. Each value needs both a start and an end offset, so reading length values requires length + 1 offsets.
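The same example as a small standalone sketch. `BuildOffsets` and `ValueAt` are hypothetical helpers written for illustration only; they are not part of this codebase or of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: builds the offsets for a list of strings.
// n values always yield n + 1 offsets, because the last value
// needs an end position as well as a start position.
std::vector<int64_t> BuildOffsets(const std::vector<std::string>& values) {
  std::vector<int64_t> offsets;
  offsets.reserve(values.size() + 1);
  int64_t pos = 0;
  offsets.push_back(pos);
  for (const auto& v : values) {
    pos += static_cast<int64_t>(v.size());
    offsets.push_back(pos);
  }
  return offsets;
}

// Hypothetical helper: extracts the i-th value from the concatenated
// bytes using offsets[i] (start) and offsets[i + 1] (end).
std::string ValueAt(const std::string& data,
                    const std::vector<int64_t>& offsets, std::size_t i) {
  const auto start = static_cast<std::size_t>(offsets[i]);
  const auto end = static_cast<std::size_t>(offsets[i + 1]);
  return data.substr(start, end - start);
}
```

Calling `BuildOffsets({"aa", "bbb", "cccc"})` gives `{0, 2, 5, 9}`, and `ValueAt("aabbbcccc", offsets, 2)` uses the pair `[5, 9]` to return `"cccc"`. Without the extra trailing offset, the end of the last value would be unknown.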
Improves the query
SELECT external_image FROM oxford_pet WHERE class = 'pug'
by 6x (12 s -> 2 s) on a home computer.