-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length #4108
Conversation
✅ Deploy Preview for meta-velox canceled.
|
225da2d
to
eaf2554
Compare
52b866b
to
9754282
Compare
a9265db
to
5ebe4cd
Compare
b757240
to
d346e91
Compare
d346e91
to
4ad9ed9
Compare
…orrect Header Length The size of Parquet Page Header remains unknown until the Header is parsed because it is in Thrift format. Without this PR, the size is estimated by sizeof(PageHeader) but the actual Parquet Page Header might be larger. It can be reproduced by using the given largePageHeader. This PR takes the advantage of ThriftStreamingTransport with the input of inputStream_ to avoid estimating the Parquet Page Length. In this case, the incorrect Parquet Page Header length can be resolved. Also, this approach avoids unnecessary deep copy when reading Parquet Page Headers. Without this PR, deep copying might happen when preparing the input data from in-consecutive chunks for ThriftBufferedTransport. This PR also introduced the related Parquet page header unit tests on small, large, and corrupted Parquet Page Headers.
0e1a458
to
101a05f
Compare
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…to Fix the Incorrect Header Length facebookincubator#4108
…g data (#145) * Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160 * Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
…g data (oap-project#145) * Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160 * Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
…g data (oap-project#145) * Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160 * Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
@liushengxuan, thanks for your work! |
@liushengxuan My guess is that it was able to inline the |
if (bufferEnd_ == bufferStart_) { | ||
const void* buffer; | ||
int32_t size; | ||
inputStream_->Next(&buffer, &size); | ||
bufferStart_ = reinterpret_cast<const char*>(buffer); | ||
bufferEnd_ = bufferStart_ + size; | ||
} | ||
if (bufferEnd_ - bufferStart_ >= sizeof(PageHeader)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should keep this branch for hot path
auto thriftProtocol = | ||
std::make_unique<apache::thrift::protocol::TCompactProtocolT< | ||
thrift::ThriftBufferedTransport>>(thriftTransport); | ||
std::shared_ptr<thrift::ThriftTransport> thriftTransport = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also undo this should probably could make it a little faster
thrift::ThriftBufferedTransport>>(transport); | ||
} else { | ||
dwio::common::readBytes( | ||
std::min<int64_t>(remainingSize, sizeof(PageHeader)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably also need some larger initial read when reading from SeekableInputStream
cc: @oerling |
The read of ThriftStreamingTransport occurs at 0.09% in the linux perf profile for running Velox_tpch_benchmark Q1. It can have no bearing on performance.
The function is almost always called with a length of 1. So making a special case for that would make it faster. But it is already way below measurement threshold.
A 30% drop for Q1 is woryth noting but this is not a valid reason for that. What is the linux perf profile for running this?
Thanks
Orri
From: Jimmy Lu ***@***.***>
Sent: Wednesday, March 22, 2023 12:00 PM
To: facebookincubator/velox ***@***.***>
Cc: oerling ***@***.***>; Mention ***@***.***>
Subject: Re: [facebookincubator/velox] Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length (PR #4108)
cc: @oerling <https://github.com/oerling>
—
Reply to this email directly, view it on GitHub <#4108 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKPPPT2PCF6JGZXRQPQFMDTW5ND4RANCNFSM6AAAAAAVD4FCLA> .
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@PHILO-HE @Yuhta @oerling We also did the performance test on our side. We did not observe the regression. Also, I agrees with @oerling ‘s opinion. Even though there is a regression, reading page headers should only account a very small portion and is not able to make a 30% E2E regression. @PHILO-HE can you please double check you benchmark testing? Thanks! |
Thanks to everyone for your comment! |
The size of Parquet Page Header remains unknown until the Header is parsed because it is in Thrift format. Without this PR, the size is estimated by
sizeof(PageHeader)
but the actual Parquet Page Header might be larger. It can be reproduced by using the givenlargePageHeader
. This PR takes the advantage ofThriftStreamingTransport
with the input ofinputStream_
to avoid estimating the Parquet Page Length. In this case, the incorrect Parquet Page Header length can be resolved.Also, this approach avoids unnecessary deep copy when reading Parquet Page Headers. Without this PR, deep copying might happen when preparing the input data from in-consecutive chunks for
ThriftBufferedTransport
.This PR also introduced the related Parquet page header unit tests on small, large, and corrupted Parquet Page Headers.