Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length #4108

Conversation

liushengxuan
Copy link
Contributor

@liushengxuan liushengxuan commented Feb 22, 2023

The size of Parquet Page Header remains unknown until the Header is parsed because it is in Thrift format. Without this PR, the size is estimated by sizeof(PageHeader) but the actual Parquet Page Header might be larger. It can be reproduced by using the given largePageHeader. This PR takes the advantage of ThriftStreamingTransport with the input of inputStream_ to avoid estimating the Parquet Page Length. In this case, the incorrect Parquet Page Header length can be resolved.

Also, this approach avoids unnecessary deep copy when reading Parquet Page Headers. Without this PR, deep copying might happen when preparing the input data from in-consecutive chunks for ThriftBufferedTransport.

This PR also introduced the related Parquet page header unit tests on small, large, and corrupted Parquet Page Headers.

@netlify
Copy link

netlify bot commented Feb 22, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 101a05f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6406747f95d9eb0008675c35

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 22, 2023
@liushengxuan liushengxuan changed the title Use adaptive approach to determine the page header size Use adaptive approach to determine the Parquet page header size Feb 22, 2023
@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch 7 times, most recently from 225da2d to eaf2554 Compare February 22, 2023 06:20
@liushengxuan
Copy link
Contributor Author

@yingsu00 @Yuhta

@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch 4 times, most recently from 52b866b to 9754282 Compare March 2, 2023 20:31
@liushengxuan liushengxuan changed the title Use adaptive approach to determine the Parquet page header size Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length Mar 2, 2023
@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch 2 times, most recently from a9265db to 5ebe4cd Compare March 3, 2023 07:50
@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch 3 times, most recently from b757240 to d346e91 Compare March 6, 2023 18:44
@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch from d346e91 to 4ad9ed9 Compare March 6, 2023 23:11
@liushengxuan liushengxuan reopened this Mar 6, 2023
…orrect Header Length

The size of Parquet Page Header remains unknown until the Header is parsed
because it is in Thrift format. Without this PR, the size is estimated by
sizeof(PageHeader) but the actual Parquet Page Header might be larger. It
can be reproduced by using the given largePageHeader. This PR takes the
advantage of ThriftStreamingTransport with the input of inputStream_ to
avoid estimating the Parquet Page Length. In this case, the incorrect
Parquet Page Header length can be resolved.

Also, this approach avoids unnecessary deep copy when reading Parquet Page
Headers. Without this PR, deep copying might happen when preparing the
input data from in-consecutive chunks for ThriftBufferedTransport.

This PR also introduced the related Parquet page header unit tests on
small, large, and corrupted Parquet Page Headers.
@liushengxuan liushengxuan force-pushed the shengxuan/fix_read_page_header_length branch from 0e1a458 to 101a05f Compare March 6, 2023 23:17
@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Mar 7, 2023
@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 94905a0.

zhejiangxiaomai pushed a commit to oap-project/velox that referenced this pull request Mar 8, 2023
…g data (#145)

* Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160

* Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
zhejiangxiaomai pushed a commit to zhejiangxiaomai/velox that referenced this pull request Mar 8, 2023
…g data (oap-project#145)

* Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160

* Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
zhejiangxiaomai pushed a commit to zhejiangxiaomai/velox that referenced this pull request Mar 8, 2023
…g data (oap-project#145)

* Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160

* Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
@PHILO-HE
Copy link
Contributor

@liushengxuan, thanks for your work!
Recently, we have verified the patch, no functionality issue. But we observed there is some perf. regression with this patch applied. For TPCH q1, there is around 30% perf. drop. Is there any possible perf. optimization? Thanks!

@Yuhta
Copy link
Contributor

Yuhta commented Mar 21, 2023

@liushengxuan My guess is that it was able to inline the ThriftBufferedTransport::read in old code on hot path, and in new code this is not possible. Do you have time to take a look?

if (bufferEnd_ == bufferStart_) {
const void* buffer;
int32_t size;
inputStream_->Next(&buffer, &size);
bufferStart_ = reinterpret_cast<const char*>(buffer);
bufferEnd_ = bufferStart_ + size;
}
if (bufferEnd_ - bufferStart_ >= sizeof(PageHeader)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should keep this branch for hot path

auto thriftProtocol =
std::make_unique<apache::thrift::protocol::TCompactProtocolT<
thrift::ThriftBufferedTransport>>(thriftTransport);
std::shared_ptr<thrift::ThriftTransport> thriftTransport =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also undo this should probably could make it a little faster

thrift::ThriftBufferedTransport>>(transport);
} else {
dwio::common::readBytes(
std::min<int64_t>(remainingSize, sizeof(PageHeader)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably also need some larger initial read when reading from SeekableInputStream

@Yuhta
Copy link
Contributor

Yuhta commented Mar 22, 2023

cc: @oerling

@Yuhta
Copy link
Contributor

Yuhta commented Mar 22, 2023

Hi @PHILO-HE, @oerling did run TPCH Q1 on our side and we do not see any perf regression. Can you double check and elaborate a little more about the regression you observed?

@oerling
Copy link
Contributor

oerling commented Mar 22, 2023 via email

@liushengxuan
Copy link
Contributor Author

@PHILO-HE @Yuhta @oerling We also did the performance test on our side. We did not observe the regression. Also, I agrees with @oerling ‘s opinion. Even though there is a regression, reading page headers should only account a very small portion and is not able to make a 30% E2E regression.

@PHILO-HE can you please double check you benchmark testing?

Thanks!

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Apr 6, 2023

Thanks to everyone for your comment!
I tried with velox tpch benchmark, yes there is no perf. regression on my side also. Previously, we tested on spark + velox with modified tpch and dataset with date type replaced by string type. Not sure whether the test differences matter. I will give you feedback if it is reproducible. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet PageReader reads the buffer with the incorrect length
6 participants