Refactor Thrift Transport for Parquet Metadata Access #4160

Contributor

@liushengxuan liushengxuan commented Mar 1, 2023

This PR refactors the Thrift transport used to access Parquet metadata. ThriftTransport becomes an interface with two implementations: ThriftBufferedTransport and ThriftStreamingTransport.

ThriftStreamingTransport takes a SeekableInputStream as the input for Thrift parsing, which suits Parquet page header parsing. It reduces the deep copying in readPageHeader() and is a prerequisite for fixing the incorrect page header length issue.
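
To illustrate the streaming idea, here is a minimal sketch that does not use the Thrift library or Velox's actual classes. `ChunkedStream`, `TransportSketch`, `StreamingTransportSketch`, `readVarint`, and `bytesConsumed` are all hypothetical names invented for this example; the real transport wraps a SeekableInputStream. The sketch assumes the stream hands out one chunk at a time, and shows a decoder pulling bytes on demand across a chunk boundary instead of first copying the whole header into one buffer. Counting consumed bytes is an illustrative addition showing how a header length could be measured rather than guessed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked input stream: each next() call exposes
// one internal chunk without copying it.
class ChunkedStream {
 public:
  explicit ChunkedStream(std::vector<std::string> chunks)
      : chunks_(std::move(chunks)) {}

  // Returns false at end of stream; otherwise points data/size at the next
  // chunk.
  bool next(const char** data, int* size) {
    if (index_ >= chunks_.size()) {
      return false;
    }
    *data = chunks_[index_].data();
    *size = static_cast<int>(chunks_[index_].size());
    ++index_;
    return true;
  }

 private:
  std::vector<std::string> chunks_;
  std::size_t index_ = 0;
};

// Simplified transport interface (assumption): a decoder only needs read().
class TransportSketch {
 public:
  virtual ~TransportSketch() = default;
  virtual void read(uint8_t* out, uint32_t len) = 0;
};

// Pulls bytes from the stream on demand and counts how many were consumed.
class StreamingTransportSketch : public TransportSketch {
 public:
  explicit StreamingTransportSketch(ChunkedStream* stream) : stream_(stream) {
    assert(stream != nullptr);
  }

  void read(uint8_t* out, uint32_t len) override {
    while (len > 0) {
      if (available_ == 0) {
        // Current chunk is exhausted: fetch the next one (it may be empty).
        int size = 0;
        if (!stream_->next(&cursor_, &size)) {
          throw std::runtime_error("unexpected end of stream");
        }
        available_ = static_cast<uint32_t>(size);
        continue;
      }
      const uint32_t n = std::min(len, available_);
      std::memcpy(out, cursor_, n);
      out += n;
      cursor_ += n;
      available_ -= n;
      len -= n;
      consumed_ += n;
    }
  }

  uint64_t bytesConsumed() const { return consumed_; }

 private:
  ChunkedStream* stream_;
  const char* cursor_ = nullptr;
  uint32_t available_ = 0;
  uint64_t consumed_ = 0;
};

// Decodes an unsigned LEB128 varint one byte at a time, the variable-length
// integer encoding that Thrift's compact protocol builds on.
uint64_t readVarint(TransportSketch& transport) {
  uint64_t value = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    uint8_t byte = 0;
    transport.read(&byte, 1);
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      return value;
    }
  }
  throw std::runtime_error("varint too long");
}
```

In this sketch, a varint split across two chunks decodes correctly, and `bytesConsumed()` reports exactly the bytes the decoder used, leaving any following page data untouched in the stream.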

ThriftBufferedTransport takes a contiguous memory buffer as the input for Thrift parsing. It suits Parquet footer parsing: the footer sits at the end of the file, so the reader must work backward from the end, handling the PAR1 magic and the footer length before it reaches the footer itself.
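
The backward walk can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `BufferedTransportSketch`, `FooterRange`, and `locateFooter` are hypothetical names, and the whole file is assumed to be in memory already. It relies only on the Parquet file layout, in which a file ends with the serialized footer, a 4-byte little-endian footer length, and the 4-byte magic `PAR1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Simplified transport over a contiguous, caller-owned buffer (assumption:
// the buffer outlives the transport).
class BufferedTransportSketch {
 public:
  BufferedTransportSketch(const uint8_t* data, std::size_t size)
      : data_(data), size_(size) {
    assert(data != nullptr || size == 0);
  }

  // Copies len bytes out of the buffer and advances the read offset.
  void read(uint8_t* out, uint32_t len) {
    if (len > size_ - offset_) {
      throw std::runtime_error("read past end of buffer");
    }
    std::memcpy(out, data_ + offset_, len);
    offset_ += len;
  }

 private:
  const uint8_t* data_;
  std::size_t size_;
  std::size_t offset_ = 0;
};

struct FooterRange {
  std::size_t offset;  // Where the serialized footer starts in the file.
  uint32_t length;     // Footer size in bytes.
};

// Walks backward from the end of the file: magic first, then footer length,
// then the footer itself.
FooterRange locateFooter(const uint8_t* file, std::size_t fileSize) {
  constexpr std::size_t kTrailerSize = 8;  // 4-byte length + 4-byte magic.
  if (fileSize < kTrailerSize) {
    throw std::runtime_error("file too small to be Parquet");
  }
  const uint8_t* trailer = file + fileSize - kTrailerSize;
  if (std::memcmp(trailer + 4, "PAR1", 4) != 0) {
    throw std::runtime_error("missing PAR1 magic");
  }
  // The footer length is stored little-endian.
  const uint32_t length = static_cast<uint32_t>(trailer[0]) |
      (static_cast<uint32_t>(trailer[1]) << 8) |
      (static_cast<uint32_t>(trailer[2]) << 16) |
      (static_cast<uint32_t>(trailer[3]) << 24);
  if (length > fileSize - kTrailerSize) {
    throw std::runtime_error("footer length exceeds file size");
  }
  return FooterRange{fileSize - kTrailerSize - length, length};
}
```

Once the range is known, a `BufferedTransportSketch` constructed over `file + range.offset` with `range.length` bytes gives a decoder the footer as a single contiguous span.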

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 1, 2023
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch 5 times, most recently from 43480e7 to 3373687 Compare March 1, 2023 02:08
@liushengxuan liushengxuan changed the title [WIP] Refactor Thrift Transport Refactor Thrift Transport for Parquet Metadata Access Mar 1, 2023
@liushengxuan
Contributor Author

@Yuhta @yingsu00

@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch from 3373687 to 0758a1d Compare March 1, 2023 05:46
@Yuhta Yuhta self-requested a review March 1, 2023 14:33
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch 3 times, most recently from c59787d to b18a681 Compare March 1, 2023 19:23
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch from b18a681 to 531f3b4 Compare March 1, 2023 22:04
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch from 531f3b4 to 9471f36 Compare March 2, 2023 08:16
@yingsu00 yingsu00 self-requested a review March 2, 2023 09:51
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch 2 times, most recently from 79f7029 to d9560c6 Compare March 2, 2023 10:43
@liushengxuan liushengxuan requested review from leoluan2009, yingsu00 and Yuhta and removed request for leoluan2009, yingsu00 and Yuhta March 2, 2023 10:48
@liushengxuan liushengxuan requested review from Yuhta, yingsu00 and leoluan2009 and removed request for yingsu00 and Yuhta March 2, 2023 10:49
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch 3 times, most recently from ad03558 to 4ef5b02 Compare March 3, 2023 07:47
Collaborator

@yingsu00 yingsu00 left a comment

@liushengxuan Will you be able to update the footer reading to use the new transport as well? If so, we don't need to keep the old transport.

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch from 4ef5b02 to 6204515 Compare March 3, 2023 18:00
@liushengxuan liushengxuan force-pushed the shengxuan/refactor_thrift_transport branch from 6204515 to df4a80c Compare March 3, 2023 18:01
@facebook-github-bot
Contributor

@Yuhta merged this pull request in 39977b1.

@liushengxuan liushengxuan deleted the shengxuan/refactor_thrift_transport branch March 4, 2023 05:34
@liushengxuan liushengxuan restored the shengxuan/refactor_thrift_transport branch March 4, 2023 05:34
zhejiangxiaomai pushed a commit to oap-project/velox that referenced this pull request Mar 8, 2023
…g data (#145)

* Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160

* Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
zhejiangxiaomai pushed a commit to zhejiangxiaomai/velox that referenced this pull request Mar 8, 2023
…g data (oap-project#145)

* Port a patch: Refactor Thrift Transport for Parquet Metadata Access facebookincubator#4160

* Port a patch: Read Parquet Page Header with ThriftStreamingTransport to Fix the Incorrect Header Length facebookincubator#4108
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Mar 27, 2023