
Change native parquet writer to write v1 parquet files #9611

Merged · 1 commit · Oct 20, 2021

Conversation

joshthoward
Member

@joshthoward joshthoward commented Oct 12, 2021

Instead of adding a toggle as in #9497, this PR has the native parquet writer write v1 files by default.

Whether we want to support both v1 and v2 is a longer discussion; this just fixes the known bugs.

fixes #6377

@findepi
Member

findepi commented Oct 13, 2021

I reran CI with tests:hive.

@alexjo2144
Member

There are V2 headers and statistics written in https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java

I'd assume those also need to be changed?

@joshthoward
Member Author

@alexjo2144 If you mean specifically this, I mentioned in #9497 that Spark fails to read a V1 header. Statistics are the same between V1 and V2 according to https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
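To make the point about shared statistics concrete, here is an illustrative sketch (plain Python dataclasses, not the actual Thrift-generated classes) of the field lists that parquet.thrift defines for the v1 and v2 data page headers. Note that both reference the same `Statistics` struct, which is why the statistics themselves did not need to change:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: field lists transcribed from parquet.thrift's
# DataPageHeader and DataPageHeaderV2. Types simplified to Python builtins.

@dataclass
class DataPageHeader:  # v1
    num_values: int
    encoding: str                      # data encoding
    definition_level_encoding: str     # levels encoded inside the page body
    repetition_level_encoding: str
    statistics: Optional[dict] = None  # same Statistics struct as v2

@dataclass
class DataPageHeaderV2:  # v2
    num_values: int
    num_nulls: int
    num_rows: int
    encoding: str                      # data encoding only; levels are always RLE
    definition_levels_byte_length: int # levels stored outside the compressed data
    repetition_levels_byte_length: int
    is_compressed: bool = True
    statistics: Optional[dict] = None  # same Statistics struct as v1
```

The structural difference is in how repetition/definition levels are carried, not in the statistics.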

@findepi findepi requested a review from alexjo2144 October 13, 2021 15:07
@alexjo2144
Member

Spark is probably failing to read the V1 header because something else needs to be updated. There's a comment from @anjalinorwood here implying that this change is not enough: #7953 (comment)

Maybe she or @rdblue can point out the spec differences?

There's also the EncodingStats I added this week which uses DATA_PAGE_V2 to indicate v2 pages: https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java#L187

V1 also includes the rlEncoding and dlEncoding in the ColumnChunkMetaData encodings set, which may be why Spark didn't read the data correctly.
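A minimal sketch of that behaviour (hypothetical helper, not Trino's actual code): under the assumption described above, a v1 writer must also record the level encodings in the column chunk's encodings set, while a v2 writer records only the data encoding because levels are always RLE and stored outside the page payload:

```python
# Illustrative sketch of which encodings a writer would record in
# ColumnChunk metadata for v1 vs. v2 data pages, per the discussion above.
# `column_chunk_encodings` is a hypothetical helper, not a real API.

def column_chunk_encodings(data_encoding: str, page_version: int) -> set:
    if page_version == 1:
        # v1 pages carry repetition/definition levels inside the page
        # body, so their RLE encoding must be listed as well.
        return {data_encoding, "RLE"}
    # v2: only the data encoding is recorded in this sketch.
    return {data_encoding}
```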

@alexjo2144
Member

Comparing writeDataPage and writeDataPageV2 in the parquet-mr implementation shows that change to the encodings, and potentially more; I haven't taken that close a look yet: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L508

@findepi findepi dismissed their stale review October 14, 2021 07:51

per Alex's comments

@joshthoward joshthoward force-pushed the jh/parquet-v1 branch 3 times, most recently from 65dbf87 to 212f297 Compare October 19, 2021 00:07
@joshthoward joshthoward requested a review from findepi October 19, 2021 02:29
Member

@losipiuk losipiuk left a comment

Looks valid, though I did not follow the discussion that led to this stage in depth, so I'm not comfortable ✅-ing it.

Member

@findepi findepi left a comment


skimmed. lgtm but i'm not a Parquet expert.

@findepi
Member

findepi commented Oct 19, 2021

@alexjo2144 @martint @rdblue would you like to take a look?

@joshthoward
Member Author

I'm adding the release blocker label so that this actually gets merged prior to the release.

@findepi
Member

findepi commented Oct 20, 2021

Thanks @alexjo2144 @martint for the review.

@joshthoward it cannot be considered a release blocker.

@findepi findepi merged commit cd52526 into trinodb:master Oct 20, 2021
@findepi findepi mentioned this pull request Oct 20, 2021
@joshthoward joshthoward deleted the jh/parquet-v1 branch October 20, 2021 15:49
Successfully merging this pull request may close these issues.

Native Parquet writer creates files that cannot be read by Hive and Spark
5 participants