
Change native parquet writer to write v1 parquet files #9611

Merged · 1 commit · Oct 20, 2021

Conversation

joshthoward
Member

@joshthoward joshthoward commented Oct 12, 2021

Instead of adding a toggle as in #9497, this PR has the native parquet writer write v1 files by default.

Whether we want to support both v1 and v2 is a longer discussion; this just fixes the known bugs.

fixes #6377

@findepi
Member

findepi commented Oct 13, 2021

I reran CI with tests:hive.

@alexjo2144
Member

There are V2 headers and statistics written in https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java

I'd assume those also need to be changed?

@joshthoward
Member Author

@alexjo2144 If you mean specifically this, I mentioned in #9497 that Spark fails to read a V1 header. Statistics are the same between V1 and V2 according to https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
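To make the point about shared statistics concrete, here is an illustrative sketch (plain Python dataclasses, not the actual Thrift-generated classes) of the field lists that parquet.thrift defines for the v1 and v2 data page headers. Note that both reference the same `Statistics` struct, which is why the statistics themselves did not need to change:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: field lists transcribed from parquet.thrift's
# DataPageHeader and DataPageHeaderV2. Types simplified to Python builtins.

@dataclass
class DataPageHeader:  # v1
    num_values: int
    encoding: str                      # data encoding
    definition_level_encoding: str     # levels encoded inside the page body
    repetition_level_encoding: str
    statistics: Optional[dict] = None  # same Statistics struct as v2

@dataclass
class DataPageHeaderV2:  # v2
    num_values: int
    num_nulls: int
    num_rows: int
    encoding: str                      # data encoding only; levels are always RLE
    definition_levels_byte_length: int # levels stored outside the compressed data
    repetition_levels_byte_length: int
    is_compressed: bool = True
    statistics: Optional[dict] = None  # same Statistics struct as v1
```

The structural difference is in how repetition/definition levels are carried, not in the statistics.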

@findepi findepi requested a review from alexjo2144 October 13, 2021 15:07
@alexjo2144
Member

Spark is probably failing to read the V1 header because something else needs to be updated. There's a comment from @anjalinorwood here implying that this change is not enough: #7953 (comment)

Maybe she or @rdblue can point out the spec differences?

There's also the EncodingStats I added this week which uses DATA_PAGE_V2 to indicate v2 pages: https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java#L187

V1 also includes the rlEncoding and dlEncoding in the ColumnChunkMetaData encodings set, which may be why Spark didn't read the data correctly.
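A minimal sketch of that behaviour (hypothetical helper, not Trino's actual code): under the assumption described above, a v1 writer must also record the level encodings in the column chunk's encodings set, while a v2 writer records only the data encoding because levels are always RLE and stored outside the page payload:

```python
# Illustrative sketch of which encodings a writer would record in
# ColumnChunk metadata for v1 vs. v2 data pages, per the discussion above.
# `column_chunk_encodings` is a hypothetical helper, not a real API.

def column_chunk_encodings(data_encoding: str, page_version: int) -> set:
    if page_version == 1:
        # v1 pages carry repetition/definition levels inside the page
        # body, so their RLE encoding must be listed as well.
        return {data_encoding, "RLE"}
    # v2: only the data encoding is recorded in this sketch.
    return {data_encoding}
```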

@alexjo2144
Member

Comparing writeDataPage and writeDataPageV2 in the parquet-mr implementation shows that change to the encodings, and potentially more; I haven't taken that close a look yet: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L508

@findepi findepi dismissed their stale review October 14, 2021 07:51

per Alex's comments

@joshthoward joshthoward force-pushed the jh/parquet-v1 branch 3 times, most recently from 65dbf87 to 212f297 Compare October 19, 2021 00:07
@joshthoward joshthoward requested a review from findepi October 19, 2021 02:29
Member

@losipiuk losipiuk left a comment

Looks valid, though I did not follow the discussion that led to this stage in depth, so I'm not comfortable ✅-ing it.

Member

@findepi findepi left a comment


skimmed. lgtm but i'm not a Parquet expert.

@findepi
Member

findepi commented Oct 19, 2021

@alexjo2144 @martint @rdblue would you like to take a look?

@joshthoward
Member Author

I'm adding the release blocker label so that this actually gets merged prior to the release.

@findepi
Member

findepi commented Oct 20, 2021

Thanks @alexjo2144 @martint for the review.

@joshthoward it cannot be considered a release blocker.

@findepi findepi merged commit cd52526 into trinodb:master Oct 20, 2021
@findepi findepi mentioned this pull request Oct 20, 2021
@joshthoward joshthoward deleted the jh/parquet-v1 branch October 20, 2021 15:49
Successfully merging this pull request may close these issues.

Native Parquet writer creates files that cannot be read by Hive and Spark
5 participants