Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Parquet writer versions V1 and V2 #20957

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

nmahadevuni
Copy link
Member

@nmahadevuni nmahadevuni commented Sep 25, 2023

Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611

Description

Adds support for writing Parquet data versions V1 and V2 using a config and session property.

Motivation and Context

Fixes #20926

Impact

No external changes. Default Parquet version is set to V2 which is currently used in Presto.

Test Plan

Amended the ParquetTester.java to run round trip tests for both writer versions.

== RELEASE NOTES ==

Hive Changes
* Support for Parquet writer versions V1 and V2 is now added for Hive and Iceberg catalogs. It can be toggled using session property `parquet_writer_version` and config property `hive.parquet.writer.version`. Valid values for these properties are PARQUET_1_0 and PARQUET_2_0. Default is PARQUET_2_0. E.g., set session parquet_writer_version=PARQUET_1_0 / hive.parquet.writer.version=PARQUET_1_0

@nmahadevuni nmahadevuni requested review from shangxinli and a team as code owners September 25, 2023 09:31
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 25, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: nmahadevuni / name: Naveen Mahadevuni (7cbbc0e)

@nmahadevuni nmahadevuni force-pushed the parquet_writer_support_V1_V2 branch 2 times, most recently from 9d2fc6d to 43348ee Compare September 25, 2023 17:04
@nmahadevuni
Copy link
Member Author

@yingsu00 Can you please take a look?

Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nmahadevuni Thanks for the nice work. It'll be best if the release note can include an example so the users know exactly how the session property should be configured. Other than that it's just a couple of nits.

implements ColumnWriter
{
private final Type type;
private final ColumnDescriptor columnDescriptor;
protected final ColumnDescriptor columnDescriptor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the private ones to be after the protected ones


if (getBufferedBytes() >= pageSizeThreshold) {
flushCurrentPageToBuffer();
}
}

protected abstract void writeDefinitionAndRepetitionLevels(ColumnChunk current)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this method to be after all public ones

@yingsu00 yingsu00 requested a review from zhenxiao September 27, 2023 05:05
@yingsu00
Copy link
Contributor

@nmahadevuni there is an easyCLA issue.

@nmahadevuni
Copy link
Member Author

@nmahadevuni there is an easyCLA issue.

That's for the co-author who is not registered in LF easyCLA. There is no way to fix that as my previous issue created with Linux foundation about it was closed saying that its expected and there is no way to remove that.

@nmahadevuni nmahadevuni force-pushed the parquet_writer_support_V1_V2 branch from 43348ee to 86db17d Compare September 27, 2023 10:26
@yingsu00
Copy link
Contributor

@nmahadevuni Make yourself as the main author. If that doesn't work, remove the co-author information and add a line in the commit message "co-authored by ..."

@nmahadevuni nmahadevuni force-pushed the parquet_writer_support_V1_V2 branch from 86db17d to f037a07 Compare September 28, 2023 04:58
@nmahadevuni nmahadevuni force-pushed the parquet_writer_support_V1_V2 branch from f037a07 to 7cbbc0e Compare October 3, 2023 05:55
@nmahadevuni
Copy link
Member Author

@zhenxiao @shangxinli Can you please review this PR?

@yingsu00 yingsu00 merged commit 5434a36 into prestodb:master Oct 5, 2023
@mbasmanova
Copy link
Contributor

This change seems to have collided with another change and together they are causing compilation failures.

See #19983 (comment)

tdcmeehan added a commit to tdcmeehan/presto that referenced this pull request Oct 5, 2023
prestodb#20957 and prestodb#19983 were merged at similar times, and when taken
together broke the build (as changes required by prestodb#20957 aren't
present in prestodb#19983).
@tdcmeehan
Copy link
Contributor

I raised #21043 to fix the build

tdcmeehan added a commit that referenced this pull request Oct 5, 2023
#20957 and #19983 were merged at similar times, and when taken
together broke the build (as changes required by #20957 aren't
present in #19983).
@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023
26 tasks
kaikalur pushed a commit to kaikalur/presto that referenced this pull request Mar 14, 2024
prestodb#20957 and prestodb#19983 were merged at similar times, and when taken
together broke the build (as changes required by prestodb#20957 aren't
present in prestodb#19983).
@majetideepak
Copy link
Collaborator

@nmahadevuni, @yingsu00 I see from the Trino commit in the trinodb/trino#9611 description that Trino switched the default to PARQUET_1_0. Is there a reason why Presto chose PARQUET_2_0?
The reason being Parquet V2 spec has not been finalized.
trinodb/trino#7953 (comment)

@yingsu00
Copy link
Contributor

yingsu00 commented Aug 2, 2024

This PR merely added a config property that allows the user to choose v1 or v2. Before this change, Presto always write V2 and user cannot change it. To avoid disruption to existing users we didn't change the default version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow Parquet writer to write both V1 and V2 files
5 participants