[SPARK-34542][BUILD] Upgrade Parquet to 1.12.0 #31649

Closed
wants to merge 12 commits into from

wangyum
Member

@wangyum wangyum commented Feb 25, 2021

What changes were proposed in this pull request?

Parquet 1.12.0 New Feature

  • PARQUET-41 - Add bloom filters to parquet statistics
  • PARQUET-1373 - Encryption key management tools
  • PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
  • PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
  • PARQUET-1784 - Column-wise configuration
  • PARQUET-1817 - Crypto Properties Factory
  • PARQUET-1854 - Properties-Driven Interface to Parquet Encryption
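
To make the BYTE_STREAM_SPLIT item (PARQUET-1622) concrete, here is a minimal sketch of the idea for 4-byte floats: byte k of every value is gathered into stream k, and the streams are concatenated. The function names are hypothetical and this is not parquet-mr's implementation; it only illustrates the byte layout, which tends to compress better because similar bytes end up next to each other.

```python
import struct

WIDTH = 4  # bytes per float32 value


def byte_stream_split_encode(values):
    """Scatter the bytes of float32 values into WIDTH streams, then concatenate."""
    raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    # Stream k holds byte k of every value.
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_split_decode(data):
    """Inverse of byte_stream_split_encode."""
    count = len(data) // WIDTH
    streams = [data[k * count:(k + 1) * count] for k in range(WIDTH)]
    # Value i is rebuilt from byte i of each stream, in stream order.
    raw = bytes(streams[k][i] for i in range(count) for k in range(WIDTH))
    return list(struct.unpack(f"<{count}f", raw))


encoded = byte_stream_split_encode([1.0, 2.0])
print(encoded.hex())  # 0000000080003f40
print(byte_stream_split_decode(encoded))  # [1.0, 2.0]
```

The encoding itself does not shrink the data; any size benefit comes from the compression codec applied afterwards.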

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

Why are the changes needed?

  • Bloom filters to improve filter performance
  • ZSTD enhancement
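
As a rough illustration of why Bloom filters can speed up equality filters, the sketch below builds one filter per row group and skips every row group whose filter rules the value out. It is a simplified stand-in: the Parquet format specifies a split-block Bloom filter with xxHash, whereas this sketch uses a classic Bloom filter with SHA-256 from the standard library. All class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Classic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            # A different seed byte gives a different hash function.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_filters(row_groups):
    """Build one Bloom filter per row group (each row group is a list of values)."""
    filters = []
    for values in row_groups:
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        filters.append(bloom)
    return filters


def row_groups_to_scan(filters, wanted):
    """Indexes of row groups that cannot be ruled out for `column = wanted`."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(wanted)]


filters = build_filters([[1, 2, 3], [10, 20, 30]])
print(row_groups_to_scan(filters, 20))  # always includes row group 1
```

A row group that does contain the value is never skipped; a row group that does not contain it is usually, but not always, skipped.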

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit test.

@wangyum wangyum marked this pull request as draft February 25, 2021 15:05
@SparkQA

SparkQA commented Feb 25, 2021

Test build #135469 has started for PR 31649 at commit 799364e.

@SparkQA

SparkQA commented Feb 25, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40049/

@SparkQA

SparkQA commented Feb 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40093/

@SparkQA

SparkQA commented Feb 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40093/

@SparkQA

SparkQA commented Feb 26, 2021

Test build #135512 has finished for PR 31649 at commit 741eb21.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 27, 2021

Is the UT failure meaningful, or do you simply want to retry, @wangyum?

For the file size mismatch failure, we can update the expected value if that is the correct new size.

@github-actions github-actions bot added the SQL label Feb 27, 2021
@SparkQA

SparkQA commented Feb 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40109/

@SparkQA

SparkQA commented Feb 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40109/

@SparkQA

SparkQA commented Feb 27, 2021

Test build #135528 has finished for PR 31649 at commit e7b14c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 1, 2021

Note: I ran TPCDS queries (sf=20) based on this PR and there was no valid performance regression.

@dongjoon-hyun
Member

Thank you, @maropu !

> Note: I ran TPCDS queries (sf=20) based on this PR and there was no valid performance regression.

@dongjoon-hyun
Member

Hi, @wangyum . Could you update this PR with Apache Parquet 1.12.0 RC3?

@wangyum
Member Author

wangyum commented Mar 12, 2021

> Could you update this PR with Apache Parquet 1.12.0 RC3?

Done

@dongjoon-hyun
Member

Thank you so much!

@SparkQA

SparkQA commented Mar 12, 2021

Test build #135985 has finished for PR 31649 at commit 4b4bd97.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * inner class (See SPARK-34607 for details). This issue has already been fixed in jdk9+, so
  • //getSimpleBinaryName() returns null if a given class is a top-level class,
  • public class JavaModelSelectionViaRandomHyperparametersExample
  • class GangliaSink(
  • case class Limits[T: Numeric](x: T, y: T)
  • abstract class Generator[T: Numeric]
  • class ParamRandomBuilder extends ParamGridBuilder
  • class ParamRandomBuilder(ParamGridBuilder):
  • sealed trait PartitionSpec extends LeafExpression with Unevaluable
  • case class Product(child: Expression)
  • trait ExtractValue extends Expression
  • trait V2PartitionCommand extends Command
  • case class AnalyzeTables(
  • case class ShowCreateTable(
  • case class TruncateTable(table: LogicalPlan) extends Command
  • case class TruncatePartition(
  • public class ParquetFooterReader
  • case class AnalyzeTablesCommand(
  • case class AddArchiveCommand(path: String) extends RunnableCommand
  • case class ListArchivesCommand(archives: Seq[String] = Seq.empty[String]) extends RunnableCommand
  • case class ShowCreateTableCommand(
  • case class ShowCreateTableAsSerdeCommand(
  • case class TruncatePartitionExec(
  • trait HashJoin extends JoinCodegenSupport
  • trait JoinCodegenSupport extends CodegenSupport with BaseJoinExec

@dongjoon-hyun
Member

Retest this please

@wangyum
Member Author

wangyum commented Mar 12, 2021

retest this please.

@SparkQA

SparkQA commented Mar 12, 2021

Test build #136017 has finished for PR 31649 at commit 4b4bd97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * inner class (See SPARK-34607 for details). This issue has already been fixed in jdk9+, so
  • //getSimpleBinaryName() returns null if a given class is a top-level class,
  • public class JavaModelSelectionViaRandomHyperparametersExample
  • class GangliaSink(
  • case class Limits[T: Numeric](x: T, y: T)
  • abstract class Generator[T: Numeric]
  • class ParamRandomBuilder extends ParamGridBuilder
  • class ParamRandomBuilder(ParamGridBuilder):
  • sealed trait PartitionSpec extends LeafExpression with Unevaluable
  • case class Product(child: Expression)
  • trait ExtractValue extends Expression
  • trait V2PartitionCommand extends Command
  • case class AnalyzeTables(
  • case class ShowCreateTable(
  • case class TruncateTable(table: LogicalPlan) extends Command
  • case class TruncatePartition(
  • public class ParquetFooterReader
  • case class AnalyzeTablesCommand(
  • case class AddArchiveCommand(path: String) extends RunnableCommand
  • case class ListArchivesCommand(archives: Seq[String] = Seq.empty[String]) extends RunnableCommand
  • case class ShowCreateTableCommand(
  • case class ShowCreateTableAsSerdeCommand(
  • case class TruncatePartitionExec(
  • trait HashJoin extends JoinCodegenSupport
  • trait JoinCodegenSupport extends CodegenSupport with BaseJoinExec

@SparkQA

SparkQA commented Mar 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40829/

@SparkQA

SparkQA commented Mar 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40829/

@SparkQA

SparkQA commented Mar 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40868/

@SparkQA

SparkQA commented Mar 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40868/

@SparkQA

SparkQA commented Mar 20, 2021

Test build #136286 has finished for PR 31649 at commit 0ff5114.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 20, 2021

BTW, after overriding the SBT dependency, only one failure remains, in SparkSubmitUtilsSuite. Thanks, @wangyum.
cc @iemejia

Although it happens in both CIs, I hope it's irrelevant.

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40874/

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40874/

@SparkQA

SparkQA commented Mar 21, 2021

Test build #136292 has finished for PR 31649 at commit 145dba0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @wangyum .
Parquet 1.12.0 is finally released. Could you switch to the official release by removing the repository change in the pom file?

@wangyum wangyum marked this pull request as ready for review March 26, 2021 00:52
@SparkQA

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/

@SparkQA

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/

@SparkQA

SparkQA commented Mar 26, 2021

Test build #136539 has finished for PR 31649 at commit 8b58e29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @wangyum, @maropu, @srowen.

  • @maropu checked that there is no performance regression in the TPCDS cases (see the comment earlier in #31649).
  • I hope we can actively test this in the master branch during the Apache Spark 3.2.0 timeframe (July).
  • We can revert this easily if there is an issue. We can also give feedback to the Apache Parquet community so that they can make a maintenance release, 1.12.1.

cc @dbtsai , @holdenk , @viirya , @sunchao , @ggershinsky, @attilapiros , @gatorsmile , @cloud-fan

@dongjoon-hyun
Member

Merged to master for Apache Spark 3.2.0 (July 2021 timeframe).

@wangyum wangyum deleted the SPARK-34542 branch March 27, 2021 15:00
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
Closes apache#31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
wangyum added a commit that referenced this pull request May 26, 2023
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

Create and write a Parquet encryption table:
```sql
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

create table parquet_encryption(a int, b int, c int)
using parquet
options (
'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
'parquet.encryption.footer.key' 'footerKey');
```
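
The value of 'parquet.encryption.column.keys' above maps key IDs to column lists. As a small sketch of how such a string can be read (a hypothetical helper, not the parser parquet-mr itself uses), the function below turns it into a column-to-key mapping, assuming the 'keyId: col, col; keyId: col' shape shown in the example.

```python
def parse_column_keys(spec):
    """Map each column name to its key ID from 'keyId: col, col; keyId: col'."""
    column_to_key = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        key_id, sep, columns = entry.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in entry {entry!r}")
        key_id = key_id.strip()
        for column in columns.split(","):
            column = column.strip()
            if not column:
                continue
            if column in column_to_key:
                raise ValueError(f"column {column!r} is assigned to more than one key")
            column_to_key[column] = key_id
    return column_to_key


print(parse_column_keys("columnKey1: a, b; columnKey2: c"))
# {'a': 'columnKey1', 'b': 'columnKey1', 'c': 'columnKey2'}
```

With the table definition above, columns a and b would be encrypted with columnKey1 and column c with columnKey2.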

Read the Parquet encryption table:

```sql
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

select ... from parquet_encryption ...
```

A separate PR will be raised for a default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>