[SPARK-34542][BUILD] Upgrade Parquet to 1.12.0 #31649

wangyum · 2021-02-25T15:05:50Z

What changes were proposed in this pull request?

New features in Parquet 1.12.0:

PARQUET-41 - Add bloom filters to parquet statistics
PARQUET-1373 - Encryption key management tools
PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
PARQUET-1784 - Column-wise configuration
PARQUET-1817 - Crypto Properties Factory
PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md
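
As a rough illustration of PARQUET-1622 from the feature list above, the sketch below shows the idea behind BYTE_STREAM_SPLIT: byte k of every fixed-width value is moved into stream k, and the streams are concatenated. It is plain Python written for this note, not code from parquet-mr; the function names and the little-endian float32 default are assumptions made for the example.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte k of every value into stream k, then concatenate the streams."""
    width = struct.calcsize(fmt)  # 4 bytes for a little-endian float32
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] picks byte k of each packed value, i.e. stream k.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Inverse of the encoder: re-interleave the streams back into values."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    streams = [data[k * count:(k + 1) * count] for k in range(width)]
    # zip(*streams) yields one tuple of byte values per original value.
    return [struct.unpack(fmt, bytes(parts))[0] for parts in zip(*streams)]


values = [1.0, 2.5, -3.25]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The encoding does not shrink the data by itself. Grouping the bytes that tend to be similar across values (for example the sign and exponent bytes) is meant to help a general-purpose compressor applied afterwards.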

Why are the changes needed?

Bloom filters to improve filter performance
ZSTD enhancement
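
To make the Bloom filter point above concrete, here is a minimal sketch of why a per-row-group Bloom filter can speed up equality filters: a reader can skip any row group whose filter says the value is definitely absent. This is a toy filter written for this note, assuming SHA-256 based hashing and a hypothetical `row_groups_to_scan` helper. Parquet's real filter differs in layout and hashing, and none of these names come from parquet-mr or Spark.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            # Derive independent-looking hashes by prefixing a one-byte seed.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def row_groups_to_scan(filters, wanted):
    """Hypothetical helper: indexes of row groups that cannot be skipped."""
    return [i for i, f in enumerate(filters) if f.might_contain(wanted)]


# Two pretend row groups holding ids 0..99 and 100..199.
filters = [BloomFilter(), BloomFilter()]
for v in range(100):
    filters[0].add(v)
for v in range(100, 200):
    filters[1].add(v)

candidates = row_groups_to_scan(filters, 150)  # always includes row group 1
```

Because false positives are possible, the result may occasionally include a row group that does not hold the value. It never omits one that does, which is what makes skipping safe.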

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

SparkQA · 2021-02-25T15:52:58Z

Test build #135469 has started for PR 31649 at commit 799364e.

SparkQA · 2021-02-25T16:06:15Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40049/

SparkQA · 2021-02-26T15:31:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40093/

SparkQA · 2021-02-26T16:01:23Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40093/

SparkQA · 2021-02-26T16:46:28Z

Test build #135512 has finished for PR 31649 at commit 741eb21.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-02-27T00:02:59Z

Is the UT failure meaningful, or do you simply want to retry, @wangyum?

For the file size mismatch failure, we can update the expected value if the new size is correct.

SparkQA · 2021-02-27T02:15:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40109/

SparkQA · 2021-02-27T02:24:04Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40109/

SparkQA · 2021-02-27T04:30:53Z

Test build #135528 has finished for PR 31649 at commit e7b14c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-03-01T12:33:09Z

Note: I ran TPCDS queries (sf=20) based on this PR and there was no valid performance regression.

dongjoon-hyun · 2021-03-02T11:05:44Z

Thank you, @maropu !

> Note: I ran TPCDS queries (sf=20) based on this PR and there was no valid performance regression.

dongjoon-hyun · 2021-03-11T18:23:24Z

Hi, @wangyum . Could you update this PR with Apache Parquet 1.12.0 RC3?

wangyum · 2021-03-12T00:55:53Z

> Could you update this PR with Apache Parquet 1.12.0 RC3?

Done

dongjoon-hyun · 2021-03-12T00:58:17Z

Thank you so much!

SparkQA · 2021-03-12T03:45:47Z

Test build #135985 has finished for PR 31649 at commit 4b4bd97.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* inner class (See SPARK-34607 for details). This issue has already been fixed in jdk9+, so
//getSimpleBinaryName() returns null if a given class is a top-level class,
public class JavaModelSelectionViaRandomHyperparametersExample
class GangliaSink(
case class Limits[T: Numeric](x: T, y: T)
abstract class Generator[T: Numeric]
class ParamRandomBuilder extends ParamGridBuilder
class ParamRandomBuilder(ParamGridBuilder):
sealed trait PartitionSpec extends LeafExpression with Unevaluable
case class Product(child: Expression)
trait ExtractValue extends Expression
trait V2PartitionCommand extends Command
case class AnalyzeTables(
case class ShowCreateTable(
case class TruncateTable(table: LogicalPlan) extends Command
case class TruncatePartition(
public class ParquetFooterReader
case class AnalyzeTablesCommand(
case class AddArchiveCommand(path: String) extends RunnableCommand
case class ListArchivesCommand(archives: Seq[String] = Seq.empty[String]) extends RunnableCommand
case class ShowCreateTableCommand(
case class ShowCreateTableAsSerdeCommand(
case class TruncatePartitionExec(
trait HashJoin extends JoinCodegenSupport
trait JoinCodegenSupport extends CodegenSupport with BaseJoinExec

dongjoon-hyun · 2021-03-12T05:36:55Z

Retest this please

wangyum · 2021-03-12T18:35:01Z

retest this please.

SparkQA · 2021-03-12T21:30:38Z

Test build #136017 has finished for PR 31649 at commit 4b4bd97.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* inner class (See SPARK-34607 for details). This issue has already been fixed in jdk9+, so
//getSimpleBinaryName() returns null if a given class is a top-level class,
public class JavaModelSelectionViaRandomHyperparametersExample
class GangliaSink(
case class Limits[T: Numeric](x: T, y: T)
abstract class Generator[T: Numeric]
class ParamRandomBuilder extends ParamGridBuilder
class ParamRandomBuilder(ParamGridBuilder):
sealed trait PartitionSpec extends LeafExpression with Unevaluable
case class Product(child: Expression)
trait ExtractValue extends Expression
trait V2PartitionCommand extends Command
case class AnalyzeTables(
case class ShowCreateTable(
case class TruncateTable(table: LogicalPlan) extends Command
case class TruncatePartition(
public class ParquetFooterReader
case class AnalyzeTablesCommand(
case class AddArchiveCommand(path: String) extends RunnableCommand
case class ListArchivesCommand(archives: Seq[String] = Seq.empty[String]) extends RunnableCommand
case class ShowCreateTableCommand(
case class ShowCreateTableAsSerdeCommand(
case class TruncatePartitionExec(
trait HashJoin extends JoinCodegenSupport
trait JoinCodegenSupport extends CodegenSupport with BaseJoinExec

SparkQA · 2021-03-19T07:36:10Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40829/

SparkQA · 2021-03-19T07:44:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40829/

SparkQA · 2021-03-20T15:55:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40868/

SparkQA · 2021-03-20T16:00:04Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40868/

SparkQA · 2021-03-20T17:31:07Z

Test build #136286 has finished for PR 31649 at commit 0ff5114.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-03-20T20:38:01Z

BTW, after overriding the SBT dependency, only one failure remains, in SparkSubmitUtilsSuite. Thanks, @wangyum.
cc @iemejia

Although it happens in both CIs, I hope it's unrelated.

SparkQA · 2021-03-21T02:42:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40874/

SparkQA · 2021-03-21T02:51:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40874/

SparkQA · 2021-03-21T04:00:13Z

Test build #136292 has finished for PR 31649 at commit 145dba0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-03-25 (review)

Hi, @wangyum.
Parquet 1.12.0 is finally released. Could you switch to the official release by removing the repository change in the pom file?

SparkQA · 2021-03-26T02:15:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/

SparkQA · 2021-03-26T03:08:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/

SparkQA · 2021-03-26T04:48:05Z

Test build #136539 has finished for PR 31649 at commit 8b58e29.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-03-26 (approval)

+1, LGTM. Thank you, @wangyum , @maropu , @srowen .

@maropu checked that there is no performance regression in the TPCDS cases (see the comment above in this PR).
I hope we can actively test this in the master branch during the Apache Spark 3.2.0 timeframe (July).
We can revert this easily if there is an issue. We can also give feedback to the Apache Parquet community for a maintenance release, 1.12.1.

cc @dbtsai , @holdenk , @viirya , @sunchao , @ggershinsky, @attilapiros , @gatorsmile , @cloud-fan

dongjoon-hyun · 2021-03-27T14:56:15Z

Merged to master for Apache Spark 3.2.0 (July 2021).

Parquet 1.12.0 New Feature

- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes: https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

- Bloom filters to improve filter performance
- ZSTD enhancement

No.

Existing unit test.

Closes apache#31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

### What changes were proposed in this pull request?

Parquet 1.12.0 New Feature
- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes: https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

### Why are the changes needed?

- Bloom filters to improve filter performance
- ZSTD enhancement

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

create/write parquet encryption table.

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;
create table parquet_encryption(a int, b int, c int)
using parquet
options (
  'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
  'parquet.encryption.footer.key' 'footerKey');
```

read parquet encryption table;

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;
select ... from parquet_encryption ...
```

Will raise another pr for default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>
Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>

Upgrade Parquet to 1.12.0

799364e

wangyum marked this pull request as draft February 25, 2021 15:05

github-actions bot added the BUILD label Feb 25, 2021

wangyum added the DEPLOY label Feb 26, 2021

git commit --allow-empty -m "Trigger GithubAction"

741eb21

Update StatisticsSuite.scala

e7b14c8

github-actions bot added the SQL label Feb 27, 2021

srowen reviewed Feb 27, 2021

pom.xml (outdated, resolved)

Merge remote-tracking branch 'upstream/master' into SPARK-34542

4b4bd97

wangyum added 3 commits March 17, 2021 21:02

Merge remote-tracking branch 'upstream/master' into SPARK-34542

f98306b

Merge remote-tracking branch 'upstream/master' into SPARK-34542

ce0f201

Avro to 1.10.2 and jackson to 2.12.2

cab8d7e

Merge remote-tracking branch 'upstream/master' into SPARK-34542

948d64e

fix

0ff5114

Merge remote-tracking branch 'upstream/master' into SPARK-34542

145dba0

dongjoon-hyun reviewed Mar 25, 2021

wangyum added 2 commits March 26, 2021 08:43

Merge remote-tracking branch 'upstream/master' into SPARK-34542

d51eb7e

fix

8b58e29

wangyum marked this pull request as ready for review March 26, 2021 00:52

dongjoon-hyun approved these changes Mar 26, 2021

dongjoon-hyun closed this in cbffc12 Mar 27, 2021

wangyum deleted the SPARK-34542 branch March 27, 2021 15:00
