-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34542][BUILD] Upgrade Parquet to 1.12.0 #31649
Conversation
Test build #135469 has started for PR 31649 at commit |
Kubernetes integration test unable to build dist. exiting with code: 1 |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135512 has finished for PR 31649 at commit
|
Is the UT failure meaningful? Or, do you want to retry simply, @wangyum ? For the file size mismatch failure, we can update it if that is correct new size. |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #135528 has finished for PR 31649 at commit
|
Note: I ran TPCDS queries (sf=20) based on this PR and there was no valid performance regression. |
Thank you, @maropu !
|
Hi, @wangyum . Could you update this PR with Apache Parquet 1.12.0 RC3? |
Done |
Thank you so much! |
Test build #135985 has finished for PR 31649 at commit
|
Retest this please |
retest this please. |
Test build #136017 has finished for PR 31649 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136286 has finished for PR 31649 at commit
|
BTW, after SBT dependency overriding, we have only one failure at
Although it happens in both CIs, I hope it's irrelevant. |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136292 has finished for PR 31649 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @wangyum .
Parquet 1.12.0 is finally released. Could you use the official one by removing the pom file repository change?
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136539 has finished for PR 31649 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @wangyum , @maropu , @srowen .
- @maropu checked that there is no performance regression in TPCDS cases ([SPARK-34542][BUILD] Upgrade Parquet to 1.12.0 #31649 (comment)).
- I hope we can test this in
master
branch actively for Apache Spark 3.2.0 timeframe (July) - We can revert this easily if there is an issue. Also, we can give feedbacks to
Apache Parquet
community to make a maintenance release, 1.12.1.
cc @dbtsai , @holdenk , @viirya , @sunchao , @ggershinsky, @attilapiros , @gatorsmile , @cloud-fan
Merged to master for Apache Spark 3.2.0 on July 2021. |
Parquet 1.12.0 New Feature - PARQUET-41 - Add bloom filters to parquet statistics - PARQUET-1373 - Encryption key management tools - PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory - PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding - PARQUET-1784 - Column-wise configuration - PARQUET-1817 - Crypto Properties Factory - PARQUET-1854 - Properties-Driven Interface to Parquet Encryption Parquet 1.12.0 release notes: https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md - Bloom filters to improve filter performance - ZSTD enhancement No. Existing unit test. Closes apache#31649 from wangyum/SPARK-34542. Lead-authored-by: Yuming Wang <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896) * [SPARK-36726] Upgrade Parquet to 1.12.1 ### What changes were proposed in this pull request? Upgrade Apache Parquet to 1.12.1 ### Why are the changes needed? Parquet 1.12.1 contains the following bug fixes: - PARQUET-2064: Make Range public accessible in RowRanges - PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` - PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding - PARQUET-1633: Fix integer overflow - PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile - PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats - PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase - PARQUET-2078: Failed to read parquet file after writing with the same In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests + a new test for the issue in SPARK-36696 Closes #33969 from sunchao/upgrade-parquet-12.1. Authored-by: Chao Sun <[email protected]> Signed-off-by: DB Tsai <[email protected]> (cherry picked from commit a927b08) * [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0 ### What changes were proposed in this pull request? Parquet 1.12.0 New Feature - PARQUET-41 - Add bloom filters to parquet statistics - PARQUET-1373 - Encryption key management tools - PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory - PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding - PARQUET-1784 - Column-wise configuration - PARQUET-1817 - Crypto Properties Factory - PARQUET-1854 - Properties-Driven Interface to Parquet Encryption Parquet 1.12.0 release notes: https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md ### Why are the changes needed? - Bloom filters to improve filter performance - ZSTD enhancement ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test. Closes #31649 from wangyum/SPARK-34542. Lead-authored-by: Yuming Wang <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit cbffc12) Co-authored-by: Chao Sun <[email protected]> * [HADP-44647] Parquet file based kms client for encryption keys (#897) * [HADP-44647] Parquet file based kms client for encryption keys (#82) create/write parquet encryption table. ``` set spark.sql.parquet.encryption.key.file=/path/to/key/file; create table parquet_encryption(a int, b int, c int) using parquet options ( 'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c', 'parquet.encryption.footer.key' 'footerKey'); ``` read parquet encryption table; ``` set spark.sql.parquet.encryption.key.file=/path/to/key/file; select ... from parquet_encryption ... ``` Will raise another pr for default footerKey. * [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84) * Fix Co-authored-by: fwang12 <[email protected]> Co-authored-by: Chao Sun <[email protected]> Co-authored-by: fwang12 <[email protected]>
What changes were proposed in this pull request?
Parquet 1.12.0 New Feature
Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md
Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test.