SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. #897
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15240/
val createHLL = (v: V) => new SerializableHyperLogLog(new HyperLogLog(relativeSD)).add(v)
val mergeValueHLL = (hll: SerializableHyperLogLog, v: V) => hll.add(v)
val mergeHLL = (h1: SerializableHyperLogLog, h2: SerializableHyperLogLog) => h1.merge(h2)
val precision = (math.log((1.106 / relativeSD) * (1.106 / relativeSD)) / math.log(2)).toInt
Where does this magic value of 1.106 come from?
I'm not even sure if the math is correct yet. Will update the PR once I confirm it.
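For context on where a constant like this could come from: the classic HyperLogLog analysis gives a relative standard deviation of roughly c / sqrt(m) for m = 2^p registers (with c ≈ 1.04 in the original paper). The line above appears to invert that relationship to derive a precision p from a target relativeSD. A minimal sketch of that inversion, assuming the 1.106 constant from the snippet (the class and method names here are hypothetical, not Spark's actual code):

```java
// Hypothetical sketch: derive an HLL precision p from a target relative
// standard deviation by inverting rsd = c / sqrt(2^p), i.e.
// p = log2((c / rsd)^2). The constant 1.106 is taken from the snippet
// under review; the classic HLL analysis uses approximately 1.04.
public class PrecisionFromRsd {
    static int precisionFor(double relativeSD, double c) {
        double ratio = c / relativeSD;
        // Truncation mirrors the .toInt in the Scala snippet above.
        return (int) (Math.log(ratio * ratio) / Math.log(2));
    }

    public static void main(String[] args) {
        // For a 5% target error this yields p = 8 (256 registers).
        System.out.println(precisionFor(0.05, 1.106)); // prints 8
    }
}
```

Note that truncating rather than rounding up gives the smaller of the two candidate precisions, i.e. slightly more error than requested; rounding up would trade memory for accuracy instead.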
Merged build triggered.
Merged build started.
Ok I pushed a new version. This is no longer work in progress.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15315/
@pwendell Jenkins failed due to binary compatibility for SerializableHyperLogLog, which is no longer needed ...
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): JavaRDD[(K, Long)] = {
  rdd.countApproxDistinctByKey(relativeSD, numPartitions)
@Deprecated
def countApproxDistinctByKey(relativeSD: Double): JavaPairRDD[K, Long] = {
Note that I changed the return type from JavaRDD[(K, Long)] to JavaPairRDD[K, Long], because that is what it should have been.
However, in order to maintain complete API stability, I can change it back and just deprecate the old methods. The new methods should certainly return JavaPairRDD.
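The compatibility pattern being discussed — keep the old method with its old (incorrect) return type but mark it deprecated and delegate to the corrected one — can be sketched as follows. This is a hypothetical, heavily simplified illustration, not Spark's actual code: a Map stands in for JavaPairRDD[K, Long] and a List of pairs stands in for JavaRDD[(K, Long)]:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompatExample {
    // New method with the corrected return type
    // (Map<K, Long> stands in for JavaPairRDD[K, Long]).
    static Map<String, Long> countApproxDistinctByKey(List<String> keys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String k : keys) counts.merge(k, 1L, Long::sum);
        return counts;
    }

    // Old method kept only for source/binary compatibility; it delegates to
    // the new method and converts back to the old, less convenient shape
    // (List of pairs stands in for JavaRDD[(K, Long)]).
    @Deprecated
    static List<SimpleEntry<String, Long>> countApproxDistinctByKeyOld(List<String> keys) {
        List<SimpleEntry<String, Long>> out = new ArrayList<>();
        countApproxDistinctByKey(keys).forEach((k, v) -> out.add(new SimpleEntry<>(k, v)));
        return out;
    }
}
```

Existing callers keep compiling (with a deprecation warning), while new callers get the type the method should have returned all along.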
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15338/
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15339/
Ok I think the latest push should resolve the Mima problems ...
Merged build started.
 * and increase accuracy when the cardinality is small.
 *
 * @param p The precision value for the normal set.
 *          `p` must be a value between 4 and `sp` (32 max).
add "if `sp` is not zero"
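The constraint the doc comment is converging on — p at least 4, sp at most 32, and p not greater than sp, with the sparse precision checked only when sp is nonzero (sp == 0 disables sparse mode) — can be sketched as a validation routine. This is a hypothetical illustration of the documented bounds, not streamlib's or Spark's actual checking code:

```java
// Hypothetical validation sketch for the precision parameters described in
// the doc comment above: 4 <= p, sp <= 32, and p <= sp "if sp is not zero"
// (sp == 0 means the sparse representation is disabled).
class PrecisionCheck {
    static void require(boolean cond, String msg) {
        if (!cond) throw new IllegalArgumentException(msg);
    }

    static void validate(int p, int sp) {
        require(p >= 4, "p must be at least 4");
        require(sp <= 32, "sp cannot exceed 32");
        if (sp != 0) {
            require(p <= sp, "p cannot be greater than sp if sp is not zero");
        }
    }
}
```

For example, validate(10, 20) and validate(10, 0) pass, while validate(3, 20) is rejected because p is below the minimum of 4.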
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
Merged build triggered.
Merged build started.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
LGTM. Merged.
Merged build finished. All automated tests passed.
… of HyperLogLog.

I also corrected some errors made in the previous HLL count-approximate API, including that relativeSD wasn't really a measure of error (and we used it to test error bounds in test results).

Author: Reynold Xin <[email protected]>

Closes apache#897 from rxin/hll and squashes the following commits:

4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.