Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. #897

Closed
wants to merge 15 commits into from

Conversation

rxin
Copy link
Contributor

@rxin rxin commented May 28, 2014

I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15240/

val createHLL = (v: V) => new SerializableHyperLogLog(new HyperLogLog(relativeSD)).add(v)
val mergeValueHLL = (hll: SerializableHyperLogLog, v: V) => hll.add(v)
val mergeHLL = (h1: SerializableHyperLogLog, h2: SerializableHyperLogLog) => h1.merge(h2)
val precision = (math.log((1.106 / relativeSD) * (1.106 / relativeSD)) / math.log(2)).toInt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this magic value of 1.106 come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even sure if the math is correct yet. Will update the PR once I confirm it.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@rxin rxin changed the title [WIP] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. May 31, 2014
@rxin
Copy link
Contributor Author

rxin commented May 31, 2014

Ok I pushed a new version. This is no longer work in progress.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15315/

@rxin
Copy link
Contributor Author

rxin commented May 31, 2014

@pwendell Jenkins failed due to binary compatibility for SerializableHyperLogLog, which is no longer needed ...

def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): JavaRDD[(K, Long)] = {
rdd.countApproxDistinctByKey(relativeSD, numPartitions)
@Deprecated
def countApproxDistinctByKey(relativeSD: Double): JavaPairRDD[K, Long] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I changed the return type from JavaRDD[(K, Long)] to JavaPairRDD[K, Long], because that is what it should've been.

However, in order to maintain complete API stability, I can change it back and just deprecated the old methods. The new methods certainly should return JavaPairRDD.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15338/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15339/

@rxin
Copy link
Contributor Author

rxin commented Jun 2, 2014

Ok I think the latest push should resolve the Mima problems ...

@AmplabJenkins
Copy link

Merged build started.

* and increase accuracy when the cardinality is small.
*
*@param p The precision value for the normal set.
* `p` must be a value between 4 and `sp` (32 max).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add "if sp is not zero"

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15419/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15420/

@mengxr
Copy link
Contributor

mengxr commented Jun 4, 2014

LGTM. Merged.

@asfgit asfgit closed this in 1faef14 Jun 4, 2014
@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

1 similar comment
@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15424/

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15422/

@rxin rxin deleted the hll branch June 4, 2014 01:54
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
… of HyperLogLog.

I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).

Author: Reynold Xin <[email protected]>

Closes apache#897 from rxin/hll and squashes the following commits:

4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… of HyperLogLog.

I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).

Author: Reynold Xin <[email protected]>

Closes apache#897 from rxin/hll and squashes the following commits:

4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
wangyum added a commit that referenced this pull request May 26, 2023
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

### What changes were proposed in this pull request?

Parquet 1.12.0 New Feature
- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

### Why are the changes needed?

- Bloom filters to improve filter performance
- ZSTD enhancement

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

create/write parquet encryption table.
```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

create table parquet_encryption(a int, b int, c int)
using parquet
options (
'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
'parquet.encryption.footer.key' 'footerKey');
```

read parquet encryption table;

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

select ... from parquet_encryption ...
```

Will raise another pr for default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants