-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-1240: handle the case of empty RDD when takeSample #135
Conversation
Merged build triggered. |
Merged build started. |
@@ -310,6 +310,9 @@ abstract class RDD[T: ClassTag]( | |||
* Return a sampled subset of this RDD. | |||
*/ | |||
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = { | |||
if (fraction < Double.MinValue || fraction > Double.MaxValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use require. i.e.
require(fraction > Double.MinValue && fraction < Double.MaxValue, "...")
Shouldn't you just check for fraction > 0 but < 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lower bound should be >= 0.0. Sample with replacement can have a faction greater than 1.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @rxin , I'm also a bit confused here, I think the name of the argument is a bit confusing
The above line contains a multiplier to ensure that the sampling can return enough sample points in most of cases..(I think so), so the fraction value can actually be larger than 1
also, this value actually determines the mean value of Poisson/Bernoulli distribution
@@ -310,6 +310,8 @@ abstract class RDD[T: ClassTag]( | |||
* Return a sampled subset of this RDD. | |||
*/ | |||
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = { | |||
require(fraction >= 0 && fraction <= Double.MaxValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require(fraction >= 0.0)
should be sufficient here.
Merged build finished. |
All automated tests passed. |
Merged build triggered. |
Merged build started. |
@@ -457,6 +457,10 @@ class RDDSuite extends FunSuite with SharedSparkContext { | |||
|
|||
test("takeSample") { | |||
val data = sc.parallelize(1 to 100, 2) | |||
val emptySet = data.mapPartitions { iter => Iterator.empty } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us create a separate test "takeSample from an empty rdd" and construct an empty rdd directly:
val emptyRdd = sc.parallelize(Seq.empty[Int], 2)
Merged build finished. |
All automated tests passed. |
Merged build triggered. |
Merged build started. |
LGTM. Waiting for Jenkins. |
Merged build finished. |
All automated tests passed. |
Can you check whether this is broken in Python too, and fix it there as well? |
sure, will do that this evening~ |
Merged build triggered. |
Merged build started. |
Merged build finished. |
All automated tests passed. |
@mateiz, done~ |
Actually sorry, one other thought -- instead of throwing an error when Anyway both this and the empty RDD case are good catches. |
Hi, @mateiz , I think the current implementation allows fraction == 0 case, or I misunderstood something? |
Ah, thanks, I missed that. Merged this in. |
https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty RDD case when run takeSample In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value In the test case, I also add several lines for this case Author: CodingCat <[email protected]> Closes #135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty red 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty RDD case when run takeSample In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value In the test case, I also add several lines for this case Author: CodingCat <[email protected]> Closes apache#135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty red 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
Previous merge with upstream broke the build. Author: Herman van Hovell <[email protected]> Closes apache#135 from hvanhovell/fix-hook-calling-external-catalog.
[ESPARK-135] 解决当不需要合并的时候仍然更换临时目录,导致最终结束为空的问题 解决当不需要合并的时候仍然更换临时目录,导致最终结束为空的问题 . resolve apache#135 See merge request !82
(cherry picked from commit b27ea82)
* Support k8s E2E test against specified k8s version - Create install-k8s role - Support to specify version of k8s and etcd - Add new job cloud-provider-openstack-acceptance-test-e2e-conformance-latest-release - Skip to copy 0 size test_results.html - Add post.yaml in cloud-provider-openstack-acceptance-test-e2e-conformance as a placeholder to upload test result to testgrid, but upload_e2e.py can not be found, implement it in following patchset. Partial-Bug: kubernetes/cloud-provider-openstack#103
* apache#135 bump jackson version to 2.10.4 * apache#135 update spark version r41 Co-authored-by: Yu Gan <[email protected]>
Co-authored-by: Yu Gan <[email protected]>
* apache#135 bump jackson version to 2.10.4 * apache#135 update spark version r41 Co-authored-by: Yu Gan <[email protected]>
Co-authored-by: Yu Gan <[email protected]>
### What changes were proposed in this pull request? The pr aims to upgrade commons-codec from 1.15 to 1.16.0. ### Why are the changes needed? 1.The new version brings some bug fixed, eg: - Fix byte-skipping in Base16 decoding #135. Fixes CODEC-305. - BaseNCodecOutputStream.eof() should not throw IOException. - Add support for Blake3 family of hashes. Fixes [CODEC-296](https://issues.apache.org/jira/browse/CODEC-296). 2.The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #41707 from panbingkun/SPARK-44151. Authored-by: panbingkun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
https://spark-project.atlassian.net/browse/SPARK-1240
It seems that the current implementation does not handle the empty RDD case when run takeSample
In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value
In the test case, I also add several lines for this case