Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14163][CORE] SumEvaluator and countApprox cannot reliably handle RDDs of size 1 #11982

Closed
wants to merge 2 commits into from

Conversation

mtustin-handy
Copy link
Contributor

What changes were proposed in this pull request?

This special cases 0 and 1 counts to avoid passing 0 degrees of freedom.

How was this patch tested?

Tests run successfully.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

val degreesOfFreedom = (counter.count - 1).toInt
new TDistribution(degreesOfFreedom).inverseCumulativeProbability(1 - (1 - confidence) / 2)
} else {
// this may not be statistically meaningful
confidence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't valid; confidence is really a probability and confFactor is a number of standard deviations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, Double.PositiveInfinity it is then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@rxin
Copy link
Contributor

rxin commented Mar 28, 2016

Do you guys use this feature? I'm actually thinking we should deprecate this feature over time.

@srowen
Copy link
Member

srowen commented Mar 28, 2016

@rxin these are used to implement sumApprox, meanApprox, etc on RDD. I don't have a view on whether those are used much. I figure we can go ahead and fix it for this case right now in any event.

@rxin
Copy link
Contributor

rxin commented Mar 28, 2016

Yea fixing them is fine. I'm asking because this is the 1st time I've seen people using this api in public and want to get some feedback here.

@mtustin-handy
Copy link
Contributor Author

I hit this with countApprox.

@mtustin-handy
Copy link
Contributor Author

@rxin @srowen Looks like github is not showing the correct diff any more. I'm going to close this, reopen and tag you.

@srowen
Copy link
Member

srowen commented Mar 29, 2016

@mtustin-handy you can just push new commits to the branch.

zhzhan pushed a commit to zhzhan/spark that referenced this pull request Apr 4, 2016
…le RDDs of size 1

## What changes were proposed in this pull request?

This special cases 0 and 1 counts to avoid passing 0 degrees of freedom.

## How was this patch tested?

Tests run successfully. New test added.

## Note:
This recreates apache#11982 which was closed to due to non-updated diff. rxin srowen Commented there.
This also adds tests, reworks the code to perform the special casing (based on srowen's comments), and adds equality machinery for BoundedDouble, as well as changing how it is transformed to string.

Author: Marcin Tustin <[email protected]>
Author: Marcin Tustin <[email protected]>

Closes apache#12016 from mtustin-handy/SPARK-14163.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants