-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7241] Pearson correlation for DataFrames #5858
Conversation
} | ||
|
||
/** | ||
* Java Friendly implementation to calculate the Pearson correlation coefficient of two columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's not java friendly?
Test build #31646 has finished for PR 5858 at commit
|
* @return The Pearson Correlation Coefficient as a Double. | ||
*/ | ||
def corr(col1: String, col2: String, method: String): Double = { | ||
assert(method == "pearson", "Currently only the calculation of the Pearson Correlation " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.
assert can be turned off ... and assert should only be used to check internal invariants.
I will let @mengxr comment on the math part. |
Test build #31652 has finished for PR 5858 at commit
|
Test build #31662 has finished for PR 5858 at commit
|
/** Calculate the Pearson Correlation Coefficient for the given columns */ | ||
private[sql] def pearsonCorrelation(df: DataFrame, cols: Seq[String]): Double = { | ||
val counts = collectStatisticalData(df, cols) | ||
counts.Ck / math.sqrt(counts.MkX * counts.MkY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be the sample correlation as well. In the unit tests, please provide R commands that compute the correlation and the result, and verify that we output the same value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, isn't it? The n - 1's cancel. I tested with sciPy Pearsonr method.
That's why I have the non trivial (i, sqrt(i)) test.
On May 2, 2015 10:32 AM, "Xiangrui Meng" [email protected] wrote:
In
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
#5858 (comment):@@ -23,29 +23,43 @@ import org.apache.spark.sql.types.{DoubleType, NumericType}
private[sql] object StatFunctions {
- /** Calculate the Pearson Correlation Coefficient for the given columns */
- private[sql] def pearsonCorrelation(df: DataFrame, cols: Seq[String]): Double = {
- val counts = collectStatisticalData(df, cols)
- counts.Ck / math.sqrt(counts.MkX * counts.MkY)
Should be the sample correlation as well. In the unit tests, please
provide R commands that compute the correlation and the result, and verify
that we output the same value.—
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/5858/files#r29549089.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, they canceled each other. Could you add a non-trivial test to Scala? Now it only has 0.0
, 1.0
, and -1.0
. In your test, please provide Python/R commands in comments to reproduce the result. This makes easier for others to verify.
Test build #31698 has finished for PR 5858 at commit
|
// > a <- 0:19 | ||
// > b <- mapply(function(x) x * x - 2 * x + 3.5, a) | ||
// > cor(a, b) | ||
// [1] 0.957233913947585835 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this test!
LGTM. |
Thanks. Merging in master. |
submitting this PR from a phone, excuse the brevity. adds Pearson correlation to Dataframes, reusing the covariance calculation code cc mengxr rxin Author: Burak Yavuz <[email protected]> Closes apache#5858 from brkyvz/df-corr and squashes the following commits: 285b838 [Burak Yavuz] addressed comments v2.0 d10babb [Burak Yavuz] addressed comments v0.2 4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr 4fe693b [Burak Yavuz] addressed comments v0.1 a682d06 [Burak Yavuz] ready for PR
submitting this PR from a phone, excuse the brevity. adds Pearson correlation to Dataframes, reusing the covariance calculation code cc mengxr rxin Author: Burak Yavuz <[email protected]> Closes apache#5858 from brkyvz/df-corr and squashes the following commits: 285b838 [Burak Yavuz] addressed comments v2.0 d10babb [Burak Yavuz] addressed comments v0.2 4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr 4fe693b [Burak Yavuz] addressed comments v0.1 a682d06 [Burak Yavuz] ready for PR
submitting this PR from a phone, excuse the brevity. adds Pearson correlation to Dataframes, reusing the covariance calculation code cc mengxr rxin Author: Burak Yavuz <[email protected]> Closes apache#5858 from brkyvz/df-corr and squashes the following commits: 285b838 [Burak Yavuz] addressed comments v2.0 d10babb [Burak Yavuz] addressed comments v0.2 4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr 4fe693b [Burak Yavuz] addressed comments v0.1 a682d06 [Burak Yavuz] ready for PR
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code
cc @mengxr @rxin