Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5021] [MLlib] Gaussian Mixture now supports Sparse Input #4459

Closed
wants to merge 5 commits into from

Conversation

MechCoder
Copy link
Contributor

Following discussion in the Jira.

@MechCoder
Copy link
Contributor Author

ping @tgaloppo and @jkbradley (whenever you are back!)

@SparkQA
Copy link

SparkQA commented Feb 8, 2015

Test build #27049 has started for PR 4459 at commit 6d8d4bb.

  • This patch merges cleanly.


import org.apache.spark.annotation.Experimental
import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vector, Vectors}
import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors, DenseVector,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alphabetize these

@SparkQA
Copy link

SparkQA commented Feb 8, 2015

Test build #27049 has finished for PR 4459 at commit 6d8d4bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27049/
Test PASSed.

@MechCoder
Copy link
Contributor Author

@tgaloppo Thanks. I shall fix them in a while. However, does the general code look good to you?

@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27087 has started for PR 4459 at commit c235bd0.

  • This patch merges cleanly.

@MechCoder MechCoder changed the title [SPARK-5021] Gaussian Mixture now supports Sparse Input [SPARK-5021] [MLlib] Gaussian Mixture now supports Sparse Input Feb 9, 2015
@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27087 has finished for PR 4459 at commit c235bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27087/
Test PASSed.

def syr(alpha: Double, x: SparseVector, A: DenseMatrix) {
val mA = A.numRows
val nA = A.numCols
require(mA == nA, s"A is not a square matrix. A: $mA x $nA")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Symmetry is a requirement of the syr() function; what is being performed is an incomplete check. The old message mismatched the check, but matched the requirement; perhaps this could be something like "A is not square, and thus cannot be symmetric" ... or something along that line?

@MechCoder
Copy link
Contributor Author

@tgaloppo Thanks for your valuable feedback. Do you have anything more to add as of now?

@tgaloppo
Copy link
Contributor

tgaloppo commented Feb 9, 2015

@MechCoder Nothing else stands out to me... I will give it another look after your next commit.

@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27107 has started for PR 4459 at commit 8580799.

  • This patch merges cleanly.

@MechCoder
Copy link
Contributor Author

@tgaloppo I fixed it up. Can you have a look?

@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27108 has started for PR 4459 at commit 3154f6f.

  • This patch merges cleanly.

@MechCoder
Copy link
Contributor Author

Also a noob question, but what is the significance of the negative variance in the tests?

@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27107 has finished for PR 4459 at commit 8580799.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27107/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 9, 2015

Test build #27108 has finished for PR 4459 at commit 3154f6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27108/
Test PASSed.

@tgaloppo
Copy link
Contributor

tgaloppo commented Feb 9, 2015

@MechCoder Do you mean the negative values in the covariance (sigma) matrices? Negative covariance indicates, roughly speaking, that variables move in opposite directions... as one increases, the other decreases (positive covariance indicate they move together, and zero covariance indicate they move independently). The covariance matrix has the variance of the individual vector components on the diagonal, and the covariance between vector components off the diagonal (and is symmetric). Does this help??

val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data)
assert(gmm.weights(0) ~== Ew absTol 1E-5)
assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check covariance matrix here. The Esigma given here does not look right. If you need help computing it, let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake; I am sorry.

@tgaloppo
Copy link
Contributor

tgaloppo commented Feb 9, 2015

@MechCoder Getting close; just need to finish up the sparse single cluster test.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27171 has started for PR 4459 at commit 0679440.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27171/
Test PASSed.

@MechCoder
Copy link
Contributor Author

@tgaloppo Alright, thanks for the explanation. Can I know, what makes you think that the covariance matrix is wrong? I calculated it manually and it seems to be right. I added the test and it also passes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27179 has started for PR 4459 at commit e579041.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27179 has finished for PR 4459 at commit e579041.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27179/
Test PASSed.

@tgaloppo
Copy link
Contributor

LGTM

cc: @jkbradley @mengxr

@MechCoder
Copy link
Contributor Author

@mengxr Just to make it easier for you, a small description. GaussianMixture used to support sparse input, by converting it to DenseVectors, which is non-optimal, in case of a large number of zeros. This PR makes sure that the model works without this conversion happening.

@@ -235,12 +235,23 @@ private[spark] object BLAS extends Serializable with Logging {
* @param x the vector x that contains the n elements.
* @param A the symmetric matrix A. Size of n x n.
*/
def syr(alpha: Double, x: DenseVector, A: DenseMatrix) {

// Wrapper around the dense and sparse methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not part of the JavaDoc. We can either move it to the JavaDoc or to the method closure.

@MechCoder
Copy link
Contributor Author

@mengxr Fixed up your comments. Let me know if there is anything else.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27230 has started for PR 4459 at commit b997aa3.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27231 has started for PR 4459 at commit 1b18dab.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27230 has finished for PR 4459 at commit b997aa3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27230/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27231 has finished for PR 4459 at commit 1b18dab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27231/
Test PASSed.

asfgit pushed a commit that referenced this pull request Feb 10, 2015
Following discussion in the Jira.

Author: MechCoder <[email protected]>

Closes #4459 from MechCoder/sparse_gmm and squashes the following commits:

1b18dab [MechCoder] Rewrite syr for sparse matrices
e579041 [MechCoder] Add test for covariance matrix
5cb370b [MechCoder] Separate tests for sparse data
5e096bd [MechCoder] Alphabetize and correct error message
e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input

(cherry picked from commit fd2c032)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Copy link
Contributor

mengxr commented Feb 10, 2015

LGTM. Merged into master and branch-1.3. Thanks!

@asfgit asfgit closed this in fd2c032 Feb 10, 2015
@MechCoder MechCoder deleted the sparse_gmm branch February 11, 2015 04:04
@MechCoder
Copy link
Contributor Author

Thanks @tgaloppo and @mengxr . Any idea what to touch in GaussianMixture next? The parallelized Gaussian initialization?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants