[Spark-7879][MLlib] KMeans API for spark.ml Pipelines #6756

yu-iskw · 2015-06-11T04:28:54Z

I implemented the KMeans API for spark.ml Pipelines. It doesn't include the clustering abstractions for spark.ml (SPARK-7610); those would fit better in a separate issue. I'll try that later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.

[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879

SparkQA · 2015-06-11T05:39:37Z

Test build #34662 has finished for PR 6756 at commit a6325fc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
- case class Log2(child: Expression)
- case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes

yu-iskw · 2015-06-11T05:51:55Z

This failure was caused by org.apache.spark.streaming.StreamingListenerSuite. It seems that none of the failures are related to this PR. How should I deal with this failure?

yu-iskw · 2015-06-11T16:40:39Z

Jenkins, test this please.

SparkQA · 2015-06-11T18:19:31Z

Test build #34696 has finished for PR 6756 at commit a6325fc.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

SparkQA · 2015-06-12T18:47:57Z

Test build #34780 has finished for PR 6756 at commit fea6b07.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

yu-iskw · 2015-06-12T18:55:42Z

@jkbradley Could you review this code at your earliest convenience? Thanks!

SparkQA · 2015-06-23T17:08:46Z

Test build #35563 has finished for PR 6756 at commit 12bdb04.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

yu-iskw · 2015-06-23T18:11:17Z

Sorry, I rebased my PR because I must also support the new copy methods for KMeans.

jkbradley · 2015-06-23T19:52:40Z

reviewing now...

jkbradley · 2015-06-23T19:53:23Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib


SparkQA · 2015-07-16T09:53:48Z

Test build #37475 has finished for PR 6756 at commit 19a9d63.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

SparkQA · 2015-07-16T10:05:24Z

Test build #37479 has finished for PR 6756 at commit c8dc6e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

jkbradley · 2015-07-17T05:12:38Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm
+   * this many times with random starting conditions (configured by the initialization mode), then
+   * return the best clustering found over any run. Default: 1.


Just noticed: Here and elsewhere, can you please state in the Param Scala doc the constraints (in this case "Must be >= 1")?
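As a minimal sketch of what stating the constraint could look like, the plain-Scala example below pairs a doc comment that spells out "Must be >= 1" with a validator that enforces it. `RunsParamSketch` and `SimpleIntParam` are hypothetical stand-ins written for illustration; they are not Spark's `Param` classes and not the code in this PR.

```scala
object RunsParamSketch {
  /** Hypothetical stand-in for a validated integer param; not Spark's API. */
  final case class SimpleIntParam(name: String, doc: String, isValid: Int => Boolean) {
    /** Returns the value unchanged if it is valid, otherwise throws IllegalArgumentException. */
    def validate(value: Int): Int = {
      require(isValid(value), s"$name given invalid value $value. $doc")
      value
    }
  }

  /**
   * Param for the number of runs of the algorithm to execute in parallel.
   * Must be >= 1. Default: 1.
   */
  val runs: SimpleIntParam = SimpleIntParam(
    "runs",
    "Number of runs of the algorithm to execute in parallel. Must be >= 1.",
    _ >= 1
  )
}
```

Keeping the constraint in both the doc string and the validator means a caller who passes an invalid value sees the same wording that the generated documentation shows.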

jkbradley · 2015-07-17T05:13:16Z

Thanks for the updates! A few more comments, but only small items

SparkQA · 2015-07-17T07:52:37Z

Test build #37587 has finished for PR 6756 at commit 4c61693.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

SparkQA · 2015-07-17T07:57:22Z

Test build #37589 has finished for PR 6756 at commit a14939b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

jkbradley · 2015-07-17T21:05:36Z

mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala

+    val transformed = model.transform(dataset)
+    val expectedColumns = Array("features", predictionColName)
+    expectedColumns.foreach { column =>
+      transformed.columns.contains(column)


This needs an assert: the result of contains is currently discarded.
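A minimal sketch of the issue and one possible fix, in plain Scala without Spark. The Boolean returned by `contains` is dropped inside the `foreach` body, so the original check can never fail. `AssertSketch`, `uncheckedColumns`, and `assertColumns` are hypothetical names, and the `columns` argument stands in for `transformed.columns`.

```scala
object AssertSketch {
  /** Mirrors the reviewed snippet: the Boolean from `contains` is discarded. */
  def uncheckedColumns(columns: Array[String], expected: Array[String]): Unit =
    expected.foreach { column =>
      columns.contains(column) // result ignored, so a missing column goes unnoticed
    }

  /** One possible fix: assert on each lookup so a missing column fails the test. */
  def assertColumns(columns: Array[String], expected: Array[String]): Unit =
    expected.foreach { column =>
      assert(columns.contains(column), s"missing column: $column")
    }
}
```

With the second form, a column name that is absent from the transformed output raises an AssertionError that names the missing column.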

jkbradley · 2015-07-17T21:05:56Z

The changes look good, save for those 2 tiny items. That should be all!

yu-iskw · 2015-07-17T22:36:37Z

Oh, I'm sorry for the easy mistakes...

SparkQA · 2015-07-18T00:54:33Z

Test build #37674 has finished for PR 6756 at commit be752de.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
- class KMeansModel(JavaModel):
- class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):

jkbradley · 2015-07-18T01:29:36Z

LGTM. Thanks for contributing this big feature!
Merging with master

yu-iskw · 2015-07-18T03:05:24Z

Thank you for merging it and your continuous support!

yu-iskw changed the title ~~[Spark 7879][MLlib] KMeans API for spark.ml Pipelines~~ [Spark-7879][MLlib] KMeans API for spark.ml Pipelines Jun 11, 2015

yu-iskw force-pushed the SPARK-7879 branch from 12bdb04 to a34772e Compare June 23, 2015 18:10

jkbradley reviewed Jun 23, 2015

yu-iskw added 3 commits July 16, 2015 16:33

Add the statements about spark.ml.clustering into pyspark.ml.rst

1abb19c

Include spark.ml.clustering to python tests

19a9d63

Remove an unnecessary test

c8dc6e6

jkbradley reviewed Jul 17, 2015

yu-iskw added 6 commits July 17, 2015 14:16

Using expertSetParam and expertGetParam

effc650

Add the Scala docs about the constraints of each parameter.

ca78b7d

Switch the comparisons.

f397be4

Use getInt, instead of get

fb2417c

Remove the test about whether "features" and "prediction" columns exist or not in Python

4c61693

Fix the dashed line's length in pyspark.ml.rst

a14939b

jkbradley reviewed Jul 17, 2015

Add assertions

be752de

asfgit closed this in 34a889d Jul 18, 2015
