[SPARK-8018][MLlib]KMeans should accept initial cluster centers as param #6737

FlytxtRnD · 2015-06-10T06:51:09Z

This allows KMeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.
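To illustrate why a single run is enough when the starting centers are supplied, here is a minimal, Spark-free sketch of Lloyd's iterations seeded from caller-provided centers. The object name, the `Point` alias, and the helper functions are all hypothetical and are not part of MLlib; the actual patch works on RDDs and a KMeansModel.

```scala
// Minimal sketch, not the MLlib implementation: all names here are hypothetical.
object InitialCentersSketch {
  type Point = Vector[Double]

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(centers: Vector[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Component-wise mean of a non-empty group of points.
  private def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // One deterministic run: there is no random or k-means|| seeding,
  // so repeating the run from the same centers gives the same result.
  def run(data: Seq[Point], initialCenters: Vector[Point], maxIterations: Int): Vector[Point] = {
    var centers = initialCenters
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = data.groupBy(p => closest(centers, p))
      // A center with no assigned points is left where it is.
      val updated =
        centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i))).toVector
      changed = updated != centers
      centers = updated
      iteration += 1
    }
    centers
  }
}
```

Because the procedure is deterministic for a fixed set of starting centers, multiple runs would only repeat the same computation, which is the motivation for restricting this mode to one run.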

SparkQA · 2015-06-10T08:38:23Z

Test build #34569 has finished for PR 6737 at commit 6959861.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-06-10T09:49:18Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

+    if (model.k == k) {
+      initialModel = Some(model)
+    } else {
+      throw new IllegalArgumentException("mismatched cluster count (model.k != k)")


Just require this condition upfront?
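As a rough sketch of what checking the condition upfront could look like, the snippet below uses Scala's built-in `require`, which throws an IllegalArgumentException when the predicate is false. `CentersModel` and `KMeansConfig` are hypothetical stand-ins for KMeansModel and KMeans, introduced only so the example runs without Spark.

```scala
// Hypothetical stand-in for KMeansModel; only the cluster count matters here.
final case class CentersModel(centers: Vector[Vector[Double]]) {
  def k: Int = centers.length
}

// Hypothetical stand-in for the KMeans builder.
final class KMeansConfig(val k: Int) {
  private var initialModel: Option[CentersModel] = None

  def setInitialModel(model: CentersModel): this.type = {
    // Fail fast instead of branching with if/else and throwing by hand.
    require(model.k == k, s"mismatched cluster count (model.k = ${model.k}, k = $k)")
    initialModel = Some(model)
    this
  }

  def getInitialModel: Option[CentersModel] = initialModel
}
```

The precondition replaces the explicit if/else and keeps the setter to a single assignment on the success path.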

SparkQA · 2015-06-11T10:26:25Z

Test build #34682 has finished for PR 6737 at commit e9c35d7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

FlytxtRnD · 2015-06-11T11:20:08Z

Jenkins, retest this please

SparkQA · 2015-06-11T16:17:31Z

Test build #34687 timed out for PR 6737 at commit e9c35d7 after a configured wait of 175m.

FlytxtRnD · 2015-06-12T06:18:40Z

Can somebody help me with this test failure?

srowen · 2015-06-12T06:22:07Z

Jenkins, retest this please

srowen · 2015-06-12T06:22:14Z

It may be a problem with the PR builder

SparkQA · 2015-06-12T08:10:12Z

Test build #34753 has finished for PR 6737 at commit e9c35d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-12T08:14:18Z

Test build #34754 has finished for PR 6737 at commit e9c35d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

FlytxtRnD · 2015-06-15T06:06:01Z

Does this patch look fine to merge?

srowen · 2015-06-15T07:16:18Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

+   * @param maxIterations max number of iterations
+   * @param initialModel an existing set of cluster centers.
+   */
+  def train(


I'm not sure at this point what the thinking is on adding yet another overload to the utility method. At some point one is expected to use KMeans directly, and I recall some move to stop adding these utility methods. But I am not sure -- @mengxr @jkbradley any opinion?

I agree. This extra static method is not necessary since we decided we prefer the builder pattern, as @srowen said.


FlytxtRnD · 2015-06-16T12:44:07Z

@mengxr @jkbradley Could you please comment on @srowen's note above?

FlytxtRnD · 2015-06-18T07:37:05Z

@jkbradley Gentle reminder.

jkbradley · 2015-06-18T21:28:57Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

+  // random or k-means|| initializationMode
+  private var initialModel: Option[KMeansModel] = None
+
+  /** Set the initial starting point, bypassing the random initialization or k-means||


Scala style: comment should begin on line after /** (See other examples of multi-line comments in this file.)

FlytxtRnD · 2015-06-19T10:15:07Z

I am updating the PR with the suggested changes. The single-run condition is handled by adding a require in setInitialModel. I will modify it based on further suggestions.

SparkQA · 2015-06-19T11:36:02Z

Test build #35261 has finished for PR 6737 at commit 3f5fc8e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-03T07:47:37Z

Test build #36487 has finished for PR 6737 at commit d12336e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-06T06:23:11Z

Test build #36561 has finished for PR 6737 at commit 06d13ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

FlytxtRnD · 2015-07-07T04:04:35Z

@jkbradley please review

jkbradley · 2015-07-07T23:29:54Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

-    }
+    // Only one run is allowed when initialModel is given
+    val numRuns = if (initialModel.nonEmpty) 1 else runs
+    logWarning("Ignoring runs; one run is allowed when initialModel is given.")


Please print warning only if initialModel.nonEmpty && runs > 1

jkbradley · 2015-07-07T23:30:18Z

Just those 2 minor items

SparkQA · 2015-07-08T13:41:25Z

Test build #36792 has finished for PR 6737 at commit c446c58.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-09T01:54:38Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

-    }
+    // Only one run is allowed when initialModel is given
+    val numRuns = if (initialModel.nonEmpty){
+      if (runs >1 ) logWarning("Ignoring runs; one run is allowed when initialModel is given.")


Please be careful about Scala style. Look at [https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide] or other parts of the codebase for examples. We try hard to keep a consistent style. Here:

val numRuns = if (initialModel.nonEmpty) {
  if (runs > 1) logWarning("Ignoring runs; one run is allowed when initialModel is given.")
  1
} else {
  runs
}

@jkbradley The if statement is given 2-space indentation. That's correct, right?

That's correct.

Please replace your current code with the snippet I wrote above (defining numRuns). The current code does not follow the style guide.
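To check the requested behaviour in isolation, the rule can be written as a small pure function. This is only a sketch: `RunsSketch` and `effectiveRuns` are hypothetical names, and the warning is returned as a value instead of being passed to `logWarning`, so no logger or Spark context is needed.

```scala
object RunsSketch {
  // Hypothetical helper: returns the effective run count and, only when a
  // requested runs > 1 is being overridden, the warning that would be logged.
  def effectiveRuns(hasInitialModel: Boolean, runs: Int): (Int, Option[String]) =
    if (hasInitialModel) {
      val warning =
        if (runs > 1) Some("Ignoring runs; one run is allowed when initialModel is given.")
        else None
      (1, warning)
    } else {
      (runs, None)
    }
}
```

With an initial model and `runs = 1`, or with no initial model at all, the warning slot is empty, which matches the condition `initialModel.nonEmpty && runs > 1` requested in the review.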

jkbradley · 2015-07-09T01:54:53Z

Looks good except for that style issue

jkbradley · 2015-07-09T01:55:06Z

We are trying to improve the style checker, but it's a difficult task.

FlytxtRnD · 2015-07-09T04:02:27Z

@jkbradley Sorry for repeating the style errors. I hope the documentation added is also fine.

jkbradley · 2015-07-09T21:52:38Z

@FlytxtRnD Yep, thanks for adding that doc!

FlytxtRnD · 2015-07-10T04:12:09Z

@jkbradley Is this PR ready to merge? Please let us know if there is anything more to do.

jkbradley · 2015-07-10T18:16:53Z

@FlytxtRnD Can you please fix that style error I noted above?

FlytxtRnD · 2015-07-13T08:15:11Z

@jkbradley The style error reported above has already been fixed in the previous commit. Is there any other style issue that has to be resolved?

bhupendramishra · 2015-07-13T08:23:29Z

Hi all,
I am having an issue starting the NameNode; can someone please help? I am getting the following error.
I understand this may not be the right community, but any help would be highly appreciated.

2015-07-13 02:51:23,168 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2015-07-13 02:51:23,169 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2015-07-13 02:51:23,169 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-07-13 02:51:23,169 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 1048576. Expected transaction ID was 332187
Recent opcode offsets: 25
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:197)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:137)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:820)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:678)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1006)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:736)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:531)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:587)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:754)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:738)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1427)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1493)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 332186; expected file to go up to 332192
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
    ... 12 more


jkbradley · 2015-07-13T18:06:10Z

@FlytxtRnD I just commented again on the lines with the style problem. If you believe it has been fixed, maybe you forgot to push an update?

SparkQA · 2015-07-14T04:27:39Z

Test build #37192 has finished for PR 6737 at commit ef95ee2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-14T05:45:26Z

Test build #37197 has finished for PR 6737 at commit 94b56df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

FlytxtRnD · 2015-07-15T03:52:19Z

@jkbradley Please merge if everything seems to be fine

jkbradley · 2015-07-15T06:28:37Z

LGTM, merging with master
Thank you for the PR!

FlytxtRnD · 2015-07-15T06:31:54Z

@jkbradley Thank you so much for your help and co-operation.

xjlin0 · 2015-09-23T14:36:27Z

Hi folks, I really appreciate your efforts to create such a fabulous framework. One question:

In the PySpark shell, I tried to create a KMeans model with initialModel but got an error:

kmeans = KMeans.train(parsedData, 10, maxIterations=20, initialModel=initial_points)
TypeError: train() got an unexpected keyword argument 'initialModel'

The same error occurs when using setInitialModel=. Could anybody let me know how I can set initial cluster centers, or point me to any documentation/examples of initialModel in KMeans? Thanks again!

jkbradley · 2015-09-23T18:41:27Z

Thanks for pointing this out! It's a missing feature in PySpark currently. I just made a JIRA: [https://issues.apache.org/jira/browse/SPARK-10779]

6959861 · Accept initial cluster centers in KMeans

srowen reviewed Jun 10, 2015

e9c35d7 · Remove getInitialModel and match cluster count criteria

srowen reviewed Jun 15, 2015

16f1b53 · Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018

jkbradley reviewed Jun 18, 2015

FlytxtRnD added 2 commits June 19, 2015 09:52

cd5dc5c · Merge remote-tracking branch 'upstream/master' into Kmeans-8018
3f5fc8e · test case modified and one runs condition added
d12336e · numRuns variable modifications
06d13ef · numRuns corrected

jkbradley reviewed Jul 7, 2015

c446c58 · documentation and numRuns warning change

jkbradley reviewed Jul 9, 2015

ef95ee2 · style correction
94b56df · style correction

asfgit closed this in 3f6296f Jul 15, 2015
