Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8018][MLlib]KMeans should accept initial cluster centers as param #6737

Closed
wants to merge 16 commits into from

Conversation

FlytxtRnD
Copy link
Contributor

This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.

@SparkQA
Copy link

SparkQA commented Jun 10, 2015

Test build #34569 has finished for PR 6737 at commit 6959861.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (model.k == k) {
initialModel = Some(model)
} else {
throw new IllegalArgumentException("mismatched cluster count (model.k != k)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just require this condition upfront?

@SparkQA
Copy link

SparkQA commented Jun 11, 2015

Test build #34682 has finished for PR 6737 at commit e9c35d7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FlytxtRnD
Copy link
Contributor Author

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Jun 11, 2015

Test build #34687 timed out for PR 6737 at commit e9c35d7 after a configured wait of 175m.

@FlytxtRnD
Copy link
Contributor Author

Can somebody help me with this test failure?

@srowen
Copy link
Member

srowen commented Jun 12, 2015

Jenkins, retest this please

@srowen
Copy link
Member

srowen commented Jun 12, 2015

It may be a problem with the pR builder

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34753 has finished for PR 6737 at commit e9c35d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34754 has finished for PR 6737 at commit e9c35d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FlytxtRnD
Copy link
Contributor Author

Does this patch look fine to merge?

* @param maxIterations max number of iterations
* @param initialModel an existing set of cluster centers.
*/
def train(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure at this point what the thinking is on adding yet another overload to the utility method. At some point one is expected to use KMeans directly, and I recall some move to stop adding these utility methods. But I am not sure -- @mengxr @jkbradley any opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This extra static method is not necessary since we decided we prefer the builder pattern, as @srowen said.

@FlytxtRnD
Copy link
Contributor Author

@mengxr @jkbradley Could you please comment on @srowen 's note above ?

@FlytxtRnD
Copy link
Contributor Author

@jkbradley Gentle remainder.

// random or k-means|| initializationMode
private var initialModel: Option[KMeansModel] = None

/** Set the initial starting point, bypassing the random initialization or k-means||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala style: comment should begin on line after /** (See other examples of multi-line comments in this file.)

@FlytxtRnD
Copy link
Contributor Author

I am updating the PR with the suggested changes. Only one run condition is handled by adding a require in setInitialModel. I will modify it based on further suggestions.

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #35261 has finished for PR 6737 at commit 3f5fc8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 3, 2015

Test build #36487 has finished for PR 6737 at commit d12336e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2015

Test build #36561 has finished for PR 6737 at commit 06d13ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FlytxtRnD
Copy link
Contributor Author

@jkbradley please review

}
// Only one run is allowed when initialModel is given
val numRuns = if (initialModel.nonEmpty) 1 else runs
logWarning("Ignoring runs; one run is allowed when initialModel is given.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please print warning only if initialModel.nonEmpty && runs > 1

@jkbradley
Copy link
Member

Just those 2 minor items

@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #36792 has finished for PR 6737 at commit c446c58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
// Only one run is allowed when initialModel is given
val numRuns = if (initialModel.nonEmpty){
if (runs >1 ) logWarning("Ignoring runs; one run is allowed when initialModel is given.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be careful about Scala style. Look at [https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide] or other parts of the codebase for examples. We try hard to keep a consistent style. Here:

val numRuns = if (initialModel.nonEmpty) {
  if (runs > 1) logWarning("Ignoring runs; one run is allowed when initialModel is given.")
  1
} else {
  runs
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbradley the if statement is given 2 space indentation..tat's correct rgt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace your current code with the snippet I wrote above (defining numRuns). The current code does not follow the style guide.

@jkbradley
Copy link
Member

Looks good except for that style issue

@jkbradley
Copy link
Member

We are trying to improve the style checker, but it's a difficult task.

@FlytxtRnD
Copy link
Contributor Author

@jkbradley sorry for the repeating the style errors..I hope the documentation added is also fine.

@jkbradley
Copy link
Member

@FlytxtRnD Yep, thanks for adding that doc!

@FlytxtRnD
Copy link
Contributor Author

@jkbradley Is this PR ready for merge ? Please let us know if there is anything more to do.

@jkbradley
Copy link
Member

@FlytxtRnD Can you please fix that style error I noted above?

@FlytxtRnD
Copy link
Contributor Author

@jkbradley The style error reported above has already been fixed in the previous commit. Is there any other style issue that has to resolved ??

@bhupendramishra
Copy link

Hi All,
I m having issue with starting for name node, can some one please help . i
m having following error.
I understand this could not be right community however any help will be
highly appreciating...

2015-07-13 02:51:23,168 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode
metrics system... 2015-07-13 02:51:23,169 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system
stopped. 2015-07-13 02:51:23,169 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system
shutdown complete. 2015-07-13 02:51:23,169 FATAL
org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error
replaying edit log at offset 1048576. Expected transaction ID was 332187
Recent opcode offsets: 25 at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:197)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:137)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:820)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:678)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1006)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:736)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:531)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:587)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:754) at
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:738) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1427)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1493)
Caused by:
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException:
got premature end-of-file at txid 332186; expected file to go up to 332192
at
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
at
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
... 12 more

On Mon, Jul 13, 2015 at 1:45 PM, FlytxtRnD [email protected] wrote:

@jkbreadley The style error reported above has already been fixed in the
previous commit. Is there any other style issue that has to resolved ??


Reply to this email directly or view it on GitHub
#6737 (comment).

@jkbradley
Copy link
Member

@FlytxtRnD I just commented again on the lines with the style problem. If you believe it has been fixed, maybe you forgot to push an update?

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37192 has finished for PR 6737 at commit ef95ee2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37197 has finished for PR 6737 at commit 94b56df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FlytxtRnD
Copy link
Contributor Author

@jkbradley Please merge if everything seems to be fine

@jkbradley
Copy link
Member

LGTM, merging with master
Thank you for the PR!

@asfgit asfgit closed this in 3f6296f Jul 15, 2015
@FlytxtRnD
Copy link
Contributor Author

@jkbradley Thank you so much for your help and co-operation.

@xjlin0
Copy link

xjlin0 commented Sep 23, 2015

Hi Folks, really appreciate your efforts to create such fabulous framework. One question:

Under PySpark shell, I tried to create a KMeans model with initialModel but got an error:

kmeans = KMeans.train(parsedData, 10, maxIterations=20, initialModel=initial_points)
TypeError: train() got an unexpected keyword argument 'initialModel'

Same error prompted when using setInitialModel=. Could anybody let me know how could I set initial cluster centers or any documentation/examples of initialModel in KMeans? Thanks again!

@jkbradley
Copy link
Member

Thanks for pointing this out! It's a missing feature in PySpark currently. I just made a JIRA: [https://issues.apache.org/jira/browse/SPARK-10779]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants