-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-7879][MLlib] KMeans API for spark.ml Pipelines #6756
Conversation
Test build #34662 has finished for PR 6756 at commit
|
This failure was caused by |
Jenkins, test this please. |
Test build #34696 has finished for PR 6756 at commit
|
Test build #34780 has finished for PR 6756 at commit
|
@jkbradley Could you review this code at your earliest convenience? Thanks! |
Test build #35563 has finished for PR 6756 at commit
|
Sorry, I did rebase my PR. Beause I must also support the new |
reviewing now... |
import org.apache.spark.ml.param.{Param, ParamMap, Params} | ||
import org.apache.spark.ml.util.{Identifiable, SchemaUtils} | ||
import org.apache.spark.ml.{Estimator, Model} | ||
import org.apache.spark.mllib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
Test build #37475 has finished for PR 6756 at commit
|
Test build #37479 has finished for PR 6756 at commit
|
/** | ||
* Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm | ||
* this many times with random starting conditions (configured by the initialization mode), then | ||
* return the best clustering found over any run. Default: 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed: Here and elsewhere, can you please state in the Param Scala doc the constraints (in this case "Must be >= 1")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Thanks for the updates! A few more comments, but only small items |
…st or not in Python
Test build #37587 has finished for PR 6756 at commit
|
Test build #37589 has finished for PR 6756 at commit
|
val transformed = model.transform(dataset) | ||
val expectedColumns = Array("features", predictionColName) | ||
expectedColumns.foreach { column => | ||
transformed.columns.contains(column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to assert
The changes look good, save for those 2 tiny items. That should be all! |
Oh, I'm sorry for the easy mistakes... |
Test build #37674 has finished for PR 6756 at commit
|
LGTM thanks for contributing this big feature! |
Thank you for merging it and your continuous support! |
I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.
[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879