-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8018][MLlib]KMeans should accept initial cluster centers as param #6737
Conversation
Test build #34569 has finished for PR 6737 at commit
|
if (model.k == k) { | ||
initialModel = Some(model) | ||
} else { | ||
throw new IllegalArgumentException("mismatched cluster count (model.k != k)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just require
this condition upfront?
Test build #34682 has finished for PR 6737 at commit
|
Jenkins, retest this please |
Test build #34687 timed out for PR 6737 at commit |
Can somebody help me with this test failure? |
Jenkins, retest this please |
It may be a problem with the pR builder |
Test build #34753 has finished for PR 6737 at commit
|
Test build #34754 has finished for PR 6737 at commit
|
Does this patch look fine to merge? |
* @param maxIterations max number of iterations | ||
* @param initialModel an existing set of cluster centers. | ||
*/ | ||
def train( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure at this point what the thinking is on adding yet another overload to the utility method. At some point one is expected to use KMeans
directly, and I recall some move to stop adding these utility methods. But I am not sure -- @mengxr @jkbradley any opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This extra static method is not necessary since we decided we prefer the builder pattern, as @srowen said.
@mengxr @jkbradley Could you please comment on @srowen 's note above ? |
@jkbradley Gentle remainder. |
// random or k-means|| initializationMode | ||
private var initialModel: Option[KMeansModel] = None | ||
|
||
/** Set the initial starting point, bypassing the random initialization or k-means|| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scala style: comment should begin on line after /**
(See other examples of multi-line comments in this file.)
I am updating the PR with the suggested changes. Only one run condition is handled by adding a |
Test build #35261 has finished for PR 6737 at commit
|
Test build #36487 has finished for PR 6737 at commit
|
Test build #36561 has finished for PR 6737 at commit
|
@jkbradley please review |
} | ||
// Only one run is allowed when initialModel is given | ||
val numRuns = if (initialModel.nonEmpty) 1 else runs | ||
logWarning("Ignoring runs; one run is allowed when initialModel is given.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please print warning only if initialModel.nonEmpty && runs > 1
Just those 2 minor items |
Test build #36792 has finished for PR 6737 at commit
|
} | ||
// Only one run is allowed when initialModel is given | ||
val numRuns = if (initialModel.nonEmpty){ | ||
if (runs >1 ) logWarning("Ignoring runs; one run is allowed when initialModel is given.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be careful about Scala style. Look at [https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide] or other parts of the codebase for examples. We try hard to keep a consistent style. Here:
val numRuns = if (initialModel.nonEmpty) {
if (runs > 1) logWarning("Ignoring runs; one run is allowed when initialModel is given.")
1
} else {
runs
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbradley the if statement is given 2 space indentation..tat's correct rgt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace your current code with the snippet I wrote above (defining numRuns). The current code does not follow the style guide.
Looks good except for that style issue |
We are trying to improve the style checker, but it's a difficult task. |
@jkbradley sorry for the repeating the style errors..I hope the documentation added is also fine. |
@FlytxtRnD Yep, thanks for adding that doc! |
@jkbradley Is this PR ready for merge ? Please let us know if there is anything more to do. |
@FlytxtRnD Can you please fix that style error I noted above? |
@jkbradley The style error reported above has already been fixed in the previous commit. Is there any other style issue that has to resolved ?? |
Hi All, 2015-07-13 02:51:23,168 INFO On Mon, Jul 13, 2015 at 1:45 PM, FlytxtRnD [email protected] wrote:
|
@FlytxtRnD I just commented again on the lines with the style problem. If you believe it has been fixed, maybe you forgot to push an update? |
Test build #37192 has finished for PR 6737 at commit
|
Test build #37197 has finished for PR 6737 at commit
|
@jkbradley Please merge if everything seems to be fine |
LGTM, merging with master |
@jkbradley Thank you so much for your help and co-operation. |
Hi Folks, really appreciate your efforts to create such fabulous framework. One question: Under PySpark shell, I tried to create a KMeans model with initialModel but got an error:
Same error prompted when using setInitialModel=. Could anybody let me know how could I set initial cluster centers or any documentation/examples of initialModel in KMeans? Thanks again! |
Thanks for pointing this out! It's a missing feature in PySpark currently. I just made a JIRA: [https://issues.apache.org/jira/browse/SPARK-10779] |
This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.