Streaming KMeans [MLLIB][SPARK-3254] #2942
- Used trainOn and predictOn pattern, similar to StreamingLinearAlgorithm
- Decay factor can be set explicitly, or via fractional decay parameters expressed in units of number of batches or number of points
- Unit tests for basic functionality and decay settings
Test build #22209 has started for PR 2942 at commit
Test build #22209 has finished for PR 2942 at commit
Test PASSed.
Should we create another PR for the Python bindings/example?
@DeveloperApi
class StreamingKMeans(
var k: Int,
indent of 4 spaces?
@anantasty This PR is still in review. If you are interested in Python bindings for streaming algorithms, could you help add one for StreamingLinearRegression? Thanks!
I would certainly be interested in doing that. I just wasn't sure if it was
It should be in a separate JIRA (and hence a separate PR). Please create a JIRA for
@anantasty Agreed, should be separate, but would be very cool to have! Ping me as well, happy to provide feedback.
## Streaming clustering

When data arrive in a stream, we may want to estimate clusters dynamically, updating them as new data arrive. MLlib provides support for streaming KMeans clustering, with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm uses a generalization of the mini-batch KMeans update rule. For each batch of data, we assign all points to their nearest cluster, compute new cluster centers, then update each cluster using:
- line too wide
- KMeans -> k-means
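The update rule itself is not shown in this excerpt. As a rough illustration of the steps the quoted paragraph describes (assign points to the nearest center, compute per-cluster batch statistics, then blend old and new estimates with a decay factor), here is a minimal plain-Python sketch. It is not the code from this PR: the function names `nearest` and `update_centers`, the per-cluster `weights` list, and the exact weighting formula are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def nearest(centers: Sequence[Sequence[float]], point: Sequence[float]) -> int:
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], point)),
    )


def update_centers(
    centers: List[List[float]],
    weights: List[float],
    batch: Sequence[Sequence[float]],
    decay: float,
) -> Tuple[List[List[float]], List[float]]:
    """One streaming update (hypothetical helper, assumed rule).

    Each old center is weighted by weights[i] * decay, each new point by 1.
    decay = 1.0 keeps all history; decay = 0.0 uses only the current batch.
    """
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Assign every point in the batch to its nearest current center.
    for point in batch:
        i = nearest(centers, point)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]

    new_centers, new_weights = [], []
    for i in range(k):
        old_w = weights[i] * decay      # discounted weight of past data
        total = old_w + counts[i]       # weight after seeing this batch
        if total == 0:
            # No history and no new points: leave the center where it is.
            new_centers.append(list(centers[i]))
        else:
            new_centers.append(
                [(centers[i][d] * old_w + sums[i][d]) / total for d in range(dim)]
            )
        new_weights.append(total)
    return new_centers, new_weights
```

With `decay=1.0` this reduces to a running average over everything seen so far, and with `decay=0.0` each batch replaces the previous estimate for any cluster that receives points.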
Test build #22426 has started for PR 2942 at commit
Test build #22426 has finished for PR 2942 at commit
Test FAILed.
Test build #22428 has started for PR 2942 at commit
Test build #22428 has finished for PR 2942 at commit
Test PASSed.
- Use a single halfLife parameter that now determines the decay factor directly
- Allow specification of timeUnit for the halfLife as "batches" or "points"
- Documentation adjusted accordingly
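To make the halfLife parameterization concrete, here is a small sketch of one way a half-life can determine a decay factor: choose the factor so that a contribution's weight drops to one half after halfLife time units. The helper names `decay_from_half_life` and `batch_discount` are hypothetical, and treating the "points" unit as raising the per-point factor to the number of new points is an assumption about how the two units could relate, not a statement of what this PR implements.

```python
import math


def decay_from_half_life(half_life: float) -> float:
    """Decay factor a with a ** half_life == 0.5 (assumed definition)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return math.exp(math.log(0.5) / half_life)


def batch_discount(half_life: float, time_unit: str, num_new_points: int) -> float:
    """Discount applied to old estimates when one new batch arrives.

    "batches": one time unit elapses per batch, whatever its size.
    "points":  one time unit elapses per data point in the batch.
    """
    a = decay_from_half_life(half_life)
    if time_unit == "batches":
        return a
    if time_unit == "points":
        return a ** num_new_points
    raise ValueError("time_unit must be 'batches' or 'points'")
```

Under these assumptions, a half-life of 5 batches halves the influence of a batch once five more batches have arrived, while a half-life of 100 points does the same once 100 more points have arrived, regardless of how they are split into batches.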
@mengxr I implemented the new parameterization (and tried to make the docs on it more intuitive), see what you think!
Test build #22607 has started for PR 2942 at commit
Test build #22607 has finished for PR 2942 at commit
Test PASSed.
@freeman-lab I made some changes: freeman-lab#1 , which includes the following:
If the update looks good to you, could you merge that PR? Thanks!
Update Streaming K-Means
Test build #22673 has started for PR 2942 at commit
Test build #22673 has finished for PR 2942 at commit
Test PASSed.
@mengxr great updates! LGTM. Just need to update the doc/examples in a couple of places, I think.
Test build #22677 has started for PR 2942 at commit
Test build #22677 has finished for PR 2942 at commit
Test PASSed.
LGTM. Merged into master. Thanks for adding streaming k-means!
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
The PR includes:
@tdas @mengxr @rezazadeh