Streaming KMeans [MLLIB][SPARK-3254] #2942

freeman-lab · 2014-10-25T08:12:46Z

This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:

- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors

@tdas @mengxr @rezazadeh
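As a rough illustration of the decay-weighted update described above, here is a minimal sketch in plain Python. It assumes the rule has the form `c_new = (c * n * a + x * m) / (n * a + m)`, where `c` is the current center, `n` its accumulated weight, `x` the mean of the points newly assigned to the cluster, `m` their count, and `a` the decay factor. The function name and arguments are hypothetical and are not part of the MLlib API.

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """Blend an existing cluster center with the mean of newly assigned points.

    center, batch_mean: lists of floats with the same length
    weight: effective number of points the cluster has absorbed so far
    batch_count: number of points assigned to the cluster in this batch
    decay: 1.0 keeps all history, 0.0 uses only the newest batch (assumed convention)
    """
    discounted = weight * decay          # how much the past still counts
    total = discounted + batch_count     # normalizer for the weighted average
    if total == 0.0:
        # Nothing to average: no history and no new points, keep the center as is.
        return list(center)
    return [(c * discounted + x * batch_count) / total
            for c, x in zip(center, batch_mean)]
```

With `decay = 1.0` this reduces to a running average over all data seen so far; with `decay = 0.0` the center jumps to the mean of the latest batch.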

- Used trainOn and predictOn pattern, similar to StreamingLinearAlgorithm
- Decay factor can be set explicitly, or via fractional decay parameters expressed in units of number of batches, or number of points
- Unit tests for basic functionality and decay settings

SparkQA · 2014-10-25T08:17:24Z

Test build #22209 has started for PR 2942 at commit 2086bdc.

This patch merges cleanly.

SparkQA · 2014-10-25T09:29:43Z

Test build #22209 has finished for PR 2942 at commit 2086bdc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-10-25T09:29:47Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22209/
Test PASSed.

AtlasPilotPuppy · 2014-10-27T06:38:20Z

Should we create another PR for the Python bindings/example?

coderxiang · 2014-10-28T00:10:56Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala

+
+@DeveloperApi
+class StreamingKMeans(
+     var k: Int,


indent of 4 spaces?

mengxr · 2014-10-28T17:19:30Z

@anantasty This PR is still in review. If you are interested in Python bindings for streaming algorithms, could you help add one for StreamingLinearRegression? Thanks!

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala

AtlasPilotPuppy · 2014-10-28T17:24:47Z

I would certainly be interested in doing that. I just wasn't sure if it was better to do it as a separate PR/task.

mengxr · 2014-10-28T17:27:04Z

It should be in a separate JIRA (and hence a separate PR). Please create a JIRA for StreamingLinearRegression and ping me there. Thanks!

freeman-lab · 2014-10-28T17:28:55Z

@anantasty Agreed, should be separate, but would be very cool to have! Ping me as well, happy to provide feedback.

mengxr · 2014-10-28T17:55:13Z

docs/mllib-clustering.md

+
+## Streaming clustering
+
+When data arrive in a stream, we may want to estimate clusters dynamically, updating them as new data arrive. MLlib provides support for streaming KMeans clustering, with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm uses a generalization of the mini-batch KMeans update rule. For each batch of data, we assign all points to their nearest cluster, compute new cluster centers, then update each cluster using:


line too wide

KMeans -> k-means

SparkQA · 2014-10-29T05:17:32Z

Test build #22426 has started for PR 2942 at commit 374a706.

This patch merges cleanly.

SparkQA · 2014-10-29T05:18:28Z

Test build #22426 has finished for PR 2942 at commit 374a706.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-10-29T05:18:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22426/
Test FAILed.

SparkQA · 2014-10-29T05:29:46Z

Test build #22428 has started for PR 2942 at commit 9f7aea9.

This patch merges cleanly.

SparkQA · 2014-10-29T06:44:50Z

Test build #22428 has finished for PR 2942 at commit 9f7aea9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-10-29T06:44:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22428/
Test PASSed.

- Use a single halfLife parameter that now determines the decay factor directly
- Allow specification of timeUnit for the halfLife as “batches” or “points”
- Documentation adjusted accordingly
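To make the halfLife parameterization concrete, here is a small sketch in plain Python. It assumes the decay factor is chosen so that the weight of past data falls to one half after `half_life` time units, i.e. `a = exp(ln(0.5) / half_life)`, and that with the “points” unit the per-batch discount is `a` raised to the number of points in the batch. Both function names are hypothetical, and the exact formula used in the PR should be checked against the code.

```python
import math


def decay_from_half_life(half_life):
    """Return a per-unit decay factor a such that a ** half_life == 0.5."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return math.exp(math.log(0.5) / half_life)


def batch_discount(half_life, time_unit, points_in_batch):
    """Discount applied to past data when one new batch arrives (assumed behavior)."""
    a = decay_from_half_life(half_life)
    if time_unit == "batches":
        # One batch is one time unit, regardless of its size.
        return a
    if time_unit == "points":
        # Each point is one time unit, so a large batch forgets more.
        return a ** points_in_batch
    raise ValueError("time_unit must be 'batches' or 'points'")
```

For example, with a half-life of 5 batches, data from 5 batches ago carries half the weight of the current batch; with a half-life of 100 points, a single batch of 100 points halves the weight of everything before it.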

freeman-lab · 2014-10-31T09:05:23Z

@mengxr I implemented the new parameterization (and tried to make the docs on it more intuitive), see what you think!

SparkQA · 2014-10-31T09:10:12Z

Test build #22607 has started for PR 2942 at commit 0411bf5.

This patch merges cleanly.

SparkQA · 2014-10-31T10:23:13Z

Test build #22607 has finished for PR 2942 at commit 0411bf5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-10-31T10:23:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22607/
Test PASSed.

mengxr · 2014-10-31T19:47:36Z

@freeman-lab I made some changes in freeman-lab#1, which include the following:

- discount on previous counts
- detecting dying clusters
- use BLAS if possible
- use dense vectors in aggregation

If the update looks good to you, could you merge that PR? Thanks!
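The first two items in that list can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from the linked PR: it assumes previous counts are multiplied by the discount before the new counts are added, and that a cluster counts as "dying" when its weight is negligible relative to the largest cluster, in which case the largest cluster is split in two. The function names and the thresholds `ratio` and `jitter` are hypothetical.

```python
def discount_weights(weights, batch_counts, discount):
    """Shrink the accumulated weights, then add the counts from the new batch."""
    return [w * discount + m for w, m in zip(weights, batch_counts)]


def revive_dying_cluster(centers, weights, ratio=1e-8, jitter=1e-14):
    """Split the heaviest cluster if the lightest one has effectively vanished.

    Mutates centers and weights in place and returns True if a split happened.
    ratio and jitter are illustrative values, not confirmed by the source.
    """
    indices = range(len(weights))
    largest = max(indices, key=weights.__getitem__)
    smallest = min(indices, key=weights.__getitem__)
    if weights[smallest] >= ratio * weights[largest]:
        return False  # every cluster still carries meaningful weight

    # Share the combined weight equally between the two clusters.
    shared = (weights[largest] + weights[smallest]) / 2.0
    weights[largest] = shared
    weights[smallest] = shared

    # Place both centers at the old largest center, nudged apart slightly
    # so that later batches can pull them in different directions.
    big = centers[largest]
    centers[smallest] = [x - jitter * max(abs(x), 1.0) for x in big]
    centers[largest] = [x + jitter * max(abs(x), 1.0) for x in big]
    return True
```

Discounting the counts as well as the centers keeps the effective weight of old data bounded, which is what lets a cluster with no recent points fade out and be detected.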

Update Streaming K-Means

SparkQA · 2014-11-01T02:02:31Z

Test build #22673 has started for PR 2942 at commit 078617c.

This patch merges cleanly.

SparkQA · 2014-11-01T03:18:18Z

Test build #22673 has finished for PR 2942 at commit 078617c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-11-01T03:18:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22673/
Test PASSed.

freeman-lab · 2014-11-01T03:29:30Z

@mengxr great updates! LGTM. Just need to update the doc/examples in a couple places I think.

SparkQA · 2014-11-01T03:39:53Z

Test build #22677 has started for PR 2942 at commit b2e5b4a.

This patch merges cleanly.

SparkQA · 2014-11-01T04:58:45Z

Test build #22677 has finished for PR 2942 at commit b2e5b4a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingKMeansModel(
- class StreamingKMeans(

AmplabJenkins · 2014-11-01T04:58:48Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22677/
Test PASSed.

mengxr · 2014-11-01T05:31:46Z

LGTM. Merged into master. Thanks for adding streaming k-means!

freeman-lab added 9 commits August 28, 2014 16:32

Merge remote-tracking branch 'upstream/master' into streaming-kmeans

9fd9c15

Merge remote-tracking branch 'upstream/master' into streaming-kmeans

a0fd790

Add better documentation

b5b5f8d

Add explanation and example to docs

f33684b

Example usage for StreamingKMeans

5db7074

Bug fix

9facbe3

More documentation

ea9877c

Log cluster center updates

2086bdc

freeman-lab added 4 commits October 28, 2014 22:14

Make random seed an argument

9cfc301

Make initialization check an assertion

77dbd3f

Use labeled points and predictOnValues in examples

ad9bdc2

Formatting

374a706

Style fixes

9f7aea9

Change decay parameterization

0411bf5

take discount on previous weights; use BLAS; detect dying clusters

2e682c0

Merge pull request #1 from mengxr/SPARK-3254

078617c

Fixes to docs / examples

b2e5b4a

asfgit closed this in 98c556e Nov 1, 2014
