In the following example, after loading and parsing data, we use a
[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
object to cluster the data into two clusters. The desired number of clusters is passed
to the algorithm. We then output the parameters of the mixture model.
{% highlight scala %}
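// The imports and data loading below are sketched in for completeness; the input
// path is illustrative, and each line is assumed to hold one whitespace-separated vector.
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/mllib/gmm_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))).cache()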

// Cluster the data into two classes using GaussianMixture
val gmm = new GaussianMixture().setK(2).run(parsedData)

// output parameters of max-likelihood model
for (i <- 0 until gmm.k) {
- println("weight=%f\nmu=%s\nsigma=\n%s\n" format
+ println("weight=%f\nmu=%s\nsigma=\n%s\n" format
(gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
{% endhighlight %}
In the following example, after loading and parsing data, we use a Python
[GaussianMixture](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture)
object to cluster the data into two clusters. The desired number of clusters is passed
to the algorithm. We then output the parameters of the mixture model.
{% highlight python %}
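# A minimal sketch of the Python example; the input path is illustrative, and each
# line is assumed to hold one whitespace-separated vector.
from pyspark.mllib.clustering import GaussianMixture
from numpy import array

data = sc.textFile("data/mllib/gmm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))

# Build the model (cluster the data into two classes)
gmm = GaussianMixture.train(parsedData, 2)

# Output the parameters of the mixture model
for i in range(2):
    print("weight = ", gmm.weights[i], "mu = ", gmm.gaussians[i].mu,
          "sigma = ", gmm.gaussians[i].sigma.toArray())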
{% endhighlight %}
In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
to infer three topics from the documents. The desired number of topics is passed
to the algorithm. We then output the topics, represented as probability distributions over words.
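
A minimal Scala sketch of such a run is shown below; the input path and the parsing of each
line into a word-count vector are illustrative, and the corpus passed to LDA must be an RDD
of (document ID, word-count vector) pairs.

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the word count vectors (one whitespace-separated count vector per line)
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs to form the corpus
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Infer three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

// Output the topics: each column of topicsMatrix is a distribution over the vocabulary
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print(s"Topic $topic:")
  for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)) }
  println()
}
{% endhighlight %}

Each column of `topicsMatrix` corresponds to one topic, with one entry per vocabulary word.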
## Streaming clustering
When data arrive in a stream, we may want to estimate clusters dynamically,
updating them as new data arrive. MLlib provides support for streaming k-means clustering,
with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
all points to their nearest cluster, compute new cluster centers, then update each cluster using:
`\begin{equation}
c_{t+1} = \frac{c_tn_t\alpha + x_tm_t}{n_t\alpha+m_t}
\end{equation}`
`\begin{equation}
n_{t+1} = n_t + m_t
\end{equation}`
Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
with `$\alpha$=0` only the most recent data will be used. This is analogous to an
exponentially-weighted moving average.
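For example, setting `$\alpha$=0` in the first equation gives `$c_{t+1} = x_t$` (the previous
center is discarded in favor of the current batch's center), while `$\alpha$=1` gives
`$c_{t+1} = \frac{c_tn_t + x_tm_t}{n_t + m_t}$`, the running mean over all points seen so far.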
The decay can be specified using a `halfLife` parameter, which determines the
correct decay factor `$\alpha$` such that, for data acquired
at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
The unit of time can be specified either as `batches` or `points`, and the update rule
will be adjusted accordingly.
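In other words, `$\alpha$` is chosen so that `$\alpha^{\text{halfLife}} = 0.5$` for the selected
unit of time, i.e. `$\alpha = 0.5^{1/\text{halfLife}}$`.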

To use streaming k-means, we first import the necessary classes.

{% highlight scala %}
import org.apache.spark.mllib.clustering.StreamingKMeans
{% endhighlight %}
Then we make an input stream of vectors for training, as well as a stream of labeled data
points for testing. We assume a StreamingContext `ssc` has been created; see the
[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
{% highlight scala %}
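// A sketch of the stream and model setup: the directory paths match the note at the end of
// this section, while the cluster count, dimensionality, and decay settings are illustrative.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)

// Create a model with random initial centers
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(3, 0.0)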
{% endhighlight %}
Now register the streams for training and testing and start the job, printing
the predicted cluster assignments on new data points as they arrive.
{% highlight scala %}
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
{% endhighlight %}
As you add new text files with data, the cluster centers will update. Each training
point should be formatted as `[x1, x2, x3]`, and each test data point
should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
the model will update. Anytime a text file is placed in `/testing/data/dir`
you will see predictions. With new data, the cluster centers will change!
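For example, a line in a training file might read `[1.0, 2.0, 3.0]`, and a corresponding line
in a test file might read `(1.0, [1.0, 2.0, 3.0])`.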