[SPARK-29967][ML][PYTHON] KMeans support instance weighting #26739
Conversation
if (iteration == 0) {
  instr.foreach(_.logNumExamples(collected.values.map(_._2).sum))
}
We don't have counts any more. Is it OK to remove this?
Can you just log the sum of weights? It keeps the same info in the unweighted case, and it's still sort of meaningful as 'number of examples' in the weighted case.
1. I guess we need to add a new var count: Long to get the total count of the dataset, since in other algorithms like LinearSVC and LogisticRegression, instr.logNumExamples logs the unweighted count;
2. Since more and more algorithms support weightCol, I think we could add a new method like instr.logSumOfWeights.
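To illustrate the distinction being discussed, here is a minimal pure-Scala sketch of tracking both quantities in one pass: the unweighted example count (what instr.logNumExamples reports) and the sum of instance weights (what the proposed instr.logSumOfWeights would report). The names and the helper itself are illustrative, not the actual Instrumentation implementation.

```scala
// Illustrative helper: one pass over (point, weight) pairs, returning
// (unweighted count, sum of weights). With uniform weights of 1.0 the
// two values coincide, which is why logNumExamples alone sufficed before.
object WeightStats {
  def countAndWeightSum(instances: Seq[(Array[Double], Double)]): (Long, Double) =
    instances.foldLeft((0L, 0.0)) { case ((count, weightSum), (_, weight)) =>
      (count + 1L, weightSum + weight)
    }
}
```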
I guess maybe leave the code this way for now, and open a separate PR later on to add a method instr.logSumOfWeights and use it in all the algorithms that support weights?
I am OK with adding a new instr.log method in another PR. Here I prefer to keep instr.logNumExamples logging the unweighted count, to stay in sync with the other algorithms.
Updated, thanks! I also added logSumOfWeights. I will update the other algorithms that have weightCol once this PR is merged.
Test build #114735 has finished for PR 26739 at commit
Looks reasonable.
}

private[spark] def run(
    data: RDD[Vector],
private[spark] def runWithweight(
Nit: runWithWeight
val clusterWeightSum = Array.fill(thisCenters.length)(0.0)

pointsAndWeights.foreach { case (point, weight) =>
  var (bestCenter, cost) = distanceMeasureInstance.findClosest(thisCenters, point)
Total nit, but you can use val and then pass cost * weight to costAccum.add.
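A minimal sketch of the reviewer's suggestion, using plain Scala collections instead of Spark. The findClosest here is a stand-in for distanceMeasureInstance.findClosest (assumed to return the best center index and its squared distance), and the local acc stands in for costAccum; neither is the actual Spark implementation.

```scala
// Stand-in for distanceMeasureInstance.findClosest: returns the index of the
// nearest center (by squared Euclidean distance) and the distance itself.
def findClosest(centers: Array[Array[Double]], point: Array[Double]): (Int, Double) = {
  val costs = centers.map(c => c.zip(point).map { case (a, b) => (a - b) * (a - b) }.sum)
  val best = costs.indices.minBy(i => costs(i))
  (best, costs(best))
}

// The suggested shape: destructure with val (not var) and accumulate
// cost * weight directly, rather than mutating cost afterwards.
def totalWeightedCost(
    centers: Array[Array[Double]],
    pointsAndWeights: Seq[(Array[Double], Double)]): Double = {
  var acc = 0.0 // local stand-in for costAccum
  pointsAndWeights.foreach { case (point, weight) =>
    val (bestCenter, cost) = findClosest(centers, point)
    acc += cost * weight // pass cost * weight to the accumulator
  }
  acc
}
```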
Test build #114794 has finished for PR 26739 at commit
}

val instances: RDD[(OldVector, Double)] = dataset.select(
  DatasetUtils.columnToVector(dataset, getFeaturesCol),
Nit: why break this line? You could write:
dataset
  .select(DatasetUtils.columnToVector(dataset, getFeaturesCol), w)
  .rdd.map { ...
I will update the format.
@@ -100,6 +100,18 @@ private[spark] abstract class DistanceMeasure extends Serializable {
  new VectorWithNorm(sum)
}

/**
Is the above def centroid(sum: Vector, count: Long): VectorWithNorm still needed?
Yes, it is still used by BisectingKMeans.
// clusterWeightSum is needed to calculate cluster center
// cluster center =
//   sample1 * weight1/clusterWeightSum + sample2 * weight2/clusterWeightSum + ...
val clusterWeightSum = Array.fill(thisCenters.length)(0.0)
Nit: Array.ofDim[Double](thisCenters.length) or new Array[Double](thisCenters.length)
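The weighted-center formula from the code comment above can be sketched in plain Scala. This is an illustrative, single-cluster version of the math, not the actual Spark aggregation; it uses the reviewer's suggested new Array[Double](...) allocation style.

```scala
// Weighted cluster center for one cluster:
//   center = (w1*x1 + w2*x2 + ...) / (w1 + w2 + ...)
// i.e. each point contributes weight_i / clusterWeightSum of itself.
def weightedCenter(pointsAndWeights: Seq[(Array[Double], Double)]): Array[Double] = {
  val dim = pointsAndWeights.head._1.length
  val sum = new Array[Double](dim) // suggested allocation style
  var clusterWeightSum = 0.0
  pointsAndWeights.foreach { case (point, weight) =>
    var i = 0
    while (i < dim) {
      sum(i) += point(i) * weight
      i += 1
    }
    clusterWeightSum += weight
  }
  sum.map(_ / clusterWeightSum)
}
```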
Test build #114831 has finished for PR 26739 at commit
Test build #114869 has finished for PR 26739 at commit
}.collectAsMap()

if (iteration == 0) {
-  instr.foreach(_.logNumExamples(collected.values.map(_._2).sum))
+  instr.foreach(_.logNumExamples(data.count()))
}
Nit: what about using a sc.longAccumulator to accumulate the count, like costAccum?
Updated. Thanks!
Test build #115046 has finished for PR 26739 at commit
Merged to master
Thanks! @srowen @zhengruifeng
What changes were proposed in this pull request?
Add instance weight support in KMeans.
Why are the changes needed?
KMeans should support instance weighting.
Does this PR introduce any user-facing change?
Yes.
KMeans.setWeightCol
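For reference, the new user-facing API could be used roughly like this. This is a sketch against the Spark ML KMeans estimator; the DataFrame df and its column names ("features", "weight") are made up for illustration.

```scala
import org.apache.spark.ml.clustering.KMeans

// df is assumed to be a DataFrame with a vector column "features"
// and a double column "weight" holding per-instance weights.
val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setWeightCol("weight") // new in this PR (SPARK-29967)
val model = kmeans.fit(df)
```

With weightCol unset (or all weights equal to 1.0), the fit should behave as before.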
How was this patch tested?
Unit Tests