-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30102][ML][PYSPARK] GMM supports instance weighting #26735
Conversation
@@ -267,6 +268,18 @@ class GaussianMixtureSuite extends MLTest with DefaultReadWriteTest { | |||
assert(trueLikelihood ~== floatLikelihood absTol 1e-6) | |||
} | |||
|
|||
test("GMM support instance weighting") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final model is highly sensitive to the initlization, which is computed in initRandom
and highly related to the data distribution & partition.
Test build #114725 has finished for PR 26735 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't suppose there's any clean way to avoid the added computation for weights when the weights are all 1? looks hard. I'm OK with it
It really took me some days to look into the test failure. 1, In 2.4.4, I can not reproduce the doctests locally: >>> summary.clusterSizes
[2, 2, 2]
>>> summary.logLikelihood
8.14636... until I explictly set numPartition=2, like this 2, After using
The metrics are near before iter-7, but some sudden numeric change happened in iter-7. But I think it is acceptable since the internal computation is complex. So I think I need to: @srowen @huaxingao How do you think about it? |
An approach is to have two method: the second method do not include any extra multiplication. |
5e898e2
to
a068214
Compare
Test build #115008 has finished for PR 26735 at commit
|
I guess instead of changing maxIter=5 and compare the logLikelihood at iteration 5, maybe use a much bigger maxIter so it will converge. Compare the logLikelihood at convergence. It puzzled me why the logLikelihoods from iteration 7 are so different from the logLikelihoods computed using the original code. Weight is not set in the python doctest and it uses default 1.0. So in theory, this should behave exact the same as the original code, the logLikelihood at each iteration should be very similar as the logLikelihood computed using the original code, right? I tried both the original code and the code with changes, they start to have different logLikelihood at iteration 7, but both of them converge at iteration 25, with the same logLikelihood 65.02945125241477. I agree that we probably need to change the current convergence check. Seems to me that we also need to compare the logLikelihood difference to the previous difference. The difference should be smaller and smaller and eventually converge. However, I tested with the current code, the logLikelihood differences are not getting smaller consistently.
|
@huaxingao Thanks a lot for looking into this! |
@zhengruifeng does this need more investigation? |
a068214
to
fbea774
Compare
retest this please |
fbea774
to
50e15fe
Compare
@huaxingao I found that the difference comes from the method to compute var @srowen I think this PR should be OK to merge. However, as disscussed above, we can see that the convergence of GMM is not stable, the loglikelihood may even drop sharply during the training procedure (which should not happen in theory). I think we need to make GMM more numerical stable in the future, but as to supporting instance weighting, I think current PR is OK. |
@@ -41,6 +41,24 @@ private[spark] object MetadataUtils { | |||
} | |||
} | |||
|
|||
/** | |||
* Examine a schema to identify the number of features in a vector column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same method is also added in PR for RobustScaler, I tend to merge that first, and rebase this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking OK to me
slice.foreach(xi => ss += (xi.asBreeze - mean.asBreeze) ^:^ 2.0) | ||
var j = 0 | ||
while (j < numSamples) { | ||
ss += ((sampleSlice(j).asBreeze - mean.asBreeze) ^:^ 2.0) * weightSlice(j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a similar bit of code I had to change recently to compile in Scala 2.13. You don't have to address it here, but, I think that using val v: BV[Double] = sampleSlice(j).asBreeze - mean.asBreeze
first fixes it
val weightSum = sampleWeights.sum | ||
|
||
val gaussians = Array.tabulate(numClusters) { i => | ||
val sampleSlice = samples.view(i * numSamples, (i + 1) * numSamples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you could factor out i * numSamples
here and use it 4 times. Might be very slightly tidier
Test build #115659 has finished for PR 26735 at commit
|
Test build #115668 has finished for PR 26735 at commit
|
retest this please |
Test build #115725 has finished for PR 26735 at commit
|
26f6564
to
53efa70
Compare
Test build #115754 has finished for PR 26735 at commit
|
Merged to master, thanks @srowen @huaxingao for reviewing! |
### What changes were proposed in this pull request? supports instance weighting in GMM ### Why are the changes needed? ML should support instance weighting ### Does this PR introduce any user-facing change? yes, a new param `weightCol` is exposed ### How was this patch tested? added testsuits Closes apache#26735 from zhengruifeng/gmm_support_weight. Authored-by: zhengruifeng <[email protected]> Signed-off-by: zhengruifeng <[email protected]>
### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in #26735 (comment), the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in #26735 (comment), the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit a1f2ae0) Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in apache#26735 (comment), the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes apache#34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit a1f2ae0) Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in apache#26735 (comment), the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes apache#34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit a1f2ae0) Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in apache#26735 (comment), the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes apache#34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit a1f2ae0) Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
supports instance weighting in GMM
Why are the changes needed?
ML should support instance weighting
Does this PR introduce any user-facing change?
yes, a new param
weightCol
is exposedHow was this patch tested?
added testsuits