[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug #20594

viirya · 2018-02-13T05:08:06Z

What changes were proposed in this pull request?

Problem:

Since 2.3, Bucketizer supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if inputCols and outputCol are both set, an error will be thrown.

However, when we write a Bucketizer, it looks like the default params and user-supplied params are merged during writing. All saved params are then loaded back and set on the newly created model instance. So the default outputCol param from the HasOutputCol trait is set in paramMap and becomes a user-supplied param. That makes the check of exclusive params fail.
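To make the failure mode concrete, here is a minimal sketch of it, assuming a toy stand-in for the param machinery. `ToyParams`, `newBucketizer`, `save`, and `load` are hypothetical names invented for illustration and are not Spark's API; only the behaviour described above (merge on write, set everything on load) is modelled.

```scala
// Toy model of the persistence round trip; NOT Spark's real classes.
object MergedDefaultsSketch {
  final class ToyParams {
    private var defaults = Map.empty[String, String] // default values
    private var userSet = Map.empty[String, String]  // explicitly set values

    def setDefault(name: String, value: String): this.type = { defaults += name -> value; this }
    def set(name: String, value: String): this.type = { userSet += name -> value; this }
    def isSet(name: String): Boolean = userSet.contains(name)

    // Merged view that gets written out: explicitly set values win over defaults.
    def extractParamMap: Map[String, String] = defaults ++ userSet

    // Stand-in for the exclusive-params check performed during transformation.
    def checkExclusive(): Unit =
      require(!(isSet("inputCols") && isSet("outputCol")),
        "inputCols and outputCol are exclusive, but both are set")
  }

  // A fresh instance carries a default outputCol, as with the HasOutputCol trait.
  def newBucketizer(): ToyParams = new ToyParams().setDefault("outputCol", "toy_output")

  // "Save" writes the merged map; "load" sets every saved entry explicitly.
  def save(p: ToyParams): Map[String, String] = p.extractParamMap
  def load(saved: Map[String, String]): ToyParams =
    saved.foldLeft(newBucketizer()) { case (p, (k, v)) => p.set(k, v) }

  def main(args: Array[String]): Unit = {
    val original = newBucketizer().set("inputCols", "a,b")
    original.checkExclusive() // fine: outputCol is only a default here

    val reloaded = load(save(original))
    println(reloaded.isSet("outputCol"))                          // default came back as user-set
    println(scala.util.Try(reloaded.checkExclusive()).isFailure) // so the check now rejects it
  }
}
```

In this toy model the original instance passes the check while the reloaded one does not, which is the shape of the bug reported here.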

Fix:

This changes the saving logic of Bucketizer to handle this case. This is a quick fix to make it in time for 2.3. We should consider modifying the persistence mechanism later.

Please see the discussion in the JIRA.

Note: The multi-column QuantileDiscretizer also has the same issue.

How was this patch tested?

Modified tests.

viirya · 2018-02-13T05:10:05Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+      }
+      DefaultParamsWriter.saveMetadata(instance, path, sc)
+      // Add the default param back.
+      removedOutputCol.map(instance.setDefault(instance.outputCol, _))


Although the saving logic is the same as QuantileDiscretizerWriter, I'm leaving them duplicated for now since this is a quick fix. If there is a strong preference, I can make a common class for it.

viirya · 2018-02-13T05:18:36Z

cc @jkbradley

SparkQA · 2018-02-13T08:05:01Z

Test build #87366 has finished for PR 20594 at commit 9cd7c86.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

viirya · 2018-02-13T08:39:51Z

retest this please.

WeichenXu123

I think this quick fix works fine, but I have a small question.

WeichenXu123 · 2018-02-13T09:21:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+      // value of `outputCol` if `inputCols` is set before saving.
+      // TODO: If we modify the persistence mechanism later to better handle default params,
+      // we can get rid of this.
+      var removedOutputCol: Option[String] = None


I wonder whether we need a "lock" here, because the approach is "clear the default value first, then save the model, then restore the default value".
Maybe wrapping this code block in synchronized would be safer?

I was thinking about this too. But it looks like we don't add locks in the places where we might change params in ML. I guess we assume that ML models are used from a single thread, so I've left it as it is. I will add it if others think it is required too.

yep. But I have some new thoughts, see my comments at bottom. -:)
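For readers following this thread, the "clear default, save, restore default" pattern with a lock around it could look roughly like the sketch below. This is a toy model under stated assumptions: `ToyParams` and `saveWithoutOutputColDefault` are hypothetical names, not Spark's API, and the lock only helps because every accessor in the toy class synchronizes on the same instance.

```scala
// Toy sketch; ToyParams is a hypothetical stand-in, not Spark's Params API.
object GuardedSaveSketch {
  final class ToyParams {
    private var defaults = Map("outputCol" -> "toy_output")
    private var userSet = Map.empty[String, String]

    def set(name: String, value: String): this.type = {
      synchronized { userSet += name -> value }
      this
    }
    def setDefault(name: String, value: String): this.type = {
      synchronized { defaults += name -> value }
      this
    }
    def isSet(name: String): Boolean = synchronized { userSet.contains(name) }
    def getDefault(name: String): Option[String] = synchronized { defaults.get(name) }

    // Removes a default and returns the value it had, if any.
    def clearDefault(name: String): Option[String] = synchronized {
      val old = defaults.get(name)
      defaults -= name
      old
    }
    def extractParamMap: Map[String, String] = synchronized { defaults ++ userSet }
  }

  // Hold the instance's monitor for the whole clear/save/restore sequence,
  // and restore the default in `finally` so a failed write cannot lose it.
  def saveWithoutOutputColDefault(p: ToyParams)(write: Map[String, String] => Unit): Unit =
    p.synchronized {
      val removed = if (p.isSet("inputCols")) p.clearDefault("outputCol") else None
      try write(p.extractParamMap)
      finally removed.foreach(v => p.setDefault("outputCol", v))
    }

  def main(args: Array[String]): Unit = {
    val p = new ToyParams().set("inputCols", "a,b")
    saveWithoutOutputColDefault(p)(saved => println(saved.contains("outputCol")))
    println(p.getDefault("outputCol")) // the default is back after saving
  }
}
```

Even with the lock, this pattern mutates the instance while saving, which is the side effect raised later in the conversation.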

SparkQA · 2018-02-13T12:01:50Z

Test build #87379 has finished for PR 20594 at commit 9cd7c86.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

viirya · 2018-02-13T12:33:49Z

retest this please.

mgaido91 · 2018-02-13T16:09:57Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+      // TODO: If we modify the persistence mechanism later to better handle default params,
+      // we can get rid of this.
+      var removedOutputCol: Option[String] = None
+      if (instance.isSet(instance.inputCols)) {


This can create a lot of issues with the Python API. Please see #20410 for reference. Thus I am against this fix, unless we first fix the problem I linked.

Why? I think they are orthogonal, and this shouldn't cause the issue on the Python side. Besides, as the PySpark multi-column support is not added yet (it's reverted), I think we don't hit the Python API issue. This is a quick fix to deal with the persistence bug. I'm not sure we should be blocked.

Yes, I think #20410 is not related to this PR for now. But I am afraid that in the future, when we add more functionality, potential bugs could be triggered.
But I think we don't need to care about the order in which they are merged. :)

SparkQA · 2018-02-13T16:24:55Z

Test build #87392 has finished for PR 20594 at commit 9cd7c86.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

WeichenXu123 · 2018-02-14T03:09:04Z

Thinking about it again: instead of "removing the default value and restoring it later" (which may cause some side effects), maybe a better way is to add a parameter to DefaultParamsWriter.saveMetadata that specifies which default params to skip when saving.

@mgaido91 Yes, I agree with you. Either #20410 or #18982 needs to be merged to 2.3; the related issue may cause some strange bugs.

cc @jkbradley

viirya · 2018-02-14T03:44:23Z

Because this is a quick fix, my idea is to have a surface-level patch that doesn't change the existing API. The approach of adding a parameter to DefaultParamsWriter.saveMetadata also sounds good to me, but the parameter seems useless if we get rid of this quick fix in the future.

Instead of adding a parameter, how about we pass the paramMap parameter when calling saveMetadata?
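The idea of handing the writer an already-filtered param map can be sketched as follows. This is a toy model only: the `saveMetadata` below is a hypothetical stand-in with an optional `paramMap` override, and `ToyParams` and `saveBucketizer` are invented names, so none of it should be read as Spark's actual signatures.

```scala
// Toy sketch; all names are hypothetical stand-ins, not Spark's API.
object ParamMapOverrideSketch {
  final class ToyParams {
    private val defaults = Map("outputCol" -> "toy_output")
    private var userSet = Map.empty[String, String]

    def set(name: String, value: String): this.type = { userSet += name -> value; this }
    def isSet(name: String): Boolean = userSet.contains(name)
    def getDefault(name: String): Option[String] = defaults.get(name)
    def extractParamMap: Map[String, String] = defaults ++ userSet
  }

  // Writes the override when one is given, otherwise the merged map as before.
  def saveMetadata(
      p: ToyParams,
      paramMap: Option[Map[String, String]] = None): Map[String, String] =
    paramMap.getOrElse(p.extractParamMap)

  // Only the multi-column case drops outputCol; the instance itself is never mutated.
  def saveBucketizer(p: ToyParams): Map[String, String] = {
    val withoutOutputCol =
      if (p.isSet("inputCols")) Some(p.extractParamMap - "outputCol") else None
    saveMetadata(p, paramMap = withoutOutputCol)
  }

  def main(args: Array[String]): Unit = {
    val multi = new ToyParams().set("inputCols", "a,b")
    println(saveBucketizer(multi).contains("outputCol"))  // dropped for multi-column
    val single = new ToyParams().set("inputCol", "a")
    println(saveBucketizer(single).contains("outputCol")) // kept for single-column
  }
}
```

Compared with clearing and restoring the default, this variant leaves the instance untouched during the save, so there is no window in which another reader could see it without its default.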

For #20410 and #18982, I have a question: are they regressions? They don't seem to be new issues in 2.3.

SparkQA · 2018-02-14T05:40:34Z

Test build #87440 has finished for PR 20594 at commit 3a29039.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-02-14T08:08:53Z

@WeichenXu123 @viirya as I said in the other PR, I think no default value should be persisted. #20410 and #18982 are not regressions: they are problems that have been present in all releases so far, but they are showing up more and more "thanks" to all the models having the inputCol/inputCols dualism.

Every usage of isSet on the Scala side is a problem with the Python API until one of them is merged. And this issue is the same: after persisting, every usage of isSet no longer works as intended. Therefore I'd be in favor of either not storing any default values or storing them explicitly marked as default values.

jkbradley · 2018-02-14T18:10:33Z

@mgaido91 Thanks for your thoughts. We do need to persist default values; please check out the JIRA. For fixing Python, I think the best fix will be to transfer the default & explicitly set Params to Java separately, rather than treating them all as explicitly set.
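Both suggestions in this exchange come down to keeping default values and explicitly set values apart instead of collapsing them into one map. A minimal sketch of that idea for persistence, assuming a toy param holder (`ToyParams`, `SavedParams`, `save`, and `load` are hypothetical names, not Spark's API or its on-disk format):

```scala
// Toy sketch; all names are hypothetical stand-ins, not Spark's API.
object SeparateDefaultsSketch {
  // Saved form keeps the two kinds of params in separate maps.
  final case class SavedParams(userSet: Map[String, String], defaults: Map[String, String])

  final class ToyParams {
    private var defaults = Map.empty[String, String]
    private var userSet = Map.empty[String, String]

    def setDefault(name: String, value: String): this.type = { defaults += name -> value; this }
    def set(name: String, value: String): this.type = { userSet += name -> value; this }
    def isSet(name: String): Boolean = userSet.contains(name)

    // Effective value: an explicitly set value wins, otherwise the default.
    def get(name: String): Option[String] = userSet.get(name).orElse(defaults.get(name))
    def snapshot: SavedParams = SavedParams(userSet, defaults)
  }

  def save(p: ToyParams): SavedParams = p.snapshot

  // Defaults are restored as defaults, explicit values as explicit values.
  def load(saved: SavedParams): ToyParams = {
    val p = new ToyParams
    saved.defaults.foreach { case (k, v) => p.setDefault(k, v) }
    saved.userSet.foreach { case (k, v) => p.set(k, v) }
    p
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyParams().setDefault("outputCol", "toy_output").set("inputCols", "a,b")
    val reloaded = load(save(original))
    println(reloaded.isSet("outputCol")) // still only a default after the round trip
    println(reloaded.get("outputCol"))   // but the saved default value is preserved
  }
}
```

With this shape the saved default value survives the round trip while isSet keeps its meaning, at the cost of a change to what gets written out.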

jkbradley

Thanks for the updated fix @viirya ! This approach looks great to me. I just had 2 tiny comments for Bucketizer, and they also apply to QuantileDiscretizer.

jkbradley · 2018-02-14T17:41:25Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

@@ -213,6 +217,9 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
  override def copy(extra: ParamMap): Bucketizer = {
    defaultCopy[Bucketizer](extra).setParent(parent)
  }
+
+  @Since("2.3.0")


No need for this since annotation; the signature isn't changed in 2.3.0

jkbradley · 2018-02-14T18:10:09Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+      // we can get rid of this.
+      var paramWithoutOutputCol: Option[JValue] = None
+      if (instance.isSet(instance.inputCols)) {
+        val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]


I don't think this asInstanceOf cast is necessary.

mgaido91 · 2018-02-14T19:49:15Z

@jkbradley thanks for your answer. I think that the 3rd approach you suggested on the JIRA is the right way to go as a long-term plan. Personally, I disagree with you when you say that we should keep the default values. I think that changing a default value doesn't happen often, and if it happens it is not a problem: if users care about the value of a parameter, they set it. But this is just my opinion.

viirya · 2018-02-15T00:15:16Z

Thanks @jkbradley ! I've updated this based on your comments.

jkbradley · 2018-02-15T01:19:30Z

Thanks! LGTM pending tests

SparkQA · 2018-02-15T01:20:32Z

Test build #87458 has finished for PR 20594 at commit 174c114.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-02-15T17:42:18Z

Merging with master and branch-2.3

jkbradley · 2018-02-15T17:59:51Z

Well, I succeeded in merging this with master, but the merge script isn't working for branch-2.3. I'll wait to see if the read-only repo syncs and fixes the issue.

jkbradley · 2018-02-15T19:23:42Z

Success! Merged to branch-2.3 too.

## What changes were proposed in this pull request?

#### Problem:

Since 2.3, `Bucketizer` supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if `inputCols` and `outputCol` are both set, an error will be thrown.

However, when we write `Bucketizer`, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default `outputCol` param in `HasOutputCol` trait will be set in `paramMap` and become an user-supplied param. That makes the check of exclusive params failed.

#### Fix:

This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later.

Please see the discussion in the JIRA.

Note: The multi-column `QuantileDiscretizer` also has the same issue.

## How was this patch tested?

Modified tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20594 from viirya/SPARK-23377-2.

(cherry picked from commit db45daa)
Signed-off-by: Joseph K. Bradley <[email protected]>

Remove outputCol default value if inputCols is set.

9cd7c86

viirya mentioned this pull request Feb 13, 2018

[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug #20566

Closed

Choose to skip default of outputCol when saving metadata.

3a29039

Address comments.

174c114

asfgit closed this in db45daa Feb 15, 2018

viirya deleted the SPARK-23377-2 branch December 27, 2023 18:21
