Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug #20594

Closed
wants to merge 3 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 13, 2018

What changes were proposed in this pull request?

Problem:

Since 2.3, Bucketizer supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if inputCols and outputCol are both set, an error will be thrown.

However, when we write Bucketizer, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default outputCol param in HasOutputCol trait will be set in paramMap and become an user-supplied param. That makes the check of exclusive params failed.

Fix:

This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later.

Please see the discussion in the JIRA.

Note: The multi-column QuantileDiscretizer also has the same issue.

How was this patch tested?

Modified tests.

}
DefaultParamsWriter.saveMetadata(instance, path, sc)
// Add the default param back.
removedOutputCol.map(instance.setDefault(instance.outputCol, _))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the saving logic is the same as QuantileDiscretizerWriter, I leave them as duplicate for now since this is a quick fix. If there is strong preference, I can make a common class for it.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2018

cc @jkbradley

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87366 has finished for PR 20594 at commit 9cd7c86.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

@viirya
Copy link
Member Author

viirya commented Feb 13, 2018

retest this please.

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This quick fix works fine I think, but I leave a small question.

// value of `outputCol` if `inputCols` is set before saving.
// TODO: If we modify the persistence mechanism later to better handle default params,
// we can get rid of this.
var removedOutputCol: Option[String] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt whether it need a "lock" here, because it is the way "clear default value first, then save model, then restore default value".
Maybe wrapping the code block here by synchronized is safer ?

Copy link
Member Author

@viirya viirya Feb 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this too. But looks like we don't add lock to the places we might change params in ML. I guess that we assume the usage of ML models is single-threaded. So I leave it as this. Will add it if others think this is required too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. But I have some new thoughts, see my comments at bottom. -:)

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87379 has finished for PR 20594 at commit 9cd7c86.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

@viirya
Copy link
Member Author

viirya commented Feb 13, 2018

retest this please.

// TODO: If we modify the persistence mechanism later to better handle default params,
// we can get rid of this.
var removedOutputCol: Option[String] = None
if (instance.isSet(instance.inputCols)) {
Copy link
Contributor

@mgaido91 mgaido91 Feb 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can create a lot of issues with the Python API. Please see #20410 for reference. Thus I am against this fix, unless we first fix the problem I linked

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think they are orthogonal and this shouldn't cause the issue in Python side. Besides, as the PySpark multi-column support is not added yet (it's reverted), I think we don't hit the Python API issue. This is a quick fix to deal with the persistence bug. I'm not sure we should be blocked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think #20410 is not related to this PR for now. But I am afraid in the future, when we add more functionality, potential bugs will possible to be triggered.
But I think we don't need to care the order of them to be merged. :)

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87392 has finished for PR 20594 at commit 9cd7c86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter

@WeichenXu123
Copy link
Contributor

I thought again, instead of "removing default value and restore it again later (which may cause some side effects)", maybe the better way is, add a parameter to DefaultParamsWriter.saveMetadata, specify which default param need to skip when saving.

@mgaido91 Yes I agree with you. Either #20410 or #18982 need to be merged to 2.3, the related issue is possible to cause some strange bugs.

cc @jkbradley

@viirya
Copy link
Member Author

viirya commented Feb 14, 2018

Because this is a quick fix, my idea is to have a surface patch that doesn't change existing API. The approach of adding parameter to DefaultParamsWriter.saveMetadata also sounds good to me, but the parameter seems useless if we get rid of this quick fix in the future.

Instead of adding parameter, how about we pass the paramMap parameter when calling saveMetadata?

For #20410 and #18982, I have a question, are they regression? Seems to me they are not new issues to 2.3.

@SparkQA
Copy link

SparkQA commented Feb 14, 2018

Test build #87440 has finished for PR 20594 at commit 3a29039.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor

@WeichenXu123 @viirya as I said in the other PR, I think no default value should be persisted. #20410 and #18982 are not regression: they are problem which have been present in all the release so far, but they are showing up more and more "thanks" to all the models having the dualism inputCol/inputCols.

Every usage of isSeton Scala side is a problem with the Python API until either one of them will be merged. And this issue is the same. After persisting, every usage of isSet is not working as intended. Therefore I'd be for either not to store any default value or store them writing explicitly that they are default values.

@jkbradley
Copy link
Member

@mgaido91 Thanks for your thoughts. We do need to persist default values; please check out the JIRA. For fixing Python, I think the best fix will be to transfer the default & explicitly set Params to Java separately, rather than treating them all as explicitly set.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated fix @viirya ! This approach looks great to me. I just had 2 tiny comments for Bucketizer, and they also apply to QuantileDiscretizer.

@@ -213,6 +217,9 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
override def copy(extra: ParamMap): Bucketizer = {
defaultCopy[Bucketizer](extra).setParent(parent)
}

@Since("2.3.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this since annotation; the signature isn't changed in 2.3.0

// we can get rid of this.
var paramWithoutOutputCol: Option[JValue] = None
if (instance.isSet(instance.inputCols)) {
val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this asInstanceOf cast is necessary.

@mgaido91
Copy link
Contributor

@jkbradley thanks for your answer. I think that the 3rd approach you suggested on the JIRA is the right way to go on a long term plan. Personally, I disagree with you when you say that we should keep the default values. I think that changing a default value doesn't happen often and if it happens it is not a problem: if the user cares about the value of a parameter, he sets it. But this is just my opinion.

@viirya
Copy link
Member Author

viirya commented Feb 15, 2018

Thanks @jkbradley ! I've updated this based on your comments.

@jkbradley
Copy link
Member

Thanks! LGTM pending tests

@SparkQA
Copy link

SparkQA commented Feb 15, 2018

Test build #87458 has finished for PR 20594 at commit 174c114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merging with master and branch-2.3

@asfgit asfgit closed this in db45daa Feb 15, 2018
@jkbradley
Copy link
Member

Well I succeeded in merging this with master, but the merge script isn't working for branch-2.3. I wait to see if the read-only repo syncs and fixes the issue.

@jkbradley
Copy link
Member

Success! Merged to branch-2.3 too.

asfgit pushed a commit that referenced this pull request Feb 15, 2018
## What changes were proposed in this pull request?

#### Problem:

Since 2.3, `Bucketizer` supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if `inputCols` and `outputCol` are both set, an error will be thrown.

However, when we write `Bucketizer`, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default `outputCol` param in `HasOutputCol` trait will be set in `paramMap` and become an user-supplied param. That makes the check of exclusive params failed.

#### Fix:

This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later.

Please see the discussion in the JIRA.

Note: The multi-column `QuantileDiscretizer` also has the same issue.

## How was this patch tested?

Modified tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20594 from viirya/SPARK-23377-2.

(cherry picked from commit db45daa)
Signed-off-by: Joseph K. Bradley <[email protected]>
@viirya viirya deleted the SPARK-23377-2 branch December 27, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants