-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug #20594
Conversation
} | ||
DefaultParamsWriter.saveMetadata(instance, path, sc) | ||
// Add the default param back. | ||
removedOutputCol.map(instance.setDefault(instance.outputCol, _)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the saving logic is the same as QuantileDiscretizerWriter
, I leave them as duplicate for now since this is a quick fix. If there is strong preference, I can make a common class for it.
cc @jkbradley |
Test build #87366 has finished for PR 20594 at commit
|
retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This quick fix works fine I think, but I leave a small question.
// value of `outputCol` if `inputCols` is set before saving. | ||
// TODO: If we modify the persistence mechanism later to better handle default params, | ||
// we can get rid of this. | ||
var removedOutputCol: Option[String] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt whether it need a "lock" here, because it is the way "clear default value first, then save model, then restore default value".
Maybe wrapping the code block here by synchronized
is safer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this too. But looks like we don't add lock to the places we might change params in ML. I guess that we assume the usage of ML models is single-threaded. So I leave it as this. Will add it if others think this is required too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. But I have some new thoughts, see my comments at bottom. -:)
Test build #87379 has finished for PR 20594 at commit
|
retest this please. |
// TODO: If we modify the persistence mechanism later to better handle default params, | ||
// we can get rid of this. | ||
var removedOutputCol: Option[String] = None | ||
if (instance.isSet(instance.inputCols)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can create a lot of issues with the Python API. Please see #20410 for reference. Thus I am against this fix, unless we first fix the problem I linked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I think they are orthogonal and this shouldn't cause the issue in Python side. Besides, as the PySpark multi-column support is not added yet (it's reverted), I think we don't hit the Python API issue. This is a quick fix to deal with the persistence bug. I'm not sure we should be blocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think #20410 is not related to this PR for now. But I am afraid in the future, when we add more functionality, potential bugs will possible to be triggered.
But I think we don't need to care the order of them to be merged. :)
Test build #87392 has finished for PR 20594 at commit
|
I thought again, instead of "removing default value and restore it again later (which may cause some side effects)", maybe the better way is, add a parameter to @mgaido91 Yes I agree with you. Either #20410 or #18982 need to be merged to 2.3, the related issue is possible to cause some strange bugs. cc @jkbradley |
Because this is a quick fix, my idea is to have a surface patch that doesn't change existing API. The approach of adding parameter to Instead of adding parameter, how about we pass the For #20410 and #18982, I have a question, are they regression? Seems to me they are not new issues to 2.3. |
Test build #87440 has finished for PR 20594 at commit
|
@WeichenXu123 @viirya as I said in the other PR, I think no default value should be persisted. #20410 and #18982 are not regression: they are problem which have been present in all the release so far, but they are showing up more and more "thanks" to all the models having the dualism inputCol/inputCols. Every usage of |
@mgaido91 Thanks for your thoughts. We do need to persist default values; please check out the JIRA. For fixing Python, I think the best fix will be to transfer the default & explicitly set Params to Java separately, rather than treating them all as explicitly set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updated fix @viirya ! This approach looks great to me. I just had 2 tiny comments for Bucketizer, and they also apply to QuantileDiscretizer.
@@ -213,6 +217,9 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |||
override def copy(extra: ParamMap): Bucketizer = { | |||
defaultCopy[Bucketizer](extra).setParent(parent) | |||
} | |||
|
|||
@Since("2.3.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this since annotation; the signature isn't changed in 2.3.0
// we can get rid of this. | ||
var paramWithoutOutputCol: Option[JValue] = None | ||
if (instance.isSet(instance.inputCols)) { | ||
val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this asInstanceOf cast is necessary.
@jkbradley thanks for your answer. I think that the 3rd approach you suggested on the JIRA is the right way to go on a long term plan. Personally, I disagree with you when you say that we should keep the default values. I think that changing a default value doesn't happen often and if it happens it is not a problem: if the user cares about the value of a parameter, he sets it. But this is just my opinion. |
Thanks @jkbradley ! I've updated this based on your comments. |
Thanks! LGTM pending tests |
Test build #87458 has finished for PR 20594 at commit
|
Merging with master and branch-2.3 |
Well I succeeded in merging this with master, but the merge script isn't working for branch-2.3. I wait to see if the read-only repo syncs and fixes the issue. |
Success! Merged to branch-2.3 too. |
## What changes were proposed in this pull request? #### Problem: Since 2.3, `Bucketizer` supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if `inputCols` and `outputCol` are both set, an error will be thrown. However, when we write `Bucketizer`, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default `outputCol` param in `HasOutputCol` trait will be set in `paramMap` and become an user-supplied param. That makes the check of exclusive params failed. #### Fix: This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later. Please see the discussion in the JIRA. Note: The multi-column `QuantileDiscretizer` also has the same issue. ## How was this patch tested? Modified tests. Author: Liang-Chi Hsieh <[email protected]> Closes #20594 from viirya/SPARK-23377-2. (cherry picked from commit db45daa) Signed-off-by: Joseph K. Bradley <[email protected]>
What changes were proposed in this pull request?
Problem:
Since 2.3,
Bucketizer
supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., ifinputCols
andoutputCol
are both set, an error will be thrown.However, when we write
Bucketizer
, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the defaultoutputCol
param inHasOutputCol
trait will be set inparamMap
and become an user-supplied param. That makes the check of exclusive params failed.Fix:
This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later.
Please see the discussion in the JIRA.
Note: The multi-column
QuantileDiscretizer
also has the same issue.How was this patch tested?
Modified tests.