-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23455][ML] Default Params in ML should be saved separately in metadata #20633
Conversation
Test build #87521 has finished for PR 20633 at commit
|
cc @jkbradley This is to save default params separately in metadata file. Please help review after 2.3. Thanks! |
also cc @MLnick @WeichenXu123 for review. |
*/ | ||
def getParamValue(paramName: String): JValue = { | ||
def getParamValue(paramName: String, isDefaultParam: Boolean = false): JValue = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the logic be "lookup params first, if not exist then lookup defaultParams" ?
If so, we don't need the isDefaultParam
param.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will change this.
@viirya Have you run backward compatibility test ? e.g., saving estimator/models via master version spark, and load estimator/models via this PR version spark. To check whether it works fine and all params get saved and restored correctly. |
@WeichenXu123 I've added unit test in Sounds like the backward compatibility test you suggested should be checked manually. I will test it. Thanks! |
Test build #87996 has finished for PR 20633 at commit
|
retest this please. |
@WeichenXu123 Thanks for comment! I've run the backward compatibility test locally against |
LGTM. Thanks! @jkbradley |
Test build #87998 has finished for PR 20633 at commit
|
cc @dbtsai if you have time to look at this too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I made a review pass, and it looks like a nice setup.
@@ -791,7 +791,7 @@ trait Params extends Identifiable with Serializable { | |||
* this method gets called. | |||
* @param value the default value | |||
*/ | |||
protected final def setDefault[T](param: Param[T], value: T): this.type = { | |||
private[ml] final def setDefault[T](param: Param[T], value: T): this.type = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should leave this as protected. It's important that 3rd-party libraries be able to extend MLlib APIs, and this is one they need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we need to call setDefault
in Metadata.setParams
, I can't leave it as protected
. But I think it is important to keep the extensibility. And I think we can't make it as public
too. I add object Params
and use it to help us update default param of a Params
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense; I like your solution.
val basicMetadata = ("class" -> cls) ~ | ||
("timestamp" -> System.currentTimeMillis()) ~ | ||
("sparkVersion" -> sc.version) ~ | ||
("uid" -> uid) ~ | ||
("defaultParamMap" -> jsonDefaultParams) ~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: How about putting this below paramMap since that's nicer for viewing the JSON?
def getAndSetParams( | ||
instance: Params, | ||
skipParams: Option[List[String]] = None): Unit = { | ||
setParams(instance, false, skipParams) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: It's nice to pass boolean args by name
} | ||
} | ||
// For metadata file prior to Spark 2.4, there is no default section. | ||
case JNothing if isDefault && (major == 2 && minor < 4 || major < 2) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic would be simpler if this check were put in the getAndSetParams method, which could just skip calling setParams(instance, true, skipParams) for Spark 2.3-.
} | ||
} | ||
|
||
test("User-suppiled value for default param should be kept after persistence") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suppiled -> supplied
assert(loadedMyParams.get(myParams.intParamWithDefault).get == 100) | ||
} | ||
|
||
test("Read metadata without default field prior to 2.4") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like this setup.
Test build #88502 has finished for PR 20633 at commit
|
retest this please. |
Test build #88510 has finished for PR 20633 at commit
|
@jkbradley Thanks for your comments! I've addressed them. Please review it again. Thank you. |
ping @jkbradley |
@@ -905,6 +905,15 @@ trait Params extends Identifiable with Serializable { | |||
} | |||
} | |||
|
|||
object Params { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we make this object package private (for cleaner docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Done.
@@ -791,7 +791,7 @@ trait Params extends Identifiable with Serializable { | |||
* this method gets called. | |||
* @param value the default value | |||
*/ | |||
protected final def setDefault[T](param: Param[T], value: T): this.type = { | |||
private[ml] final def setDefault[T](param: Param[T], value: T): this.type = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense; I like your solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow response!
|
||
// For metadata file prior to Spark 2.4, there is no default section. | ||
val (major, minor) = VersionUtils.majorMinorVersion(sparkVersion) | ||
if (major >= 2 && minor >= 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be:
if (major > 2 || (major == 2 && minor >= 4)) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're correct. I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Test build #88880 has finished for PR 20633 at commit
|
Can you address the conflicts? Gonna start to review it soon. Thanks. |
@dbtsai Thanks! I've solved the conflicts. |
Test build #89421 has finished for PR 20633 at commit
|
ping @jkbradley |
Sorry for the pause in review. @dbtsai I'm going to merge this since I'm worried it will collect more conflicts, but let's discuss more if needed. @viirya We'll need to update Python's DefaultParamsReader as well for Spark 2.4 in order to keep it in sync with Scala/Java. R thankfully should not require anything since it only has wrappers. I'll make & link a JIRA: https://issues.apache.org/jira/browse/SPARK-24058 Will you have time to work on that? Thanks @viirya ! |
@jkbradley Thanks! I'll work on SPARK-24058. |
Test build #4156 has finished for PR 20633 at commit
|
What changes were proposed in this pull request?
We save ML's user-supplied params and default params as one entity in metadata. During loading the saved models, we set all the loaded params into created ML model instances as user-supplied params.
It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading.
The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata.
For backward compatibility, when loading metadata, if it is a metadata file from previous Spark, we shouldn't raise error if we can't find the default param field.
How was this patch tested?
Pass existing tests and added tests.