Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30144][ML][PySpark] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams #26838

Closed
wants to merge 14 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams

Why are the changes needed?

Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams to expose the training params, so user can see these params when calling extractParamMap

Does this PR introduce any user-facing change?

Yes. The MultilayerPerceptronParams such as seed, maxIter ... are available in MultilayerPerceptronClassificationModel now

How was this patch tested?

Manually tested MultilayerPerceptronClassificationModel.extractParamMap() to verify all the new params are there.

@SparkQA
Copy link

SparkQA commented Dec 10, 2019

Test build #115115 has finished for PR 26838 at commit fc2cc5a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MultilayerPerceptronClassificationModel(JavaProbabilisticClassificationModel,

@SparkQA
Copy link

SparkQA commented Dec 10, 2019

Test build #115121 has finished for PR 26838 at commit 09bca1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -328,6 +328,9 @@ object MimaExcludes {
// [SPARK-26457] Show hadoop configurations in HistoryServer environment tab
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.api.v1.ApplicationEnvironmentInfo.this"),

// [SPARK-30144][ML] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.layers"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question. Is this worth to break the API, @huaxingao ?

@@ -273,29 +273,29 @@ object MultilayerPerceptronClassifier
* Each layer has sigmoid activation function, output layer has softmax.
*
* @param uid uid
* @param layers array of layer sizes including input and output layers
* @param modelLayers array of layer sizes including input and output layers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the constructor is private, but is it necessary to change this name?

Copy link
Contributor

@zhengruifeng zhengruifeng Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is needed, since all estimators and their models should share the same params, and there is by chance a param named layers...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just removing modelLayers in model?
since the value (array of layer sizes) can be easily obtain by $(layers)

* @param weights the weights of layers
*/
@Since("1.5.0")
class MultilayerPerceptronClassificationModel private[ml] (
@Since("1.5.0") override val uid: String,
@Since("1.5.0") val layers: Array[Int],
@Since("1.5.0") val modelLayers: Array[Int],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question, why need to change this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i see. MultilayerPerceptronParams has layers too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename layers of MultilayerPerceptronParams instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but renaming layers of MultilayerPerceptronParams will also break the API.

@huaxingao
Copy link
Contributor Author

@dongjoon-hyun @srowen @viirya Thanks for the review.

Since MultilayerPerceptronParams has layers, after MultilayerPerceptronClassificationModel extends MultilayerPerceptronParams, I have to rename layers. It's not good to rename layers in MultilayerPerceptronParams because the getter/setter are public APIs.

MultilayerPerceptronClassificationModel is the only one that doesn't have the training params. All the other XXXModel extend the corresponding XXXParams. In addition, as what is said in the description of the jira https://issues.apache.org/jira/browse/SPARK-30144, user wants to have a way to track what parameters are best during a crossvalidation, so I think it makes sense to expose MultilayerPerceptronParams to MultilayerPerceptronClassificationModel

@huaxingao
Copy link
Contributor Author

cc @zhengruifeng

@zhengruifeng
Copy link
Contributor

It seems that this will be a breaking change, so maybe we need to depreicate model's layers in 2.4.x?

@zhengruifeng
Copy link
Contributor

current var layers in MultilayerPerceptronClassificationModel is just $(layers) in the estimator, so I suggest just remove it in the model, and ignore it in model load.

@huaxingao
Copy link
Contributor Author

I can remove layers from MultilayerPerceptronClassificationModel, but seems it may need more changes than current fix:
member variables numFeatures and mlpModel depend on layers at class initiation time, so these two need to be changed.
since no layers any more, writer and reader need to be changed accordingly.

So I guess probably keep the current fix?

@srowen
Copy link
Member

srowen commented Dec 13, 2019

I see, so it's not just a question of the API, but the name of the field in the serialized model? Hm, yeah. I think it's OK to change and standardize the name in 3.0, but needs to be able to read 'layers' from previous models if present. Is that the only issue here?

@zhengruifeng
Copy link
Contributor

@huaxingao I guess we can mark numFeatures & mlpModel lazy. Moreover, I think we should mark mlpModel transient, since it is needed only in transform.

@SparkQA
Copy link

SparkQA commented Dec 13, 2019

Test build #115309 has finished for PR 26838 at commit 14ce378.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

I manually saved 2.4.4 model and loaded using 3.0.0. It worked OK.

@Since("2.0.0") val weights: Vector)
extends ProbabilisticClassificationModel[Vector, MultilayerPerceptronClassificationModel]
with Serializable with MLWritable {
with MultilayerPerceptronParams with Serializable with MLWritable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this change. But do we use MultilayerPerceptronClassificationModel in executors? Like not every classification model extends Serializable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this. Seems only the tree related model extends Serializable.

@@ -347,13 +355,13 @@ object MultilayerPerceptronClassificationModel
class MultilayerPerceptronClassificationModelWriter(
instance: MultilayerPerceptronClassificationModel) extends MLWriter {

private case class Data(layers: Array[Int], weights: Vector)
private case class Data(weights: Vector)

override protected def saveImpl(path: String): Unit = {
// Save metadata and Params
DefaultParamsWriter.saveMetadata(instance, path, sc)
// Save model data: layers, weights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no layers now.

val model = new MultilayerPerceptronClassificationModel(metadata.uid, layers, weights)

val columns = sparkSession.read.parquet(dataPath).columns
val model = if (columns.length == 2) { // model prior to 3.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an example old model and read it in one test case?

val weights = data.getAs[Vector](1)
val model = new MultilayerPerceptronClassificationModel(metadata.uid, layers, weights)

val columns = sparkSession.read.parquet(dataPath).columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we can read from the data path in any case. Then we check the length of returned array to decide we should read layers + weights or only weights. We do not need to read the data twice.

* @param weights the weights of layers
*/
@Since("1.5.0")
class MultilayerPerceptronClassificationModel private[ml] (
@Since("1.5.0") override val uid: String,
@Since("1.5.0") val layers: Array[Int],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we update migration guild?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen Sean, this question is for you.

@SparkQA
Copy link

SparkQA commented Dec 15, 2019

Test build #115351 has finished for PR 26838 at commit 7590bf8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

I tagged the JIRA with release_notes and added a note.
https://issues.apache.org/jira/browse/SPARK-30144

I also put back the layers in python, but I am thinking of removing it again: In the note, I have explained that layers is changed from Array[Int] to IntArrayParam so getLayers should be used instead of layers. Seems to me, there is no need to keep layers.

@srowen
Copy link
Member

srowen commented Dec 27, 2019

I see, so you mean there is no meaningful way to keep the previous setters, because the nature of the param has changed?

@huaxingao
Copy link
Contributor Author

Yes.

@srowen
Copy link
Member

srowen commented Dec 27, 2019

OK, if this is the least change we can make while fixing the inconsistency, I'd be OK with it for 3.0. Even though these weren't technically deprecated, it's a minor API and a minor change, and still legitimate for a major release, versus keeping the inconsistency for years.

@SparkQA
Copy link

SparkQA commented Dec 27, 2019

Test build #115871 has finished for PR 26838 at commit 1833754.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 28, 2019

Test build #115872 has finished for PR 26838 at commit 1833754.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor

@huaxingao It seems that the models are saved in different dir?

@huaxingao
Copy link
Contributor Author

@zhengruifeng Yes. The models are now in mllib/src/test/resources/ml-models

@zhengruifeng
Copy link
Contributor

@huaxingao It seems that hashingTF/strIndexer models are stored in test-data.
Totally nit, I think it is better to rename them -2.4.4 instead of -pre3.0, otherwise we may think that they are generated by version Preview release of Spark 3.0

@huaxingao
Copy link
Contributor Author

@zhengruifeng
I changed the dir to mlp-2.4.4. I checked hashingTF/strIndexer models, they are in ml-models.

image

@SparkQA
Copy link

SparkQA commented Dec 30, 2019

Test build #115969 has finished for PR 26838 at commit 07267ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 30, 2019

Test build #115971 has finished for PR 26838 at commit fa1797e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -459,7 +459,7 @@ class StringIndexerSuite extends MLTest with DefaultReadWriteTest {
}

test("Load StringIndexderModel prior to Spark 3.0") {
val modelPath = testFile("test-data/strIndexerModel")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strIndexerModel-2.4.4?

@@ -89,7 +89,7 @@ class HashingTFSuite extends MLTest with DefaultReadWriteTest {
}

test("SPARK-23469: Load HashingTF prior to Spark 3.0") {
val hashingTFPath = testFile("test-data/hashingTF-pre3.0")
val hashingTFPath = testFile("ml-models/hashingTF-pre3.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashingTF-2.4.4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated. Thanks!

@SparkQA
Copy link

SparkQA commented Jan 1, 2020

Test build #115997 has finished for PR 26838 at commit 40fc5da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 3, 2020

Merged to master

@srowen srowen closed this in d32ed25 Jan 3, 2020
@huaxingao
Copy link
Contributor Author

Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants