[SPARK-30144][ML][PySpark] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams #26838

huaxingao · 2019-12-10T17:52:23Z

What changes were proposed in this pull request?

Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams

Why are the changes needed?

Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams to expose the training params, so that users can see these params when calling extractParamMap.

Does this PR introduce any user-facing change?

Yes. MultilayerPerceptronParams such as seed, maxIter, ... are now available in MultilayerPerceptronClassificationModel.

How was this patch tested?

Manually tested MultilayerPerceptronClassificationModel.extractParamMap() to verify all the new params are there.

SparkQA · 2019-12-10T18:17:00Z

Test build #115115 has finished for PR 26838 at commit fc2cc5a.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MultilayerPerceptronClassificationModel(JavaProbabilisticClassificationModel,

SparkQA · 2019-12-10T22:00:22Z

Test build #115121 has finished for PR 26838 at commit 09bca1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-12-11T17:31:42Z

project/MimaExcludes.scala

@@ -328,6 +328,9 @@ object MimaExcludes {
    // [SPARK-26457] Show hadoop configurations in HistoryServer environment tab
    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.api.v1.ApplicationEnvironmentInfo.this"),

+    // [SPARK-30144][ML] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams
+    ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.layers"),


Just a question. Is this worth breaking the API, @huaxingao?

srowen · 2019-12-11T18:06:49Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

@@ -273,29 +273,29 @@ object MultilayerPerceptronClassifier
 * Each layer has sigmoid activation function, output layer has softmax.
 *
 * @param uid uid
- * @param layers array of layer sizes including input and output layers
+ * @param modelLayers array of layer sizes including input and output layers


I know the constructor is private, but is it necessary to change this name?

I think it is needed, since all estimators and their models should share the same params, and there is by chance a param named layers...

What about just removing modelLayers from the model, since the value (array of layer sizes) can easily be obtained via $(layers)?

viirya · 2019-12-12T01:25:34Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

 * @param weights the weights of layers
 */
 @Since("1.5.0")
 class MultilayerPerceptronClassificationModel private[ml] (
    @Since("1.5.0") override val uid: String,
-    @Since("1.5.0") val layers: Array[Int],
+    @Since("1.5.0") val modelLayers: Array[Int],


Same question: why does this need to change?

Oh, I see. MultilayerPerceptronParams has layers too?

Can we rename layers of MultilayerPerceptronParams instead?

But renaming layers of MultilayerPerceptronParams would also break the API.

huaxingao · 2019-12-12T06:02:34Z

@dongjoon-hyun @srowen @viirya Thanks for the review.

Since MultilayerPerceptronParams has layers, after MultilayerPerceptronClassificationModel extends MultilayerPerceptronParams, I have to rename layers. It's not good to rename layers in MultilayerPerceptronParams because the getter/setter are public APIs.

MultilayerPerceptronClassificationModel is the only one that doesn't have the training params. All the other XXXModel classes extend the corresponding XXXParams. In addition, as described in the JIRA https://issues.apache.org/jira/browse/SPARK-30144, users want a way to track which parameters are best during cross-validation, so I think it makes sense to expose MultilayerPerceptronParams on MultilayerPerceptronClassificationModel.
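To make the "model shares the estimator's params" idea concrete, here is a minimal plain-Python sketch. It does not use Spark: MLPParams, ToyClassifier, ToyModel, and extract_param_map are hypothetical stand-ins for the shared params trait, the estimator, the fitted model, and extractParamMap, and the default values are made up for illustration.

```python
class MLPParams:
    """Hypothetical stand-in for a params trait shared by estimator and model."""

    # Illustrative defaults only; not taken from Spark.
    _defaults = {"layers": None, "maxIter": 100, "seed": 0}

    def __init__(self, **params):
        unknown = set(params) - set(self._defaults)
        if unknown:
            raise ValueError(f"unknown params: {sorted(unknown)}")
        # User-supplied values override the defaults.
        self._params = {**self._defaults, **params}

    def extract_param_map(self):
        # Return a copy so callers cannot mutate the stored params.
        return dict(self._params)


class ToyModel(MLPParams):
    """Fitted model: carries weights plus the params it was trained with."""

    def __init__(self, weights, **params):
        super().__init__(**params)
        self.weights = weights


class ToyClassifier(MLPParams):
    """Estimator: copies its own params onto the model it produces."""

    def fit(self):
        # Placeholder weights; a real estimator would train here.
        return ToyModel(weights=[0.0], **self._params)


model = ToyClassifier(layers=[4, 5, 3], maxIter=50, seed=42).fit()
print(model.extract_param_map())
```

Because both classes inherit from the same params base, the training settings stay inspectable on the fitted model, which is what makes comparing candidates after cross-validation possible.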

huaxingao · 2019-12-12T06:02:56Z

cc @zhengruifeng

zhengruifeng · 2019-12-12T09:00:04Z

It seems that this will be a breaking change, so maybe we need to deprecate the model's layers in 2.4.x?

zhengruifeng · 2019-12-12T09:02:21Z

The current layers field in MultilayerPerceptronClassificationModel is just $(layers) in the estimator, so I suggest simply removing it from the model and ignoring it on model load.

huaxingao · 2019-12-12T21:43:52Z

I can remove layers from MultilayerPerceptronClassificationModel, but it seems this may need more changes than the current fix:
The member variables numFeatures and mlpModel depend on layers at class initialization time, so these two need to be changed.
Since layers would no longer exist, the writer and reader need to be changed accordingly.

So I guess we should probably keep the current fix?

srowen · 2019-12-13T00:46:10Z

I see, so it's not just a question of the API, but the name of the field in the serialized model? Hm, yeah. I think it's OK to change and standardize the name in 3.0, but needs to be able to read 'layers' from previous models if present. Is that the only issue here?

zhengruifeng · 2019-12-13T02:00:19Z

@huaxingao I guess we can mark numFeatures & mlpModel lazy. Moreover, I think we should mark mlpModel transient, since it is needed only in transform.
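A rough plain-Python analogue of the lazy/transient suggestion, assuming nothing about the actual Scala change: ToyModel, num_features, and topology are hypothetical names, with topology standing in for a derived structure like mlpModel. The derived value is built on first access and dropped from the serialized state.

```python
import copy
from functools import cached_property


class ToyModel:
    """Hypothetical model whose derived fields depend on `layers`."""

    def __init__(self, layers, weights):
        self.layers = layers
        self.weights = weights

    @property
    def num_features(self):
        # Computed on demand, so it never runs before `layers` is set.
        return self.layers[0]

    @cached_property
    def topology(self):
        # "Lazy": built on first access, then cached in the instance __dict__.
        return list(zip(self.layers[:-1], self.layers[1:]))

    def __getstate__(self):
        # "Transient": leave the cached value out of the serialized state.
        state = self.__dict__.copy()
        state.pop("topology", None)
        return state


model = ToyModel(layers=[4, 5, 3], weights=[0.1, 0.2])
model.topology            # first access builds and caches the value
clone = copy.copy(model)  # copying consults __getstate__, as pickling would
```

The clone starts without the cached value and rebuilds it the first time topology is read, so only layers and weights travel with the object.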

SparkQA · 2019-12-13T18:39:07Z

Test build #115309 has finished for PR 26838 at commit 14ce378.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2019-12-13T19:03:06Z

I manually saved a model with 2.4.4 and loaded it with 3.0.0. It worked OK.

viirya · 2019-12-13T22:46:53Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

    @Since("2.0.0") val weights: Vector)
  extends ProbabilisticClassificationModel[Vector, MultilayerPerceptronClassificationModel]
-  with Serializable with MLWritable {
+  with MultilayerPerceptronParams with Serializable with MLWritable {


Not related to this change, but do we use MultilayerPerceptronClassificationModel in executors? Not every classification model extends Serializable.

I am not sure about this. It seems only the tree-related models extend Serializable.

viirya · 2019-12-13T22:49:31Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

@@ -347,13 +355,13 @@ object MultilayerPerceptronClassificationModel
  class MultilayerPerceptronClassificationModelWriter(
      instance: MultilayerPerceptronClassificationModel) extends MLWriter {

-    private case class Data(layers: Array[Int], weights: Vector)
+    private case class Data(weights: Vector)

    override protected def saveImpl(path: String): Unit = {
      // Save metadata and Params
      DefaultParamsWriter.saveMetadata(instance, path, sc)
      // Save model data: layers, weights


no layers now.

viirya · 2019-12-13T22:53:13Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

-      val model = new MultilayerPerceptronClassificationModel(metadata.uid, layers, weights)
-
+      val columns = sparkSession.read.parquet(dataPath).columns
+      val model = if (columns.length == 2) { // model prior to 3.0.0


Should we have an example old model and read it in one test case?

viirya · 2019-12-13T22:57:15Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

-      val weights = data.getAs[Vector](1)
-      val model = new MultilayerPerceptronClassificationModel(metadata.uid, layers, weights)
-
+      val columns = sparkSession.read.parquet(dataPath).columns


Hmm, I think we can read from the data path in any case. Then we check the length of the returned array to decide whether to read layers + weights or only weights. We do not need to read the data twice.
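The read-once idea can be sketched in plain Python as follows. This is not the Spark reader: load_model, LoadedModel, and saved_params are hypothetical names, the row is modelled as a plain tuple, and the sketch assumes that in the new format the layer sizes come back from the saved params metadata rather than from the data row.

```python
from typing import NamedTuple, Sequence


class LoadedModel(NamedTuple):
    layers: list
    weights: list


def load_model(row: Sequence, saved_params: dict) -> LoadedModel:
    """Build a model from a single data row that was read exactly once.

    Two columns  -> old format: (layers, weights)
    One column   -> new format: (weights,), layers come from saved params
    """
    if len(row) == 2:
        layers, weights = row
    elif len(row) == 1:
        (weights,) = row
        layers = saved_params["layers"]
    else:
        raise ValueError(f"unexpected number of columns: {len(row)}")
    return LoadedModel(list(layers), list(weights))


# Old-format row carries its own layers; new-format row relies on saved params.
old = load_model(([4, 5, 3], [0.1, 0.2]), saved_params={})
new = load_model(([0.1, 0.2],), saved_params={"layers": [4, 5, 3]})
```

Branching on the shape of the row that is already in hand avoids a separate pass over the data just to inspect its columns.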

viirya · 2019-12-13T22:59:08Z

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

 * @param weights the weights of layers
 */
 @Since("1.5.0")
 class MultilayerPerceptronClassificationModel private[ml] (
    @Since("1.5.0") override val uid: String,
-    @Since("1.5.0") val layers: Array[Int],


Shall we update the migration guide?

@srowen Sean, this question is for you.

SparkQA · 2019-12-15T08:05:01Z

Test build #115351 has finished for PR 26838 at commit 7590bf8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2019-12-27T01:51:37Z

I tagged the JIRA with release_notes and added a note.
https://issues.apache.org/jira/browse/SPARK-30144

I also put layers back in Python, but I am thinking of removing it again. In the note, I have explained that layers changed from Array[Int] to IntArrayParam, so getLayers should be used instead of layers. It seems to me there is no need to keep layers.

srowen · 2019-12-27T02:18:55Z

I see, so you mean there is no meaningful way to keep the previous setters, because the nature of the param has changed?

huaxingao · 2019-12-27T02:27:08Z

Yes.

srowen · 2019-12-27T16:31:20Z

OK, if this is the least change we can make while fixing the inconsistency, I'd be OK with it for 3.0. Even though these weren't technically deprecated, it's a minor API and a minor change, and still legitimate for a major release, versus keeping the inconsistency for years.

SparkQA · 2019-12-27T21:27:16Z

Test build #115871 has finished for PR 26838 at commit 1833754.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2019-12-27T22:22:24Z

retest this please

SparkQA · 2019-12-28T00:46:58Z

Test build #115872 has finished for PR 26838 at commit 1833754.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-12-30T02:41:47Z

@huaxingao It seems that the models are saved in a different directory?

huaxingao · 2019-12-30T06:04:01Z

@zhengruifeng Yes. The models are now in mllib/src/test/resources/ml-models

zhengruifeng · 2019-12-30T11:15:25Z

@huaxingao It seems that hashingTF/strIndexer models are stored in test-data.
A total nit: I think it is better to name them -2.4.4 instead of -pre3.0; otherwise readers may think they were generated by the Spark 3.0 preview release.

huaxingao · 2019-12-30T19:25:35Z

@zhengruifeng
I changed the dir to mlp-2.4.4. I checked hashingTF/strIndexer models, they are in ml-models.

SparkQA · 2019-12-30T21:03:05Z

Test build #115969 has finished for PR 26838 at commit 07267ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-30T23:42:39Z

Test build #115971 has finished for PR 26838 at commit fa1797e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-12-31T06:27:03Z

mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala

@@ -459,7 +459,7 @@ class StringIndexerSuite extends MLTest with DefaultReadWriteTest {
  }

  test("Load StringIndexderModel prior to Spark 3.0") {
-    val modelPath = testFile("test-data/strIndexerModel")


strIndexerModel-2.4.4?

zhengruifeng · 2019-12-31T06:27:15Z

mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala

@@ -89,7 +89,7 @@ class HashingTFSuite extends MLTest with DefaultReadWriteTest {
  }

  test("SPARK-23469: Load HashingTF prior to Spark 3.0") {
-    val hashingTFPath = testFile("test-data/hashingTF-pre3.0")
+    val hashingTFPath = testFile("ml-models/hashingTF-pre3.0")


hashingTF-2.4.4

updated. Thanks!

SparkQA · 2020-01-01T01:46:45Z

Test build #115997 has finished for PR 26838 at commit 40fc5da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-01-03T18:01:17Z

Merged to master

huaxingao · 2020-01-03T18:19:28Z

Thank you all!

[SPARK-30144][ML][PySpark] Make MultilayerPerceptronClassificationMod…

fc2cc5a

…el extend MultilayerPerceptronParams

fix mima failure

09bca1e

dongjoon-hyun added ML PYSPARK labels Dec 11, 2019

dongjoon-hyun reviewed Dec 11, 2019

srowen reviewed Dec 11, 2019

viirya reviewed Dec 12, 2019

dongjoon-hyun added MLLIB and removed ML PYSPARK labels Dec 12, 2019

address comments

14ce378

srowen reviewed Dec 13, 2019

mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala

viirya reviewed Dec 13, 2019

address comments

7590bf8

remove layers

1833754

zhengruifeng added ML PYSPARK and removed MLLIB labels Dec 30, 2019

change dir from mlp-pre3.0 to mlp-2.4.4

07267ff

fix test failure

fa1797e

zhengruifeng reviewed Dec 31, 2019

change model path to 2.4.4

40fc5da

zhengruifeng approved these changes Jan 3, 2020

srowen closed this in d32ed25 Jan 3, 2020

huaxingao deleted the spark-30144 branch January 3, 2020 18:19

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks
