[SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON #25776
Conversation
Test build #110524 has finished for PR 25776 at commit
Classes are indexed {0, 1, ..., numClasses - 1}.
To be mixed in with class:`pyspark.ml.JavaModel`
"""

def setRawPredictionCol(self, value):
ditto
""" | ||
return self._set(probabilityCol=value) | ||
|
||
def setThresholds(self, value): |
ditto
@@ -343,7 +343,8 @@ def test_java_params(self):
    for module in modules:
        for name, cls in inspect.getmembers(module, inspect.isclass):
            if not name.endswith('Model') and not name.endswith('Params') \
                    and issubclass(cls, JavaParams) and not inspect.isabstract(cls):
                    and issubclass(cls, JavaParams) and not inspect.isabstract(cls) \
                    and not name.startswith('Java'):
What will happen if we do not change this place?
Without the above change, pyspark.ml.tests.test_param.DefaultValuesTests will fail, because _java_obj is never set:
ERROR: test_java_params (pyspark.ml.tests.test_param.DefaultValuesTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/hgao/spark081119/spark/python/pyspark/ml/tests/test_param.py", line 348, in test_java_params
check_params(self, cls(), check_params_exist=False)
File "/Users/hgao/spark081119/spark/python/pyspark/testing/mlutils.py", line 40, in check_params
java_stage = py_stage._to_java()
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 222, in _to_java
self._transfer_params_to_java()
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 145, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._defaultParamMap[param])
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 131, in _make_java_param_pair
java_param = self._java_obj.getParam(param.name)
AttributeError: 'NoneType' object has no attribute 'getParam'
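The failure can be reproduced in miniature with stand-in classes. The `JavaParams`, `JavaPredictor`, and `LinearSVC` classes below are plain-Python sketches, not the real PySpark classes: the point is only that the new shared `Java*` base classes never create a backing JVM object, so the test filter has to skip them.

```python
import inspect

class JavaParams:                     # stand-in for pyspark.ml.wrapper.JavaParams
    _java_obj = None                  # no backing JVM object by default

class JavaPredictor(JavaParams):      # new common base class: no JVM class of its own
    pass

class LinearSVC(JavaPredictor):       # concrete estimator
    def __init__(self):
        self._java_obj = object()     # placeholder for the real Py4J handle

candidates = [("JavaPredictor", JavaPredictor), ("LinearSVC", LinearSVC)]

# The filter as in the patched test: skip names starting with 'Java',
# so only concrete, non-abstract wrappers get instantiated.
selected = [name for name, cls in candidates
            if issubclass(cls, JavaParams)
            and not inspect.isabstract(cls)
            and not name.startswith('Java')]
```

With the extra clause, `JavaPredictor` is filtered out and only `LinearSVC` is instantiated, so `_java_obj.getParam` is never called on `None`.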
""" | ||
|
||
@since("3.0.0") | ||
def setLabelCol(self, value): |
ditto, these methods are provided by JavaPredictorParams. In the future, if we remove the automatically generated setters in code_gen, we can add them back.
…ificationModel/JavaProbabilisticClassifier) in PYTHON
I'm generally OK with it. Does this actually change what methods are available where? added or removed? or strictly a refactoring?
@srowen
After the next Python/Scala parity JIRA, SPARK-29093 (Remove automatically generated param setters in …)
Force-pushed from 5c5ea92 to d6c7da2
Test build #110674 has finished for PR 25776 at commit
Test build #110675 has finished for PR 25776 at commit
HasRawPredictionCol, HasProbabilityCol,
RandomForestParams, TreeClassifierParams, HasCheckpointInterval,
class RandomForestClassifier(JavaProbabilisticClassifier, HasSeed, RandomForestParams,
                             TreeClassifierParams, HasCheckpointInterval,
                             JavaMLWritable, JavaMLReadable):
What about adding some simple tests for it?
Interesting, so these models in Python didn't have any getter/setters for the column names before?
I was going to say it looked funny to be able to set the column on the model after it's fit, but, maybe that's actually fine, to be able to customize what cols it looks for after the fact.
Yes, matching Scala is a fine short-term goal. It sounds like this will need a follow-up?
BTW did those models not have a predict method before? how did they expose the functionality to predict then? (I'm sure I could figure it out, but know you've looked at this closely just now)
@srowen Thanks for your comments. This PR adds both setters and getters to the Python …
Test build #110829 has finished for PR 25776 at commit
@srowen Most models in PySpark do not have any setters or getters (one exception is OneVsRest), and no model has a prediction function. A main complaint about PySpark ML that I heard from users of JD's big data platform is that they cannot set the input/output column names of models; it is inconvenient to have to rename columns to avoid column conflicts. The goal is to keep the Python side in sync with the Scala side. It has two benefits: first, it makes the codebase easier to maintain, since when we change the Scala side it is easy to sync the Python side; second, feature parity, since methods like the models' getters are still missing on the Python side.
python/pyspark/ml/classification.py
Outdated
@@ -81,6 +160,8 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha
... Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC(maxIter=5, regParam=0.01)
>>> model = svm.fit(df)
>>> model.setPredictionCol("prediction")
What about changing the value to a non-default one like "newPrediction", and making sure that the output DataFrame/Row has the changed column name?
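The suggested check can be illustrated with a small stand-in (`FakeModel` below is hypothetical, not a PySpark class): configure a non-default column name, then assert that the transformed row carries that name rather than the default.

```python
# Hypothetical stand-in illustrating the suggested doctest; not PySpark code.
class FakeModel:
    def __init__(self):
        self.predictionCol = "prediction"      # default column name

    def setPredictionCol(self, value):
        self.predictionCol = value
        return self

    def transform(self, rows):
        # write predictions under whatever column name is configured
        return [dict(row, **{self.predictionCol: 1.0}) for row in rows]

out = FakeModel().setPredictionCol("newPrediction").transform(
    [{"features": [1.0, 2.0, 3.0]}])
```

A doctest written this way fails if the setter is silently ignored, which the default value "prediction" would not catch.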
Test build #110926 has finished for PR 25776 at commit
Merged to master
Thanks! @zhengruifeng @srowen
@huaxingao @srowen @zhengruifeng What about the extensibility of such a type hierarchy? I'd say it is perfectly valid for Python users to be interested in implementing their own … Wouldn't it make more sense to mimic the actual Scala hierarchy …? Additionally, I am a bit concerned about all the …
That's a broad question, and I don't have a strong opinion on it, mostly because I am not much of a Python user. I do think the Pyspark hierarchy exists to mimic the Scala hierarchy in general, because it's backed by Scala implementations. That's inevitable. So changes that synchronize the hierarchies are probably good. Making things more conventionally private that should be private is probably also welcome for Spark 3. |
@zero323 Hi, how do the newly added common classes in this PR affect end users implementing their own hierarchy? Could you please provide a use case?
Use cases for public ML type hierarchy
State before this PR
As described above, none of the described scenarios is really supported, and verbose and repetitive workarounds are required.
Impact of this PR
Strictly speaking, this PR doesn't change anything. However, it …
Proposal

The Python ML hierarchy should reflect the Scala hierarchy first (@srowen), i.e.

```python
class ClassifierParams: ...

class Predictor(Estimator, PredictorParams):
    def setLabelCol(self, value): ...
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...

class Classifier(Predictor, ClassifierParams):
    def setRawPredictionCol(self, value): ...

class PredictionModel(Model, PredictorParams):
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...
    def numFeatures(self): ...
    def predict(self, value): ...
```

and JVM interop should extend from this hierarchy, i.e.

```python
class JavaPredictionModel(PredictionModel): ...
```

In other words it should be consistent with the existing approach, where we have ABCs reflecting the Scala API (…)
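What the proposed public hierarchy would enable can be sketched with plain-Python ABCs. The class names follow the proposal above, while `MeanRegressor`/`MeanModel` are invented toy examples: a pure-Python (non-JVM) predictor participates in the same hierarchy that the `Java*` wrappers would extend, so `isinstance` checks against the public types work for both.

```python
from abc import ABC, abstractmethod

# Toy sketch of the proposed public ABCs; bodies are stand-ins.
class Predictor(ABC):
    @abstractmethod
    def _fit(self, dataset): ...

    def fit(self, dataset):
        return self._fit(dataset)

class PredictionModel(ABC):
    @abstractmethod
    def predict(self, value): ...

# A hypothetical pure-Python implementation plugging into the hierarchy.
class MeanModel(PredictionModel):
    def __init__(self, mean):
        self.mean = mean

    def predict(self, value):
        return self.mean            # always predicts the training mean

class MeanRegressor(Predictor):
    def _fit(self, dataset):
        return MeanModel(sum(dataset) / len(dataset))

model = MeanRegressor().fit([1.0, 2.0, 3.0])
```

Without a public `Predictor`/`PredictionModel`, such a backend would have to subclass the `Java*` classes, which are documented as private and assume a JVM object.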
@zero323
Even if that was the original intention, I'd say that ship has sailed.
Let us assume, for the sake of argument, that the above observations are irrelevant. I will argue that having a complete, public type hierarchy is still desirable:
That might be the case, though I think it is better to be cautious here. Multiple inheritance with ABCs and a deep type hierarchy can be somewhat tricky, especially if you need both Python 2 and 3 compatibility. Additionally, some subtypes can simply become obsolete if a public hierarchy is provided (the aforementioned …). It means that the proposed complete and public hierarchy might require some API changes, and can be harder to justify in a minor release.
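The trickiness of deep multiple inheritance can be illustrated with a minimal example. The class names below loosely echo the PySpark mixins but are stand-ins: with overlapping mixins, Python's C3 linearization (the MRO) decides which overridden setter a model actually gets, so reordering base classes can silently change behavior.

```python
# Minimal illustration of MRO with overlapping mixins; stand-in classes.
class HasPredictionCol:
    def setPredictionCol(self, value):
        return "shared-param setter"

class _PredictorParams(HasPredictionCol):
    def setPredictionCol(self, value):
        return "predictor-params setter"

class SomeModel(_PredictorParams, HasPredictionCol):
    pass

# C3 linearization: SomeModel -> _PredictorParams -> HasPredictionCol -> object
mro_names = [c.__name__ for c in SomeModel.__mro__]
which = SomeModel().setPredictionCol("p")
```

Here `_PredictorParams` wins because it precedes `HasPredictionCol` in the MRO; flipping the base-class order of `SomeModel` would be rejected or resolve differently, which is why a deep public hierarchy needs careful design.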
### What changes were proposed in this pull request?

Adds

```python
class _AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
                                   HasMaxIter, HasTol, HasFitIntercept,
                                   HasAggregationDepth): ...
```

with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel` in `AFTSurvivalRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and `JavaModel` in `AFTSurvivalRegressionModel` with the newly added `JavaPredictor` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377) and `JavaPredictionModel` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405) respectively.

This however is inconsistent with the Scala counterpart, where both classes extend private `AFTSurvivalRegressionBase` (https://github.com/apache/spark/blob/eb037a8180be4ab7570eda1fa9cbf3c84b92c3f7/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L48-L50).

This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p)):

```python
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param

issubclass(AFTSurvivalRegressionModel, HasMaxIter)              # False
hasattr(model, "maxIter") and isinstance(model.maxIter, Param)  # True
issubclass(AFTSurvivalRegressionModel, HasTol)                  # False
hasattr(model, "tol") and isinstance(model.tol, Param)          # True
```

and can cause problems in the future if the Predictor / PredictionModel API changes (unlike [`IsotonicRegression`](#26023), the current implementation is, technically speaking, correct, though incomplete).

### Does this PR introduce any user-facing change?

Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These changes are purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally, the affected API hasn't been released in its current form yet.

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Adds

```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```

with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel` in `IsotonicRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with the newly added `JavaPredictor` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377) and `JavaPredictionModel` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405) respectively.

This however is inconsistent with the Scala counterpart, where both classes extend private `IsotonicRegressionBase` (https://github.com/apache/spark/blob/3cb1b57809d0b4a93223669f5c10cea8fc53eff6/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala#L42-L43).

This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.

```python
from pyspark.ml.regression import IsotonicRegressionModel
from pyspark.ml.param.shared import HasWeightCol

issubclass(IsotonicRegressionModel, HasWeightCol)  # False
hasattr(model, "weightCol")                        # True
```

as well as introduces a bug, by adding an unsupported `predict` method:

```python
import inspect

hasattr(model, "predict")  # True
inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None,
#             kwonlyargs=[], kwonlydefaults=None, annotations={})
IsotonicRegressionModel.predict.__doc__
# 'Predict label for the given features.\n\n    .. versionadded:: 3.0.0'
model.predict(dataset.first().features)
# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...
```

Furthermore, the existing implementation can cause further problems in the future if the `Predictor` / `PredictionModel` API changes.

### Does this PR introduce any user-facing change?

Yes. It:

- Removes the invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightCol` to `IsotonicRegressionModel`.

However, the faulty implementation hasn't been released yet, and the proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends.

Note:

- `Predictor` and `JavaClassifier` should be abstract as the `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.

### Why are the changes needed?

To provide extension points for non-JVM algorithms, as well as a public (as opposed to the `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors. For a longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or #25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same. Additionally, "private" `Java*` classes in `ml.regression` and `ml.classification` have been renamed to follow PEP 8 conventions (added a leading underscore). It is open for discussion whether the same should be done to equivalent classes from `ml.wrapper`. (Before/after type-hierarchy diagrams for `JavaClassifier` and the old model are omitted here.)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
What changes were proposed in this pull request?
Add some common classes in Python to make it have the same structure as Scala.
This PR makes Python have the following:
Why are the changes needed?
Have parity between Python and Scala ML
Does this PR introduce any user-facing change?
Yes. It adds the following changes:
How was this patch tested?
Added a few doctests.