
[SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON #25776

Closed

Conversation

@huaxingao (Contributor):

What changes were proposed in this pull request?

Add some common classes in Python to give it the same structure as Scala:

1. Scala has `ClassifierParams`/`Classifier`/`ClassificationModel`:

```scala
trait ClassifierParams
    extends PredictorParams with HasRawPredictionCol

abstract class Classifier
    extends Predictor with ClassifierParams {
    def setRawPredictionCol
}

abstract class ClassificationModel
    extends PredictionModel with ClassifierParams {
    def setRawPredictionCol
}
```

This PR makes Python have the following:

```python
class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
    pass

class JavaClassifier(JavaPredictor, JavaClassifierParams):
    def setRawPredictionCol(self, value): ...

class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams):
    def setRawPredictionCol(self, value): ...
```
2. Scala has `ProbabilisticClassifierParams`/`ProbabilisticClassifier`/`ProbabilisticClassificationModel`:

```scala
trait ProbabilisticClassifierParams
    extends ClassifierParams with HasProbabilityCol with HasThresholds

abstract class ProbabilisticClassifier
    extends Classifier with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}

abstract class ProbabilisticClassificationModel
    extends ClassificationModel with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}
```

This PR makes Python have the following:

```python
class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams):
    pass

class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams):
    def setProbabilityCol(self, value): ...
    def setThresholds(self, value): ...

class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams):
    def setProbabilityCol(self, value): ...
    def setThresholds(self, value): ...
```
3. Scala has `PredictorParams`/`Predictor`/`PredictionModel`:

```scala
trait PredictorParams extends Params
    with HasLabelCol with HasFeaturesCol with HasPredictionCol

abstract class Predictor
    extends Estimator with PredictorParams {
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol
}

abstract class PredictionModel
    extends Model with PredictorParams {
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
}
```

This PR makes Python have the following:

```python
class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    pass

class JavaPredictor(JavaEstimator, JavaPredictorParams):
    def setLabelCol(self, value): ...
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...

class JavaPredictionModel(JavaModel, JavaPredictorParams):
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...
    def numFeatures(self): ...
    def predict(self, value): ...
```
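For context, a minimal sketch of how one of the wrapper setters outlined above is typically fleshed out, following the `self._set(...)` pattern visible in the review fragments further down; the exact docstring and decorator here are assumptions, not the PR's verbatim code:

```python
from pyspark import since
from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol
from pyspark.ml.wrapper import JavaEstimator


class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    """Params shared by JavaPredictor and JavaPredictionModel."""
    pass


class JavaPredictor(JavaEstimator, JavaPredictorParams):

    @since("3.0.0")
    def setLabelCol(self, value):
        """Sets the value of :py:attr:`labelCol`."""
        # Store the param value on the Python side; it is transferred to
        # the JVM object when the estimator is used.
        return self._set(labelCol=value)
```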

Why are the changes needed?

To have parity between Python and Scala ML.

Does this PR introduce any user-facing change?

Yes. It adds the following methods (a short usage sketch follows the list):

LinearSVCModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict

LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict

LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- predict
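A usage sketch of the newly exposed model methods (hypothetical two-row data, assuming an active SparkSession and the Spark 3.0 behavior added by this PR):

```python
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))])

model = LinearSVC(maxIter=5, regParam=0.01).fit(df)

# Column getters/setters are now available on the fitted model ...
model.setPredictionCol("newPrediction")
print(model.getPredictionCol())  # newPrediction

# ... and predict() scores a single feature vector, as in Scala.
print(model.predict(Vectors.dense(1.0, 1.0, 1.0)))
```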

How was this patch tested?

Added a few doctests.

@SparkQA commented Sep 12, 2019:

Test build #110524 has finished for PR 25776 at commit 5c5ea92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class IsotonicRegression(JavaPredictor, HasWeightCol, JavaMLWritable, JavaMLReadable):
  • class IsotonicRegressionModel(JavaPredictionModel, JavaMLWritable, JavaMLReadable):
  • class AFTSurvivalRegressionModel(JavaPredictionModel, JavaMLWritable, JavaMLReadable):

```python
Classes are indexed {0, 1, ..., numClasses - 1}.
To be mixed in with class:`pyspark.ml.JavaModel`
"""

def setRawPredictionCol(self, value):
```
Contributor: ditto

"""
return self._set(probabilityCol=value)

def setThresholds(self, value):
Contributor: ditto

```diff
@@ -343,7 +343,8 @@ def test_java_params(self):
         for module in modules:
             for name, cls in inspect.getmembers(module, inspect.isclass):
                 if not name.endswith('Model') and not name.endswith('Params') \
-                        and issubclass(cls, JavaParams) and not inspect.isabstract(cls):
+                        and issubclass(cls, JavaParams) and not inspect.isabstract(cls) \
+                        and not name.startswith('Java'):
```
Contributor:
What will happen if we do not change this place?

huaxingao (Contributor, Author):
Without the above change, pyspark.ml.tests.test_param.DefaultValuesTests fails, because `_java_obj` is never set:

```
ERROR: test_java_params (pyspark.ml.tests.test_param.DefaultValuesTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/hgao/spark081119/spark/python/pyspark/ml/tests/test_param.py", line 348, in test_java_params
    check_params(self, cls(), check_params_exist=False)
  File "/Users/hgao/spark081119/spark/python/pyspark/testing/mlutils.py", line 40, in check_params
    java_stage = py_stage._to_java()
  File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 222, in _to_java
    self._transfer_params_to_java()
  File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 145, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._defaultParamMap[param])
  File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 131, in _make_java_param_pair
    java_param = self._java_obj.getParam(param.name)
AttributeError: 'NoneType' object has no attribute 'getParam'
```

"""

@since("3.0.0")
def setLabelCol(self, value):
Contributor:
ditto, these methods are provided by JavaPredictorParams. In the future, if we remove the automatically generated setters in code_gen, we can add them back.

@srowen (Member) left a comment:

I'm generally OK with it. Does this actually change which methods are available where? Are any added or removed, or is it strictly a refactoring?

@huaxingao (Contributor, Author):

@srowen
This PR is mainly a refactoring, but it adds the following get/setXXX methods.

LinearSVCModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict

LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict

LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel

- get/setFeaturesCol
- get/setPredictionCol
- get/setLabelCol
- predict

After the next Python/Scala parity JIRA, SPARK-29093 (remove the automatically generated param setters in _shared_params_code_gen.py), the auto-generated setXXX methods will be removed; only the setXXX methods added here will remain. This matches the Scala behavior exactly.

@SparkQA commented Sep 17, 2019:

Test build #110674 has finished for PR 25776 at commit d6c7da2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class IsotonicRegression(JavaPredictor, HasWeightCol, JavaMLWritable, JavaMLReadable):
  • class IsotonicRegressionModel(JavaPredictionModel, JavaMLWritable, JavaMLReadable):
  • class AFTSurvivalRegressionModel(JavaPredictionModel, JavaMLWritable, JavaMLReadable):

@SparkQA commented Sep 17, 2019:

Test build #110675 has finished for PR 25776 at commit de6f8e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OneVsRestParams(JavaClassifierParams, HasWeightCol):

```diff
-                             HasRawPredictionCol, HasProbabilityCol,
-                             RandomForestParams, TreeClassifierParams, HasCheckpointInterval,
+class RandomForestClassifier(JavaProbabilisticClassifier, HasSeed, RandomForestParams,
+                             TreeClassifierParams, HasCheckpointInterval,
                              JavaMLWritable, JavaMLReadable):
```
Contributor:
What about adding some simple tests for it?

@srowen (Member) left a comment:

Interesting, so these models in Python didn't have any getters/setters for the column names before?

I was going to say it looked funny to be able to set the column on the model after it's fit, but maybe that's actually fine, to be able to customize which columns it looks for after the fact.

Yes, matching Scala is a fine short-term goal. It sounds like this will need a follow-up?

BTW, did those models not have a predict method before? How did they expose the predict functionality then? (I'm sure I could figure it out, but I know you've looked at this closely just now.)

@huaxingao (Contributor, Author):

@srowen Thanks for your comments.
Yes, it seems these models in Python didn't have any getters/setters for the column names before.
Also, the Python models didn't have a predict method before (the Scala models do). It seems Python users were not able to call predict to get the label for a given feature vector; if they needed predictions, they had to call model.transform and then read the predictionCol from the transform output.

This PR adds both setters and getters to the Python models. After we remove the setters generated by _shared_params_code_gen.py in the next PR, for JIRA https://issues.apache.org/jira/browse/SPARK-29093, we can control which setters to allow on the models. I initially thought we would not allow setters for Python models, but I looked at the Scala code again, and it seems most of the setters are allowed on Scala models. I guess we will match the Scala code exactly.
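To illustrate the difference, a sketch reusing the `df` and `model` names from the LinearSVC doctest (not verbatim from the patch):

```python
from pyspark.ml.linalg import Vectors

# `model` and `df` as in the LinearSVC doctest above.

# Before this PR: predictions were only reachable via transform().
model.transform(df).select("features", "prediction").show()

# After this PR: the fitted model can score a single feature vector
# directly, matching Scala's model.predict(features).
label = model.predict(Vectors.dense(1.0, 1.0, 1.0))
```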

@SparkQA commented Sep 17, 2019:

Test build #110829 has finished for PR 25776 at commit 95f88f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor):

@srowen Most models in PySpark do not have any setters/getters (one exception is OneVsRest), and no model has a prediction function.

A main complaint about PySpark ML that I have heard from users of JD's big data platform is that they cannot set the input/output column names of models. It is inconvenient to rename columns to avoid column conflicts.
Suppose we deal with a classification task in an interactive mode (like Jupyter). We have trained several classification models with default column names, we evaluate them one by one, and then want to ensemble some of the good ones. Now we must rename the predictionCol of some models after transformation, since all models have the same output column name; otherwise we need to re-train them with modified column names. Similar cases easily arise when we deal with a dataframe with tens of columns and try several algorithms. So we want column setters like on the Scala side.
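For example, the workaround versus the desired API might look like this (a sketch; `model_a`, `model_b` and `df` are hypothetical names for two fitted classification models and a dataset):

```python
# Both models were trained with the default predictionCol, so their
# outputs share the name "prediction" and collide once joined.
pred_a = model_a.transform(df).withColumnRenamed("prediction", "prediction_a")
pred_b = model_b.transform(df).withColumnRenamed("prediction", "prediction_b")

# With column setters on models (as on the Scala side), the fitted models
# can simply be pointed at distinct output columns, with no re-training:
pred_a = model_a.setPredictionCol("prediction_a").transform(df)
pred_b = model_b.setPredictionCol("prediction_b").transform(df)
```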

The goal is to keep the Python side in sync with the Scala side. It has two benefits: first, it makes the codebase easier to maintain, since changes on the Scala side are easy to mirror on the Python side; second, function parity, as methods like the models' getters are still missing on the Python side.
I tried to divide the goal into several subtasks in SPARK-28958; after this PR we need to resolve the others.

```diff
@@ -81,6 +160,8 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha
 ... Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
 >>> svm = LinearSVC(maxIter=5, regParam=0.01)
 >>> model = svm.fit(df)
+>>> model.setPredictionCol("prediction")
```
@zhengruifeng (Contributor) commented Sep 18, 2019:
What about changing the value to a non-default one, like "newPrediction", and making sure that the output dataframe/row has the changed column name?
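Something along these lines, perhaps (a hedged doctest sketch; `df` and `model` as in the surrounding example, exact output an assumption):

```python
>>> _ = model.setPredictionCol("newPrediction")
>>> "newPrediction" in model.transform(df).columns
True
>>> "prediction" in model.transform(df).columns
False
```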

@SparkQA commented Sep 18, 2019:

Test build #110926 has finished for PR 25776 at commit bc1d9e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen closed this in e97b55d on Sep 19, 2019.
@srowen (Member) commented Sep 19, 2019:

Merged to master

@huaxingao (Contributor, Author):

Thanks! @zhengruifeng @srowen

huaxingao deleted the spark-28985 branch on September 19, 2019, 14:44.
@zero323 (Member) commented Sep 19, 2019:

@huaxingao @srowen @zhengruifeng What about the extensibility of such a type hierarchy? I'd say it is perfectly valid for Python users to be interested in implementing their own Classifiers, ClassificationModels and so on, without using a JVM backend. With such an implementation they can only plug in at the Estimator or Transformer level and must recreate the whole type hierarchy. That is not a new problem (JavaClassificationModel set a precedent), but these changes really aggravate it.

Wouldn't it make more sense to mimic the actual Scala hierarchy, ClassifierParams, Classifier, ClassificationModel, and make the Java* classes extend these? I am thinking from a static type checking perspective, but there are other possible applications here.

Additionally, I am a bit concerned about all the (Private) annotations. These are rather useless for Python users. Why not follow PEP-8 and underscore these (i.e. _JavaClassifier)?

@srowen (Member) commented Sep 20, 2019:

That's a broad question, and I don't have a strong opinion on it, mostly because I am not much of a Python user. I do think the PySpark hierarchy exists to mimic the Scala hierarchy in general, because it's backed by Scala implementations; that's inevitable. So changes that synchronize the hierarchies are probably good. Making things that should be private more conventionally private is probably also welcome for Spark 3.

@zhengruifeng (Contributor):

@zero323 Hi, how do the common classes newly added in this PR affect end users implementing their own hierarchy? Could you please provide a use case?

@zero323 (Member) commented Sep 20, 2019:

@zhengruifeng

Use cases for public ML type hierarchy

  • Add Python-only Transformer implementations:

    • I am a Python user and want to implement a pure-Python ML classifier without providing a JVM backend.
    • I want this classifier to be meaningfully positioned in the existing type hierarchy.
    • However, I have access only to the high-level classes (Estimator, Model, MLReader / MLReadable).
  • Run-time parameter validation for both user-defined (see above) and existing class hierarchies:

    • I am a library developer who provides functions that are meaningful only for specific categories of Estimators, here classifiers.
    • I want to validate that a user-passed argument is indeed a classifier:
      • For built-in objects, using the "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and the Scala counterparts are public, shouldn't these be too?).
    • For user-defined objects I can:
      • Use duck typing (on setRawPredictionCol for a classifier, on numClasses for a classification model), but it is hardly satisfying.
      • Provide a parallel non-abstract type hierarchy (Classifier or PythonClassifier and so on) and require users to implement such interfaces. That, however, would require separate logic for checking built-in and user-provided classes.
      • Provide a parallel abstract type hierarchy, register all existing built-in classes, and require users to do the same.

    Clearly these are not satisfying solutions, as they require either defensive programming or reinventing the same functionality for different 3rd-party APIs.

  • Static type checking

    • I am either an end user or a library developer and want to use PEP-484 annotations to indicate components that require a classifier or a classification model.

    • Currently I can provide only imprecise annotations, such as

      ```python
      def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
      ```

      or try to narrow things down using structural subtyping (a runnable sketch follows this list):

      ```python
      class Classifier(Protocol, Estimator[M]):
          def setRawPredictionCol(self, value: str) -> Classifier: ...

      class ClassificationModel(Protocol, Model):
          def setRawPredictionCol(self, value: str) -> Model: ...
          def numClasses(self) -> int: ...
      ```
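For comparison, a runnable variant of the structural-subtyping workaround (Python 3.8+ `typing.Protocol`; the protocol and function names here are hypothetical):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsRawPrediction(Protocol):
    """Structurally, anything exposing setRawPredictionCol."""
    def setRawPredictionCol(self, value: str): ...


def set_raw_prediction_col(stage, col: str):
    # Duck-typed runtime check: only verifies that the method exists,
    # not its semantics, which is why a public nominal Classifier ABC
    # would be preferable to this kind of structural check.
    if not isinstance(stage, SupportsRawPrediction):
        raise TypeError("expected a classifier-like stage")
    return stage.setRawPredictionCol(col)
```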

State before this PR

As described above: none of the described scenarios is really supported, and verbose, repetitive workarounds are required.

Impact of this PR

> how do the newly added common classes in this PR affect end users implementing their own hierarchy

Strictly speaking, this PR doesn't change anything. However, it

  • Solidifies and extends the existing, IMHO highly insufficient, structure, with its "virtually private" classes and subordination to JVM interop. This can make any future improvements less likely.
  • Is a lost opportunity and, at least to some extent, fails to fulfill its own goal to "have the same structure as Scala".

Proposal

The Python ML hierarchy should reflect the Scala hierarchy first (@srowen), i.e.

```python
class ClassifierParams: ...

class Predictor(Estimator, PredictorParams):
    def setLabelCol(self, value): ...
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...

class Classifier(Predictor, ClassifierParams):
    def setRawPredictionCol(self, value): ...

class PredictionModel(Model, PredictorParams):
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...
    def numFeatures(self): ...
    def predict(self, value): ...
```

and the JVM interop should extend from this hierarchy, i.e.

```python
class JavaPredictionModel(PredictionModel): ...
```

In other words, it should be consistent with the existing approach, where we have ABCs reflecting the Scala API (Transformer, Estimator, Model, and so on), and the Java* variants are their subclasses.
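A minimal sketch of that layering (the abstract `_fit` hook is how pyspark estimators are already customized; the class names follow the proposal above, the rest is an assumption, not an implementation):

```python
from pyspark.ml import Estimator
from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol
from pyspark.ml.wrapper import JavaEstimator


class Predictor(Estimator, HasLabelCol, HasFeaturesCol, HasPredictionCol):
    """Public ABC mirroring Scala's Predictor; no JVM backend required."""

    def setLabelCol(self, value):
        return self._set(labelCol=value)


class JavaPredictor(JavaEstimator, Predictor):
    """JVM interop extends the public hierarchy instead of replacing it."""
    pass


class MyPythonOnlyClassifier(Predictor):
    """End users can now plug in at the Predictor level."""

    def _fit(self, dataset):
        raise NotImplementedError  # train and return a PredictionModel here
```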

@zhengruifeng (Contributor):

@zero323
This PR will not directly affect your own implementation, in my opinion.
As to your proposal, I personally think it is reasonable, although PySpark ML is mostly there to wrap the Scala side. We may open another ticket for it.

@zero323 (Member) commented Sep 23, 2019:

@zhengruifeng

> As to your proposal, I personally think it is reasonable, although PySpark ML is mostly there to wrap the Scala side.

Even if that was the original intention, I'd say that ship has sailed.

  • First of all, nothing in the original API indicated this. On the contrary, the original API clearly suggests that a non-Java path is supported, by providing base classes (Params, Transformer, Estimator, Model, ML{Reader,Writer}, ML{Readable,Writable}) as well as Java-specific implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, JavaML{Reader,Writer}, JavaML{Readable,Writable}).

  • Furthermore, authoritative (IMHO) and open Python ML extensions exist (spark-sklearn is one of these, and if I recall correctly spark-deep-learning provides some pure-Python utilities). Personally, I've seen quite a lot of private implementations, but that's just anecdotal evidence.

Let us assume, for the sake of argument, that the above observations are irrelevant. I will argue that having a complete, public type hierarchy is still desirable:

  • Two out of three use cases I described can be narrowed down to Java implementations only, though they are less compelling if we do that.
  • More importantly, a public type hierarchy with Java-specific extensions is the pyspark.ml standard. There is no reason to treat this specific case as an exception, especially when the implementation is far from utilitarian (for example, the implementation-free JavaClassifierParams and JavaProbabilisticClassifierParams serve, as of now, no practical purpose whatsoever).

> This PR will not directly affect your own implementation

That might be the case, though I think it is better to be cautious here. Multiple inheritance with ABCs and a deep type hierarchy can be somewhat tricky, especially if you need both Python 2 and 3 compatibility.

Additionally, some subtypes can simply become obsolete if a public hierarchy is provided (the aforementioned JavaClassifierParams and JavaProbabilisticClassifierParams are good examples).

This means that the proposed complete, public hierarchy might require some API changes, and could be harder to justify in a minor release.

srowen pushed a commit that referenced this pull request Oct 4, 2019
### What changes were proposed in this pull request?

Adds

```python
class _AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
                                   HasMaxIter, HasTol, HasFitIntercept,
                                   HasAggregationDepth): ...
```

with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and `HasAggregationDepth` in `AFTSurvivalRegression` base classes, and `JavaPredictionModel` in `AFTSurvivalRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and  `JavaModel` in `AFTSurvivalRegressionModel` with newly added `JavaPredictor`:

https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377

and `JavaPredictionModel`

https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405

respectively.

This, however, is inconsistent with the Scala counterpart, where both classes extend the private `AFTSurvivalRegressionBase`:

https://github.com/apache/spark/blob/eb037a8180be4ab7570eda1fa9cbf3c84b92c3f7/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L48-L50

This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p)):

```python
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param

issubclass(AFTSurvivalRegressionModel, HasMaxIter)
# False
hasattr(model, "maxIter")  and isinstance(model.maxIter, Param)
# True

issubclass(AFTSurvivalRegressionModel, HasTol)
# False
hasattr(model, "tol")  and isinstance(model.tol, Param)
# True
```

and can cause problems in the future if the `Predictor` / `PredictionModel` API changes (unlike [`IsotonicRegression`](#26023), the current implementation is, technically speaking, correct, though incomplete).

### Does this PR introduce any user-facing change?

Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These changes are purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally, the affected API hasn't been released in its current form yet.

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request Oct 4, 2019
### What changes were proposed in this pull request?

Adds

```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```

with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes, and `JavaPredictionModel` in `IsotonicRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with newly added `JavaPredictor`:

https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377

and `JavaPredictionModel`

https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405

respectively.

This, however, is inconsistent with the Scala counterpart, where both classes extend the private `IsotonicRegressionBase`:

https://github.com/apache/spark/blob/3cb1b57809d0b4a93223669f5c10cea8fc53eff6/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala#L42-L43

This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.

```python
from pyspark.ml.regression import IsotonicRegressionModel
from pyspark.ml.param.shared import HasWeightCol

issubclass(IsotonicRegressionModel, HasWeightCol)
# False

hasattr(model, "weightCol")
# True
```

as well as introduces a bug by adding an unsupported `predict` method:

```python
import inspect

hasattr(model, "predict")
# True

inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})

IsotonicRegressionModel.predict.__doc__
# 'Predict label for the given features.\n\n        .. versionadded:: 3.0.0'

model.predict(dataset.first().features)

# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...

```

Furthermore, the existing implementation can cause further problems in the future, if the `Predictor` / `PredictionModel` API changes.

### Does this PR introduce any user-facing change?

Yes. It:

- Removes invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightCol` to `IsotonicRegressionModel`.

However, the faulty implementation hasn't been released yet, and the proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
zhengruifeng pushed a commit that referenced this pull request Mar 4, 2020
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends.

Note

- `Predictor` and `JavaClassifier` should be abstract, as the `_fit` method is not implemented.
- `PredictionModel` should be abstract, as `_transform` is not implemented.

### Why are the changes needed?

To provide extension points for non-JVM algorithms, as well as a public (as opposed to the `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.

For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or #25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.

Additionally "private" `Java*` classes in`ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).

It is open for discussion whether the same should be done to the equivalent classes from `ml.wrapper`.

If we take `JavaClassifier` as an example, type hierarchy will change from

![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)

to

![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)

Similarly the old model

![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)

will become

![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020