[SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON #25776
Conversation
Test build #110524 has finished for PR 25776 at commit
Classes are indexed {0, 1, ..., numClasses - 1}.
To be mixed in with class:`pyspark.ml.JavaModel`
"""

def setRawPredictionCol(self, value):
ditto
""" | ||
return self._set(probabilityCol=value) | ||
|
||
def setThresholds(self, value): |
ditto
@@ -343,7 +343,8 @@ def test_java_params(self):
    for module in modules:
        for name, cls in inspect.getmembers(module, inspect.isclass):
            if not name.endswith('Model') and not name.endswith('Params') \
                    and issubclass(cls, JavaParams) and not inspect.isabstract(cls):
                    and issubclass(cls, JavaParams) and not inspect.isabstract(cls) \
                    and not name.startswith('Java'):
What will happen if we do not change this place?
Without the above change, pyspark.ml.tests.test_param.DefaultValuesTests will fail, because _java_obj is never set:
ERROR: test_java_params (pyspark.ml.tests.test_param.DefaultValuesTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/hgao/spark081119/spark/python/pyspark/ml/tests/test_param.py", line 348, in test_java_params
check_params(self, cls(), check_params_exist=False)
File "/Users/hgao/spark081119/spark/python/pyspark/testing/mlutils.py", line 40, in check_params
java_stage = py_stage._to_java()
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 222, in _to_java
self._transfer_params_to_java()
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 145, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._defaultParamMap[param])
File "/Users/hgao/spark081119/spark/python/pyspark/ml/wrapper.py", line 131, in _make_java_param_pair
java_param = self._java_obj.getParam(param.name)
AttributeError: 'NoneType' object has no attribute 'getParam'
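The failure can be reproduced in miniature with stand-in classes. The `JavaParams`, `JavaPredictor`, and `LinearSVC` classes below are plain-Python sketches, not the real PySpark classes: the point is only that the new shared `Java*` base classes never create a backing JVM object, so the test filter has to skip them.

```python
import inspect

class JavaParams:                     # stand-in for pyspark.ml.wrapper.JavaParams
    _java_obj = None                  # no backing JVM object by default

class JavaPredictor(JavaParams):      # new common base class: no JVM class of its own
    pass

class LinearSVC(JavaPredictor):       # concrete estimator
    def __init__(self):
        self._java_obj = object()     # placeholder for the real Py4J handle

candidates = [("JavaPredictor", JavaPredictor), ("LinearSVC", LinearSVC)]

# The filter as in the patched test: skip names starting with 'Java',
# so only concrete, non-abstract wrappers get instantiated.
selected = [name for name, cls in candidates
            if issubclass(cls, JavaParams)
            and not inspect.isabstract(cls)
            and not name.startswith('Java')]
```

With the extra clause, `JavaPredictor` is filtered out and only `LinearSVC` is instantiated, so `_java_obj.getParam` is never called on `None`.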
""" | ||
|
||
@since("3.0.0") | ||
def setLabelCol(self, value): |
ditto, these methods are provided by JavaPredictorParams. In the future, if we remove the automatically generated setters in code_gen, we can add them back.
…ificationModel/JavaProbabilisticClassifier) in PYTHON
I'm generally OK with it. Does this actually change what methods are available where? added or removed? or strictly a refactoring?
@srowen
After the next Python/Scala parity JIRA, SPARK-29093 (Remove automatically generated param setters in …)
Force-pushed from 5c5ea92 to d6c7da2
Test build #110674 has finished for PR 25776 at commit
Test build #110675 has finished for PR 25776 at commit
HasRawPredictionCol, HasProbabilityCol,
RandomForestParams, TreeClassifierParams, HasCheckpointInterval,
class RandomForestClassifier(JavaProbabilisticClassifier, HasSeed, RandomForestParams,
                             TreeClassifierParams, HasCheckpointInterval,
                             JavaMLWritable, JavaMLReadable):
What about adding some simple tests for it?
Interesting, so these models in Python didn't have any getter/setters for the column names before?
I was going to say it looked funny to be able to set the column on the model after it's fit, but, maybe that's actually fine, to be able to customize what cols it looks for after the fact.
Yes, matching Scala is a fine short-term goal. It sounds like this will need a follow-up?
BTW did those models not have a predict method before? how did they expose the functionality to predict then? (I'm sure I could figure it out, but know you've looked at this closely just now)
@srowen Thanks for your comments. This PR adds both setters and getters to the Python …
Test build #110829 has finished for PR 25776 at commit
@srowen Most models in PySpark do not have any setters or getters (one exception is OneVsRest), and no model has a prediction function. A main complaint about PySpark ML that I heard from users of JD's big data platform is that they cannot set the input/output column names of models; it is inconvenient to have to rename columns to avoid column conflicts. The goal is to keep the Python side in sync with the Scala side. It has two benefits: first, it makes the codebase easier to maintain, since when we change the Scala side it is easy to sync the Python side; second, feature parity, since methods like the models' getters are still missing on the Python side.
python/pyspark/ml/classification.py
Outdated
@@ -81,6 +160,8 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha
... Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC(maxIter=5, regParam=0.01)
>>> model = svm.fit(df)
>>> model.setPredictionCol("prediction")
What about changing the value to a non-default one like "newPrediction", and making sure that the output DataFrame/Row has the changed column name?
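The suggested check can be illustrated with a small stand-in (`FakeModel` below is hypothetical, not a PySpark class): configure a non-default column name, then assert that the transformed row carries that name rather than the default.

```python
# Hypothetical stand-in illustrating the suggested doctest; not PySpark code.
class FakeModel:
    def __init__(self):
        self.predictionCol = "prediction"      # default column name

    def setPredictionCol(self, value):
        self.predictionCol = value
        return self

    def transform(self, rows):
        # write predictions under whatever column name is configured
        return [dict(row, **{self.predictionCol: 1.0}) for row in rows]

out = FakeModel().setPredictionCol("newPrediction").transform(
    [{"features": [1.0, 2.0, 3.0]}])
```

A doctest written this way fails if the setter is silently ignored, which the default value "prediction" would not catch.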
Test build #110926 has finished for PR 25776 at commit
Merged to master
Thanks! @zhengruifeng @srowen
@huaxingao @srowen @zhengruifeng What about the extensibility of such a type hierarchy? I'd say it is perfectly valid for Python users to be interested in implementing their own … Wouldn't it make more sense to mimic the actual Scala hierarchy …? Additionally, I am a bit concerned about all the …
That's a broad question, and I don't have a strong opinion on it, mostly because I am not much of a Python user. I do think the Pyspark hierarchy exists to mimic the Scala hierarchy in general, because it's backed by Scala implementations. That's inevitable. So changes that synchronize the hierarchies are probably good. Making things more conventionally private that should be private is probably also welcome for Spark 3. |
@zero323 Hi, how do the newly added common classes in this PR affect end users implementing their own hierarchy? Could you please provide a use case?
Use cases for public ML type hierarchy
State before this PR
As described above, none of the described scenarios is really supported, and verbose and repetitive workarounds are required.
Impact of this PR
Strictly speaking, this PR doesn't change anything. However, it …
Proposal

The Python ML hierarchy should reflect the Scala hierarchy first (@srowen), i.e.

```python
class ClassifierParams: ...

class Predictor(Estimator, PredictorParams):
    def setLabelCol(self, value): ...
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...

class Classifier(Predictor, ClassifierParams):
    def setRawPredictionCol(self, value): ...

class PredictionModel(Model, PredictorParams):
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...
    def numFeatures(self): ...
    def predict(self, value): ...
```

and JVM interop should extend from this hierarchy, i.e.

```python
class JavaPredictionModel(PredictionModel): ...
```

In other words it should be consistent with the existing approach, where we have ABCs reflecting the Scala API (…)
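What the proposed public hierarchy would enable can be sketched with plain-Python ABCs. The class names follow the proposal above, while `MeanRegressor`/`MeanModel` are invented toy examples: a pure-Python (non-JVM) predictor participates in the same hierarchy that the `Java*` wrappers would extend, so `isinstance` checks against the public types work for both.

```python
from abc import ABC, abstractmethod

# Toy sketch of the proposed public ABCs; bodies are stand-ins.
class Predictor(ABC):
    @abstractmethod
    def _fit(self, dataset): ...

    def fit(self, dataset):
        return self._fit(dataset)

class PredictionModel(ABC):
    @abstractmethod
    def predict(self, value): ...

# A hypothetical pure-Python implementation plugging into the hierarchy.
class MeanModel(PredictionModel):
    def __init__(self, mean):
        self.mean = mean

    def predict(self, value):
        return self.mean            # always predicts the training mean

class MeanRegressor(Predictor):
    def _fit(self, dataset):
        return MeanModel(sum(dataset) / len(dataset))

model = MeanRegressor().fit([1.0, 2.0, 3.0])
```

Without a public `Predictor`/`PredictionModel`, such a backend would have to subclass the `Java*` classes, which are documented as private and assume a JVM object.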
@zero323
Even if that was the original intention, I'd say that ship has sailed.
Let us assume, for the sake of argument, that the above observations are irrelevant. I will argue that having a complete, public type hierarchy is still desirable:
That might be the case, though I think it is better to be cautious here. Multiple inheritance with ABCs and a deep type hierarchy can be somewhat tricky, especially if you need both Python 2 and 3 compatibility. Additionally, some subtypes can simply become obsolete if a public hierarchy is provided (the aforementioned …). It means that the proposed complete and public hierarchy might require some API changes, and can be harder to justify in a minor release.
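The trickiness of deep multiple inheritance can be illustrated with a minimal example. The class names below loosely echo the PySpark mixins but are stand-ins: with overlapping mixins, Python's C3 linearization (the MRO) decides which overridden setter a model actually gets, so reordering base classes can silently change behavior.

```python
# Minimal illustration of MRO with overlapping mixins; stand-in classes.
class HasPredictionCol:
    def setPredictionCol(self, value):
        return "shared-param setter"

class _PredictorParams(HasPredictionCol):
    def setPredictionCol(self, value):
        return "predictor-params setter"

class SomeModel(_PredictorParams, HasPredictionCol):
    pass

# C3 linearization: SomeModel -> _PredictorParams -> HasPredictionCol -> object
mro_names = [c.__name__ for c in SomeModel.__mro__]
which = SomeModel().setPredictionCol("p")
```

Here `_PredictorParams` wins because it precedes `HasPredictionCol` in the MRO; flipping the base-class order of `SomeModel` would be rejected or resolve differently, which is why a deep public hierarchy needs careful design.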
### What changes were proposed in this pull request?

Adds

```python
class _AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
                                   HasMaxIter, HasTol, HasFitIntercept,
                                   HasAggregationDepth): ...
```

with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel` in `AFTSurvivalRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and `JavaModel` in `AFTSurvivalRegressionModel` with the newly added `JavaPredictor` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377) and `JavaPredictionModel` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405) respectively.

This however is inconsistent with the Scala counterpart, where both classes extend private `AFTSurvivalRegressionBase` (https://github.com/apache/spark/blob/eb037a8180be4ab7570eda1fa9cbf3c84b92c3f7/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L48-L50).

This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p)):

```python
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param

issubclass(AFTSurvivalRegressionModel, HasMaxIter)              # False
hasattr(model, "maxIter") and isinstance(model.maxIter, Param)  # True
issubclass(AFTSurvivalRegressionModel, HasTol)                  # False
hasattr(model, "tol") and isinstance(model.tol, Param)          # True
```

and can cause problems in the future if the Predictor / PredictionModel API changes (unlike [`IsotonicRegression`](#26023), the current implementation is, technically speaking, correct, though incomplete).

### Does this PR introduce any user-facing change?

Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These changes are purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally, the affected API hasn't been released in its current form yet.

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Adds

```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```

with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel` in `IsotonicRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with the newly added `JavaPredictor` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L377) and `JavaPredictionModel` (https://github.com/apache/spark/blob/e97b55d32285052a1f76cca35377c4b21eb2e7d7/python/pyspark/ml/wrapper.py#L405) respectively.

This however is inconsistent with the Scala counterpart, where both classes extend private `IsotonicRegressionBase` (https://github.com/apache/spark/blob/3cb1b57809d0b4a93223669f5c10cea8fc53eff6/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala#L42-L43).

This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.

```python
from pyspark.ml.regression import IsotonicRegressionModel
from pyspark.ml.param.shared import HasWeightCol

issubclass(IsotonicRegressionModel, HasWeightCol)  # False
hasattr(model, "weightCol")                        # True
```

as well as introduces a bug, by adding an unsupported `predict` method:

```python
import inspect

hasattr(model, "predict")  # True
inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None,
#             kwonlyargs=[], kwonlydefaults=None, annotations={})
IsotonicRegressionModel.predict.__doc__
# 'Predict label for the given features.\n\n    .. versionadded:: 3.0.0'
model.predict(dataset.first().features)
# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...
```

Furthermore, the existing implementation can cause further problems in the future if the `Predictor` / `PredictionModel` API changes.

### Does this PR introduce any user-facing change?

Yes. It:

- Removes the invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightCol` to `IsotonicRegressionModel`.

However, the faulty implementation hasn't been released yet, and the proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.

Authored-by: zero323 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends.

Note:

- `Predictor` and `JavaClassifier` should be abstract as the `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.

### Why are the changes needed?

To provide extension points for non-JVM algorithms, as well as a public (as opposed to the `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors. For a longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or #25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same. Additionally, "private" `Java*` classes in `ml.regression` and `ml.classification` have been renamed to follow PEP 8 conventions (added a leading underscore). It is open for discussion whether the same should be done to equivalent classes from `ml.wrapper`. (Before/after type-hierarchy diagrams for `JavaClassifier` and the old model are omitted here.)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
What changes were proposed in this pull request?
Add some common classes in Python to make it have the same structure as Scala.
This PR makes Python have the following:
Why are the changes needed?
Have parity between Python and Scala ML
Does this PR introduce any user-facing change?
Yes. It adds the following changes:
How was this patch tested?
Added a few doctests.