[pyspark] support pred_contribs #8633
Conversation
@@ -331,3 +334,25 @@ def split_params() -> Tuple[Dict[str, Any], Dict[str, Union[int, float, bool]]]:
    assert dvalid.num_col() == dtrain.num_col()

    return dtrain, dvalid


def pred_contribs(
@trivialfis Do you think it's better to move the pred_contribs function to XGBModel?
Maybe we can merge it into the XGBModel/XGBClassifier.predict method for consistency?
wow, good suggestion.
@trivialfis can you file a follow-up PR to merge it into the XGBModel/XGBClassifier.predict method?
sure.
python-package/xgboost/spark/core.py
if pred_contrib_col_name:
    contribs = pred_contribs(model, X, base_margin)
    assert len(contribs.shape) == 2
This is not necessarily true. See doc/prediction.rst.
Done.
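For context, a minimal sketch of what such a helper can do with the Booster API, assuming the sklearn-style wrapper exposes get_booster(); this is not the exact implementation merged in this PR:

import xgboost as xgb


def pred_contribs(model, X, base_margin=None):
    """Sketch: compute per-feature contribution predictions (SHAP values)."""
    data = xgb.DMatrix(X, base_margin=base_margin)
    booster = model.get_booster()
    # strict_shape=True always yields (n_samples, n_groups, n_features + 1),
    # so binary and multi-class models can be handled uniformly.
    contribs = booster.predict(
        data, pred_contribs=True, validate_features=False, strict_shape=True
    )
    # Flatten the trailing axes so each row becomes one contribution vector.
    return contribs.reshape(contribs.shape[0], -1)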
python-package/xgboost/spark/core.py
preds = model.predict(
    X,
    base_margin=base_margin,
    validate_features=False,
    **predict_params,
)
yield pd.Series(preds)
data["prediction"] = pd.Series(preds)
Can it handle multiple prediction types? For instance, normal prediction + contribs at the same time?
Do you mean handling "normal prediction + contribs" in the same "predict" of XGBModel, or in a single pandas UDF?
Right now, this PR handles the latter: it can predict "normal prediction + contribs" in a single pandas UDF. If "predict" of XGBModel can support multiple prediction types, I can change it accordingly.
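For illustration, a sketch of handling several prediction types in one pandas UDF by returning a struct column (one pandas DataFrame per batch). The toy model, schema, and column names here are assumptions, not the exact code in this PR:

from typing import Iterator

import numpy as np
import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, struct

spark = SparkSession.builder.getOrCreate()

# Toy model trained on the driver; the real code distributes the fitted booster.
X = np.random.rand(32, 3)
y = (X[:, 0] > 0.5).astype(int)
booster = xgb.train({"objective": "binary:logistic"}, xgb.DMatrix(X, label=y), 4)

df = spark.createDataFrame([(list(map(float, row)),) for row in X], ["features"])

@pandas_udf("prediction double, pred_contrib array<double>")
def predict_udf(it: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for batch in it:
        # Build the DMatrix once per batch and reuse it for both outputs.
        m = xgb.DMatrix(np.stack(batch["features"].to_list()))
        out = pd.DataFrame()
        out["prediction"] = booster.predict(m).astype(np.float64)
        contribs = booster.predict(m, pred_contribs=True, strict_shape=True)
        out["pred_contrib"] = list(contribs.reshape(contribs.shape[0], -1).astype(np.float64))
        yield out

result = df.withColumn("pred", predict_udf(struct("features")))
result.select("pred.prediction", "pred.pred_contrib").show(3, truncate=False)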
@trivialfis This PR is ready for review. Please help to review it. Thx
@hcho3 @trivialfis please help to start the CI. Thx
@trivialfis seems the failure case is not caused by this PR.
=================================== FAILURES ===================================
__________________________ TestLinear.test_coordinate __________________________
@WeichenXu123 @trivialfis please help to review it. Thx
python-package/xgboost/spark/core.py
Pred = namedtuple(
    "Pred", ("prediction", "raw_prediction", "probability", "pred_contrib")
)
pred = Pred("prediction", "rawPrediction", "probability", "pred_contrib")
Do we need to keep a consistent naming scheme? predContrib vs. pred_contrib, based on the use of rawPrediction.
Good suggestion. Done
if pred_contrib_col_name:
    dataset = dataset.withColumn(
        pred_contrib_col_name,
        array_to_vector(getattr(col(pred_struct_col), pred.pred_contrib)),
Does this work with cuDF?
It seems not, since the prediction UDF does not yet have cuDF support.
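As a side note on the array_to_vector call above, a minimal CPU-only sketch of how an array<double> column is converted into an ML vector column (column names are illustrative):

from pyspark.ml.functions import array_to_vector
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([0.1, -0.2, 0.3],)], ["pred_contrib_arr"])
# array_to_vector turns an array<double> column into a DenseVector (VectorUDT) column.
df = df.withColumn("pred_contrib", array_to_vector(col("pred_contrib_arr")))
df.show(truncate=False)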
)
if pred_contrib_col_name:
    # We will force setting strict_shape to True when predicting contribs,
Excellent!
@pytest.fixture
def reg_data(spark: SparkSession) -> Generator[RegData, None, None]:
Do we need tests for multi-class as well?
Yeah, good suggestion.
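A possible shape for such a multi-class fixture, following the existing reg_data fixture and its spark fixture; the fixture name, dataset, and return type here are illustrative only:

from typing import Generator

import numpy as np
import pytest
from pyspark.ml.linalg import Vectors
from pyspark.sql import DataFrame, SparkSession


@pytest.fixture
def multi_cls_data(spark: SparkSession) -> Generator[DataFrame, None, None]:
    # Three rows, three features, three classes: enough to exercise the
    # per-class contribution output of a multi-class model.
    X = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 5.5], [4.0, 0.0, 1.0]])
    y = np.array([0, 1, 2])
    rows = [(Vectors.dense(X[i, :]), int(y[i])) for i in range(len(y))]
    yield spark.createDataFrame(rows, ["features", "label"])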
    (Vectors.dense(X[0, :]), int(y[0])),
    (Vectors.sparse(3, {1: float(X[1, 1]), 2: float(X[1, 2])}), int(y[1])),
]
cls_df_train = spark.createDataFrame(reg_df_train_data, ["features", "label"])
cls usually means classifier.
cls usually means classifier.
Based on my experience, "cls" is usually used as an abbreviation for "class", while "clf" is used as an abbreviation for "classifier".
No specific request here; it's just an arbitrary convention in this project. Both are used.
Thx, Done.
data[pred.prediction] = pd.Series(preds)

if pred_contrib_col_name:
    contribs = pred_contribs(model, X, base_margin)
Q: Possible optimization: can we compute pred_contribs, proba, and prediction in one pass?
AFAIK, the xgboost API can predict one type at a time. @trivialfis can you correct me?
That's correct. However, the Spark API can iterate through multiple types of prediction to meet the Spark convention if needed.
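In other words, each Booster.predict call produces one output type, but the DMatrix can be built once and reused, so "prediction + contribs" only pays the data-construction cost a single time. A small standalone sketch with toy data and illustrative names:

import numpy as np
import xgboost as xgb

X = np.random.rand(16, 3)
y = np.random.randint(0, 2, size=16)
booster = xgb.train({"objective": "binary:logistic"}, xgb.DMatrix(X, label=y), 4)

m = xgb.DMatrix(X)  # constructed once, reused for every prediction type
preds = booster.predict(m)
contribs = booster.predict(m, pred_contribs=True, strict_shape=True)
print(preds.shape, contribs.shape)  # (16,) and (16, 1, 4)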
pred_contrib_col: "Param[str]" = Param(
    Params._dummy(),
    "pred_contrib_col",
    "contribution prediction column name.",
We can explain a bit more about the contribution prediction here.
Done
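For example, the param description could spell out what the column contains; the wording below is only a suggestion, not the text that was merged:

from pyspark.ml.param import Param, Params, TypeConverters

pred_contrib_col: "Param[str]" = Param(
    Params._dummy(),
    "pred_contrib_col",
    "feature contributions to individual predictions (one value per feature "
    "plus a final bias term), stored as an ML vector column.",
    typeConverter=TypeConverters.toString,
)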
"pred_contrib_col", | ||
"contribution prediction column name.", | ||
typeConverter=TypeConverters.toString, | ||
) |
We need to add this param into the _pyspark_specific_params dict.
Thx, Done.
To fix #8449: this PR supports pred_contribs for PySpark.