Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843 and 5a48eb8

### Why are the changes needed?

As discussed in #33800 (comment), there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang authored and dongjoon-hyun committed Aug 24, 2021
1 parent d6c453a commit de932f5
Showing 11 changed files with 3 additions and 871 deletions.
36 changes: 1 addition & 35 deletions docs/ml-tuning.md
@@ -71,44 +71,10 @@ for multiclass problems, a [`MultilabelClassificationEvaluator`](api/scala/org/a
[`RankingEvaluator`](api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html) for ranking problems. The default metric used to
choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.
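As an illustrative sketch (not part of this patch), overriding the default metric uses the standard `setMetricName` setter on each evaluator; the column names here are assumptions:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Override the evaluator's default metric ("f1") with accuracy; the accepted
// metric names are documented on each evaluator class.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
```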

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility (see the *Cross-Validation* section below for an example).
To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` to a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.
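For illustration only, a minimal sketch of building a grid and enabling parallel evaluation; the estimator choice and the `training` DataFrame are assumptions, while `ParamGridBuilder`, `TrainValidationSplit`, and `setParallelism` are standard Spark ML APIs:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()

// 2 x 2 = 4 parameter combinations to evaluate.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)          // shorthand for Array(true, false)
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)
  .setParallelism(4)                 // evaluate up to 4 settings concurrently

// val model = tvs.fit(training)     // `training` is an assumed DataFrame
```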

Alternatively, users can use the [`ParamRandomBuilder`](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) utility.
This has the same properties as `ParamGridBuilder` described above, but the hyperparameters are chosen at random within a user-defined range.
The mathematical principle behind this is that, given enough random samples, the probability that every sample misses the near-optimal region tends to zero.
Irrespective of the machine learning model, about 60 random samples are enough for at least one of them to land within the top 5% of the search space with roughly 95% probability.
If this 5% volume falls between the points defined in a grid search, it will *never* be found by `ParamGridBuilder`.
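A minimal sketch of the arithmetic behind the "about 60 samples" figure, assuming samples are drawn independently and uniformly and that the near-optimal region covers 5% of the searched range:

```scala
// If the near-optimal region covers a fraction p of the search space, then
// n independent uniform samples all miss it with probability (1 - p)^n,
// so at least one hits it with probability 1 - (1 - p)^n.
val p = 0.05
(10 to 100 by 10).foreach { n =>
  val hitProb = 1.0 - math.pow(1.0 - p, n)
  println(f"n = $n%3d samples -> P(at least one near-optimal sample) = $hitProb%.3f")
}
// At n = 60 this probability is already about 0.95, hence the figure quoted above.
```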

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [`ParamRandomBuilder` Scala docs](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`ParamRandomBuilder` Java docs](api/java/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java %}
</div>

<div data-lang="python" markdown="1">

Python users are encouraged to look at Python libraries built specifically for hyperparameter tuning, such as Hyperopt.

Refer to the [`ParamRandomBuilder` Python docs](api/python/reference/api/pyspark.ml.tuning.ParamRandomBuilder.html) for details on the API.

{% include_example python/ml/model_selection_random_hyperparameters_example.py %}

</div>

</div>

# Cross-Validation

`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
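As a hedged sketch of the `$k=3$` setup described above (the estimator, evaluator, and `training` DataFrame are assumptions; the `CrossValidator` setters are standard Spark ML APIs):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// Each ParamMap is fit and evaluated on 3 (training, test) pairs;
// the metrics are averaged to pick the best ParamMap.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(training)   // `training` is an assumed DataFrame
```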

