Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843 and 5a48eb8

### Why are the changes needed?

As discussed in #33800 (comment), there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang authored and dongjoon-hyun committed Aug 24, 2021
1 parent d6c453a commit de932f5
Showing 11 changed files with 3 additions and 871 deletions.
36 changes: 1 addition & 35 deletions docs/ml-tuning.md
@@ -71,44 +71,10 @@ for multiclass problems, a [`MultilabelClassificationEvaluator`](api/scala/org/a
[`RankingEvaluator`](api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html) for ranking problems. The default metric used to
choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.
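As an illustrative sketch (not part of this patch), overriding the default metric uses the standard `setMetricName` setter on each evaluator; the column names here are assumptions:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Override the evaluator's default metric ("f1") with accuracy; the accepted
// metric names are documented on each evaluator class.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
```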

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility (see the *Cross-Validation* section below for an example).
To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` to a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.
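For illustration only, a minimal sketch of building a grid and enabling parallel evaluation; the estimator choice and the `training` DataFrame are assumptions, while `ParamGridBuilder`, `TrainValidationSplit`, and `setParallelism` are standard Spark ML APIs:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()

// 2 x 2 = 4 parameter combinations to evaluate.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)          // shorthand for Array(true, false)
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)
  .setParallelism(4)                 // evaluate up to 4 settings concurrently

// val model = tvs.fit(training)     // `training` is an assumed DataFrame
```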

Alternatively, users can use the [`ParamRandomBuilder`](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) utility.
This has the same properties as `ParamGridBuilder` described above, but the hyperparameters are chosen at random within a user-defined range.
The mathematical principle behind this is that, given enough random samples, the probability that every sample misses the near-optimal region tends to zero.
Irrespective of the machine learning model, about 60 random samples are enough for at least one of them to land within the top 5% of the search space with roughly 95% probability.
If this 5% volume falls between the points defined in a grid search, it will *never* be found by `ParamGridBuilder`.
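A minimal sketch of the arithmetic behind the "about 60 samples" figure, assuming samples are drawn independently and uniformly and that the near-optimal region covers 5% of the searched range:

```scala
// If the near-optimal region covers a fraction p of the search space, then
// n independent uniform samples all miss it with probability (1 - p)^n,
// so at least one hits it with probability 1 - (1 - p)^n.
val p = 0.05
(10 to 100 by 10).foreach { n =>
  val hitProb = 1.0 - math.pow(1.0 - p, n)
  println(f"n = $n%3d samples -> P(at least one near-optimal sample) = $hitProb%.3f")
}
// At n = 60 this probability is already about 0.95, hence the figure quoted above.
```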

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [`ParamRandomBuilder` Scala docs](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`ParamRandomBuilder` Java docs](api/java/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java %}
</div>

<div data-lang="python" markdown="1">

Python users are encouraged to look at Python libraries built specifically for hyperparameter tuning, such as Hyperopt.

Refer to the [`ParamRandomBuilder` Python docs](api/python/reference/api/pyspark.ml.tuning.ParamRandomBuilder.html) for details on the API.

{% include_example python/ml/model_selection_random_hyperparameters_example.py %}

</div>

</div>

# Cross-Validation

`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
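As a hedged sketch of the `$k=3$` setup described above (the estimator, evaluator, and `training` DataFrame are assumptions; the `CrossValidator` setters are standard Spark ML APIs):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// Each ParamMap is fit and evaluated on 3 (training, test) pairs;
// the metrics are averaged to pick the best ParamMap.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(training)   // `training` is an assumed DataFrame
```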

