[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow up for apache#14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <[email protected]>

Closes apache#15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
lins05 authored and srowen committed Sep 28, 2016
1 parent 4a83395 commit b2a7eed
Showing 2 changed files with 20 additions and 8 deletions.
14 changes: 10 additions & 4 deletions docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
## ChiSqSelector

`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.

**Examples**
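
For readers who want to see what the three selection methods amount to, here is a minimal pure-Python sketch. It is not Spark's implementation and does not use the Spark API. The function names (`chi2_sf`, `chi_square_test`, `select_features`), the toy data, and the choice to rank features by ascending p-value are all assumptions made for illustration. `FPR` is interpreted here as keeping every feature whose p-value is below a threshold `alpha`.

```python
import math
from collections import Counter


def chi2_sf(stat, df):
    """P(X >= stat) for a chi-squared variable with integer df >= 1."""
    if stat <= 0:
        return 1.0
    y = stat / 2.0
    # Base cases: df = 1 gives erfc(sqrt(y)); df = 2 gives exp(-y).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def chi_square_test(feature, labels):
    """Pearson chi-squared test of independence for one categorical feature."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    df = (len(feature_totals) - 1) * (len(label_totals) - 1)
    # A constant feature or label carries no information: treat it as p = 1.
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, p_value


def select_features(columns, labels, method="kbest", k=50, percentile=0.1, alpha=0.05):
    """Return sorted indices of the selected feature columns (illustrative only)."""
    results = [chi_square_test(col, labels) for col in columns]
    # Assumption: rank by ascending p-value, strongest dependence first.
    ranked = sorted(range(len(columns)), key=lambda i: results[i][1])
    if method == "kbest":
        chosen = ranked[:k]
    elif method == "percentile":
        chosen = ranked[:int(len(columns) * percentile)]
    elif method == "fpr":
        chosen = [i for i in ranked if results[i][1] < alpha]
    else:
        raise ValueError("unknown selection method: " + method)
    return sorted(chosen)


# Toy data (hypothetical): three categorical feature columns and a binary label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # fully determined by the label
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
    [0, 0, 0, 1, 0, 1, 1, 1],  # weakly related to the label
]

print(select_features(columns, labels, method="kbest", k=1))
print(select_features(columns, labels, method="percentile", percentile=0.67))
print(select_features(columns, labels, method="fpr", alpha=0.05))
```

In this sketch `KBest` and `Percentile` differ only in how the cut-off count is derived, while `FPR` can return any number of features, including none, depending on `alpha`.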

14 changes: 10 additions & 4 deletions docs/mllib-feature-extraction.md
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
both speed and statistical learning behavior.

[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.

The number of features to select can be tuned using a held-out validation set.
