[SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide)

## What changes were proposed in this pull request?
Add documentation for the string indexer ordering options to the user guide, in both the `StringIndexer` and `RFormula` sections.

## How was this patch tested?
Existing tests.

Author: zhengruifeng3 <[email protected]>
Author: zhengruifeng <[email protected]>

Closes #21792 from zhengruifeng/doc_string_indexer_ordering.
zhengruifeng3 authored and srowen committed Jul 21, 2018
1 parent d7ae424 commit 81af886
Showing 1 changed file with 22 additions and 3 deletions.
25 changes: 22 additions & 3 deletions docs/ml-features.md
@@ -585,7 +585,11 @@ for more details on the API.
 ## StringIndexer

 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, and four ordering options are supported:
+"frequencyDesc": descending order by label frequency (most frequent label assigned 0),
+"frequencyAsc": ascending order by label frequency (least frequent label assigned 0),
+"alphabetDesc": descending alphabetical order, and "alphabetAsc": ascending alphabetical order
+(default = "frequencyDesc").
 The unseen labels will be put at index numLabels if user chooses to keep them.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
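
As a rough illustration of the four ordering options documented in this hunk, the following plain-Python sketch computes a label-to-index mapping. It does not use Spark: `index_labels`, `lookup`, and the `keep_unseen` flag are hypothetical names introduced here, and breaking frequency ties alphabetically is an assumption of the sketch rather than something the doc text states.

```python
from collections import Counter

ORDER_TYPES = ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")


def index_labels(values, string_order_type="frequencyDesc"):
    """Map each distinct label to an index in [0, numLabels)."""
    if string_order_type not in ORDER_TYPES:
        raise ValueError(f"unsupported ordering: {string_order_type!r}")
    # Numeric input is cast to string before indexing, as the doc describes.
    counts = Counter(str(v) for v in values)
    if string_order_type == "frequencyDesc":
        # Most frequent label first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif string_order_type == "frequencyAsc":
        # Least frequent label first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    else:
        ordered = sorted(counts, reverse=(string_order_type == "alphabetDesc"))
    return {label: index for index, label in enumerate(ordered)}


def lookup(label, mapping, keep_unseen=False):
    """Return a label's index; an unseen label maps to numLabels if kept."""
    key = str(label)
    if key in mapping:
        return mapping[key]
    if keep_unseen:
        return len(mapping)
    raise KeyError(f"unseen label: {label!r}")


labels = ["b", "a", "b", "a", "c", "b"]
for order_type in ORDER_TYPES:
    print(order_type, index_labels(labels, order_type))
```

With the labels `b, a, b, a, c, b`, the sketch assigns index 0 to `b` under `frequencyDesc` and to `a` under `alphabetAsc`, and an unseen label such as `d` lands at index 3 when `keep_unseen=True`.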
@@ -1593,10 +1597,25 @@ Suppose `a` and `b` are double columns, we use the following simple examples to
 * `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3` are coefficients.

 `RFormula` produces a vector column of features and a double or string column of label.
-Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
-If the label column is of type string, it will be first transformed to double with `StringIndexer`.
+Like when formulas are used in R for linear regression, numeric columns will be cast to doubles.
+As to string input columns, they will first be transformed with [StringIndexer](ml-features.html#stringindexer) using ordering determined by `stringOrderType`,
+and the last category after ordering is dropped, then the doubles will be one-hot encoded.
+
+Suppose a string feature column containing values `{'b', 'a', 'b', 'a', 'c', 'b'}`, we set `stringOrderType` to control the encoding:
+~~~
+stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
+----------------|---------------------------------------|----------------------------------
+'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')
+'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')
+'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')
+'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')
+~~~
+
+If the label column is of type string, it will be first transformed to double with [StringIndexer](ml-features.html#stringindexer) using `frequencyDesc` ordering.
 If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

+**Note:** The ordering option `stringOrderType` is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in `StringIndexer`.
+
 **Examples**

 Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
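
The `RFormula` encoding rules added in this hunk (index string features by `stringOrderType`, drop the last category after ordering, one-hot encode the rest, and always index a string label with `frequencyDesc`) can be sketched in plain Python. This is an illustration only and not Spark code: `order_categories`, `encode_feature`, and `index_label` are hypothetical helpers, and alphabetical tie-breaking for equal frequencies is an assumption of the sketch.

```python
from collections import Counter

ORDER_TYPES = ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")


def order_categories(values, string_order_type="frequencyDesc"):
    """Return the distinct categories in the order the indexer would use."""
    if string_order_type not in ORDER_TYPES:
        raise ValueError(f"unsupported ordering: {string_order_type!r}")
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        return sorted(counts, key=lambda c: (-counts[c], c))
    if string_order_type == "frequencyAsc":
        return sorted(counts, key=lambda c: (counts[c], c))
    return sorted(counts, reverse=(string_order_type == "alphabetDesc"))


def encode_feature(values, string_order_type="frequencyDesc"):
    """One-hot encode a string feature after dropping the last category."""
    ordered = order_categories(values, string_order_type)
    kept = ordered[:-1]  # the last category after ordering is dropped
    vectors = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    # Return the category mapped to 0, the dropped category, and the vectors.
    return ordered[0], ordered[-1], vectors


def index_label(values):
    """Index a string label column; stringOrderType is not consulted here."""
    ordered = order_categories(values, "frequencyDesc")
    mapping = {c: float(i) for i, c in enumerate(ordered)}
    return [mapping[v] for v in values]


column = ["b", "a", "b", "a", "c", "b"]
for order_type in ORDER_TYPES:
    first, dropped, _ = encode_feature(column, order_type)
    print(f"{order_type}: mapped to 0 = {first!r}, dropped = {dropped!r}")
```

For the column `b, a, b, a, c, b`, the printed first and dropped categories match the four rows of the table in the diff, and the dropped category is encoded as an all-zero vector.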