From 81af88687f97f70b30828ac63239129637852526 Mon Sep 17 00:00:00 2001
From: zhengruifeng3
Date: Sat, 21 Jul 2018 08:26:45 -0500
Subject: [PATCH] [SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide)

## What changes were proposed in this pull request?
add doc for string indexer ordering

## How was this patch tested?
existing tests

Author: zhengruifeng3
Author: zhengruifeng

Closes #21792 from zhengruifeng/doc_string_indexer_ordering.
---
 docs/ml-features.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index ad6e718b37f1b..882b895a9d154 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -585,7 +585,11 @@ for more details on the API.
 ## StringIndexer
 
 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, and four ordering options are supported:
+"frequencyDesc": descending order by label frequency (most frequent label assigned 0),
+"frequencyAsc": ascending order by label frequency (least frequent label assigned 0),
+"alphabetDesc": descending alphabetical order, and "alphabetAsc": ascending alphabetical order
+(default = "frequencyDesc").
 The unseen labels will be put at index numLabels if user chooses to keep them.
 If the input column is numeric, we cast it to string and index the string values.
 When downstream pipeline components such as `Estimator` or
@@ -1593,10 +1597,25 @@ Suppose `a` and `b` are double columns, we use the following simple examples to
 * `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3` are coefficients.
 
 `RFormula` produces a vector column of features and a double or string column of label.
-Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
-If the label column is of type string, it will be first transformed to double with `StringIndexer`.
+Like when formulas are used in R for linear regression, numeric columns will be cast to doubles.
+As to string input columns, they will first be transformed with [StringIndexer](ml-features.html#stringindexer) using ordering determined by `stringOrderType`,
+and the last category after ordering is dropped, then the doubles will be one-hot encoded.
+
+Suppose a string feature column containing values `{'b', 'a', 'b', 'a', 'c', 'b'}`, we set `stringOrderType` to control the encoding:
+~~~
+stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
+----------------|---------------------------------------|----------------------------------
+'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')
+'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')
+'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')
+'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')
+~~~
+
+If the label column is of type string, it will be first transformed to double with [StringIndexer](ml-features.html#stringindexer) using `frequencyDesc` ordering.
 If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
+
+**Note:** The ordering option `stringOrderType` is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in `StringIndexer`.
 
 **Examples**
 
 Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`: