[SPARK-29645][ML][PYSPARK] ML add param RelativeError #26305

zhengruifeng · 2019-10-30T04:08:33Z

What changes were proposed in this pull request?

1, add shared param relativeError
2, Imputer/RobustScaler/QuantileDiscretizer extend HasRelativeError
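
A minimal plain-Python sketch of the shared-param idea, for illustration only: a single mixin owns the validated `relativeError` param, and each estimator picks it up by inheritance. The classes below are simplified stand-ins rather than Spark's actual implementation, and the default of 0.001 is an assumption of this sketch. The [0, 1] range comes from the param documentation quoted in the review below.

```python
class HasRelativeError:
    """Hypothetical mixin that owns one validated `relativeError` param."""

    # Illustrative default; an assumption of this sketch.
    _relative_error = 0.001

    def setRelativeError(self, value):
        value = float(value)
        # Reject anything outside [0, 1]; NaN also fails this comparison.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"relativeError must be in [0, 1], got {value}")
        self._relative_error = value
        return self  # returning self allows chained setters

    def getRelativeError(self):
        return self._relative_error


# Stand-in estimators: only the shared param is modeled here.
class Imputer(HasRelativeError):
    pass


class RobustScaler(HasRelativeError):
    pass


class QuantileDiscretizer(HasRelativeError):
    pass


if __name__ == "__main__":
    imputer = Imputer().setRelativeError(0.01)
    print(imputer.getRelativeError())
    print(RobustScaler().getRelativeError())
```

With this layout the validation and the getter are written once, so all three estimators accept the same values for the param.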

Why are the changes needed?

It makes sense to expose relativeError to end users, since it controls both the precision and the memory overhead of the approximate quantile computation.
QuantileDiscretizer already exposes this param, while the other algorithms do not yet.
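
To make the precision side of that trade-off concrete, here is a small plain-Python sketch of the rank guarantee documented for `DataFrameStatFunctions.approxQuantile`: for N rows, probability p and relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper names `rank_bounds` and `acceptable_values` are hypothetical and exist only for this example; the code does not run Spark's quantile algorithm, it only lists which exact values would be acceptable answers.

```python
import math


def rank_bounds(p, err, n):
    """Return the inclusive (low, high) window of acceptable 1-based ranks."""
    low = max(math.floor((p - err) * n), 0)
    high = min(math.ceil((p + err) * n), n)
    return low, high


def acceptable_values(sorted_values, p, err):
    """Return the values whose exact rank falls inside the allowed window."""
    low, high = rank_bounds(p, err, len(sorted_values))
    # Ranks are 1-based, list indices are 0-based.
    return sorted_values[max(low - 1, 0):high]


if __name__ == "__main__":
    values = list(range(1, 1025))  # value i has exact rank i

    # err = 0 means an exact quantile: only one acceptable answer.
    print(acceptable_values(values, 0.5, 0.0))

    # A looser error widens the window of acceptable answers.
    print(rank_bounds(0.5, 1 / 64, len(values)))
    print(len(acceptable_values(values, 0.5, 1 / 64)))
```

A smaller error narrows the window of acceptable answers, which is the precision the user gives up when trading it for lower memory overhead.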

Does this PR introduce any user-facing change?

Yes, a new param is added to Imputer/RobustScaler.

How was this patch tested?

Existing test suites.

SparkQA · 2019-10-30T04:27:24Z

Test build #112882 has finished for PR 26305 at commit d057403.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-30T06:08:06Z

Test build #112890 has finished for PR 26305 at commit 3fd1892.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-10-30T09:32:57Z

retest this please

SparkQA · 2019-10-30T12:05:36Z

Test build #112911 has finished for PR 26305 at commit 3fd1892.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2019-10-30T18:54:44Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

-   * Relative error (see documentation for
-   * `org.apache.spark.sql.DataFrameStatFunctions.approxQuantile` for description)
-   * Must be in the range [0, 1].
-   * Note that in multiple columns case, relative error is applied to all columns.


Nit: It seems the last line above ("Note that in multiple columns case, relative error is applied to all columns.") got removed in the new documentation. Maybe put it somewhere else in the doc, for example at the end of line 97?

Since 2.3.0, `QuantileDiscretizer` can map multiple columns at once by setting the `inputCols` parameter. If both of the `inputCol` and `inputCols` parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the `numBucketsArray` parameter can be set, or if the number of buckets should be the same across columns, `numBuckets` can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.

huaxingao · 2019-10-30T18:55:36Z

LGTM

SparkQA · 2019-10-31T05:18:25Z

Test build #112987 has finished for PR 26305 at commit 5689c1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-10-31T05:55:02Z

Merged to master, thanks @huaxingao for reviewing!

zhengruifeng added 3 commits October 30, 2019 11:39

create pr

e903d26

fix conflicts

334f3b3

nit

d057403

zhengruifeng added ML PYSPARK labels Oct 30, 2019

fix mima

3fd1892

huaxingao reviewed Oct 30, 2019

update doc

5689c1e

zhengruifeng closed this in bb47870 Oct 31, 2019

zhengruifeng deleted the add_relative_err branch October 31, 2019 05:55

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks
