[SPARK-29645][ML][PYSPARK] ML add param RelativeError #26305

Closed
wants to merge 5 commits into from

zhengruifeng
Contributor

What changes were proposed in this pull request?

1. Add shared param `relativeError`.
2. Make Imputer/RobustScaler/QuantileDiscretizer extend `HasRelativeError`.

Why are the changes needed?

It makes sense to expose `relativeError` to end users, since it controls both the precision and the memory overhead.
QuantileDiscretizer already had this param, while the other algorithms did not.
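
To make the precision side of that trade-off concrete, here is a minimal pure-Python sketch of the rank guarantee documented for `DataFrameStatFunctions.approxQuantile`: for N elements, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper names `rank_bounds` and `is_acceptable` are hypothetical and not part of Spark, and defining rank as the count of elements less than or equal to the candidate is an assumption made for illustration.

```python
import bisect
import math


def rank_bounds(p, relative_error, n):
    """Inclusive range of ranks allowed for target probability p.

    Hypothetical helper: mirrors the documented bound
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N),
    clamped to [0, n].
    """
    low = math.floor((p - relative_error) * n)
    high = math.ceil((p + relative_error) * n)
    return max(low, 0), min(high, n)


def is_acceptable(sorted_values, candidate, p, relative_error):
    """Check whether `candidate` is a valid approximate quantile.

    Assumption: rank(x) is the number of elements <= x, which is
    unambiguous only when the values are distinct.
    """
    n = len(sorted_values)
    low, high = rank_bounds(p, relative_error, n)
    rank = bisect.bisect_right(sorted_values, candidate)
    return low <= rank <= high


# 1000 distinct values; the exact median has rank 500.
values = list(range(1, 1001))

# A tighter error admits fewer candidates around the exact quantile.
tight = [v for v in values if is_acceptable(values, v, 0.5, 0.001)]
loose = [v for v in values if is_acceptable(values, v, 0.5, 0.05)]
```

With a larger error, more values around the exact median count as valid answers, which is why a quantile summary can afford to keep less state; a smaller error narrows the window and generally costs more memory.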

Does this PR introduce any user-facing change?

Yes, a new param `relativeError` is added to Imputer/RobustScaler.

How was this patch tested?

Existing test suites.

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112882 has finished for PR 26305 at commit d057403.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112890 has finished for PR 26305 at commit 3fd1892.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112911 has finished for PR 26305 at commit 3fd1892.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Relative error (see documentation for
* `org.apache.spark.sql.DataFrameStatFunctions.approxQuantile` for description)
* Must be in the range [0, 1].
* Note that in multiple columns case, relative error is applied to all columns.
Contributor

Nit: It seems the above line was removed in the new documentation. Maybe put it somewhere else in the doc, for example at the end of line 97?

Since 2.3.0, `QuantileDiscretizer` can map multiple columns at once by setting the `inputCols` parameter. If both of the `inputCol` and `inputCols` parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the `numBucketsArray` parameter can be set, or if the number of buckets should be the same across columns, `numBuckets` can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.

@huaxingao
Contributor

LGTM

@SparkQA

SparkQA commented Oct 31, 2019

Test build #112987 has finished for PR 26305 at commit 5689c1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Merged to master, thanks @huaxingao for reviewing!
