-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer #10085
[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer #10085
Conversation
…ectizer. One question (for review) is do we want to change the bucketizer as I've done or create a different wrapper? I think this way is better but it does introduce an extra param so no sure
… a param, print out the splits from the trained bucketizer
Test build #47026 has finished for PR 10085 at commit
|
Test build #47030 has finished for PR 10085 at commit
|
Test build #47031 has finished for PR 10085 at commit
|
cc @yanboliang who filed the JIRA for this. |
Can you please only link to the specific JIRA, not the umbrella? |
OK, I was not realized that there is an umbrella JIRA for this. I'll close mine. |
@holdenk For your questions, I first tried to modify the interface of
|
I vote for making |
@yanboliang I am OK to change |
Ok - just to make sure do you see any issues with the current approach for getSplits? Its tested a bit in the doctests but if there is a potential issue I can add some more tests. |
@holdenk No more issue in getSplits. It looks good. |
@yinxusen thanks :) |
cc @yanboliang if you have a chance to take a look |
…feature.QuantileDiscretizer
re-ping @yanboliang or @jkbradley if you've got the time to look at this (already been reviewed a bit). |
Test build #48495 has finished for PR 10085 at commit
|
…feature.QuantileDiscretizer
Test build #49177 has finished for PR 10085 at commit
|
re-ping @jkbradley ? |
I'll try to check this soon, but have some others first. It will be great if someone else can review this PR in the meantime. @yinxusen Would you have time? Thanks! |
@jkbradley I'll help you reviewing this. |
# a placeholder to make it appear in the generated doc | ||
numBuckets = Param(Params._dummy(), "numBuckets", | ||
"Maximum number of buckets (quantiles, or " + | ||
"categories) into which data points are grouped. Must be >= 2.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a default 2
here?
>>> bucketed[0].buckets | ||
0.0 | ||
|
||
.. versionadded:: 1.6.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change it to 2.0.0.
@jkbradley LGTM except for the version labels. |
@yinxusen /@jkbradley updated the version added tag to 2.0.0 :) |
Test build #49534 has finished for PR 10085 at commit
|
LGTM as well! Thanks. |
@@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol): | |||
"specified will be treated as errors.") | |||
|
|||
@keyword_only | |||
def __init__(self, splits=None, inputCol=None, outputCol=None): | |||
def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is _java_model needed? It does not seem to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yah, I think the original plan was to avoid the overhead of object creation and sending the params back to the JVM if it is supplied since we already had a transformer. I'll remove this.
Those are the only issues I see. Thanks everyone for reviewing & @holdenk for the PR! |
Test build #49694 has finished for PR 10085 at commit
|
…feature.QuantileDiscretizer
Test build #49719 has finished for PR 10085 at commit
|
Test build #49726 has finished for PR 10085 at commit
|
Think I addressed all of @jkbradley's comments |
…feature.QuantileDiscretizer
Test build #50015 has finished for PR 10085 at commit
|
seems unrelated, jenkins retest this please. |
Test build #50037 has finished for PR 10085 at commit
|
LGTM |
Add Python API for ml.feature.QuantileDiscretizer.
One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model.
cc @brkyvz & @mengxr