[SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) #2063
QA tests have started for PR 2063 at commit
JavaSparkContext sc = new JavaSparkContext(sparkConf);

String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
cache the data before computation?
It is cached by tree training, but should we cache it here too, since it is used again for testing?
We cache the binned features in training. But in this example, we visit the raw features twice. Since the data is read from disk, caching it should help.
We are only calculating numClasses in the Java example. Should we eliminate it, since it makes the already verbose Java code even more verbose? Otherwise, we need to make the same change to the Scala and Python examples.
If we decide to cache, let's note why we are doing it in a comment. Otherwise, some users might get confused and decide to always cache before calling the tree algorithm.
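To make the point in this thread concrete, here is a plain-Java analogy (not Spark code) of why the raw data benefits from caching when it is visited once for training and once for testing. `CountingSource`, `cached`, and `readsForTrainAndTest` are hypothetical names made up for this sketch; in the actual example the equivalent step would presumably be calling `cache()` on the loaded RDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class CachingSketch {

    /** Hypothetical stand-in for an expensive read from disk; counts how often it runs. */
    static final class CountingSource implements Supplier<List<String>> {
        int reads = 0;

        @Override
        public List<String> get() {
            reads++;
            return Arrays.asList("1 1:0.5 2:0.3", "0 1:0.1 2:0.9");
        }
    }

    /** Wraps a source so the underlying read happens at most once. */
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded = false;

            @Override
            public T get() {
                if (!loaded) {
                    value = source.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    /** Simulates one training pass and one evaluation pass over the raw data. */
    static int readsForTrainAndTest(boolean useCache) {
        CountingSource disk = new CountingSource();
        Supplier<List<String>> data = useCache ? cached(disk) : disk;
        data.get(); // pass 1: training visits the raw features
        data.get(); // pass 2: evaluation visits them again
        return disk.reads;
    }

    public static void main(String[] args) {
        System.out.println("reads without cache: " + readsForTrainAndTest(false));
        System.out.println("reads with cache:    " + readsForTrainAndTest(true));
    }
}
```

The sketch only shows the effect of reusing data across two passes. It says nothing about whether caching is worthwhile when the data is visited once, which is the confusion the comment above wants to avoid.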
QA tests have finished for PR 2063 at commit
feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble
algorithms such as decision forest and boosting are among the top performers for classification and
algorithms such as decision forests and boosting are among the top performers for classification and
should we just call them random forests instead? :-)
QA tests have started for PR 2063 at commit
Thanks @jkbradley, I had some minor comments that I have noted above. LGTM!
@mengxr @manishamde Thanks for the feedback! I believe I've addressed all of the comments.
QA tests have started for PR 2063 at commit
QA tests have finished for PR 2063 at commit
QA tests have finished for PR 2063 at commit
binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
For a categorical feature with `$M$` possible values (categories), one could come up with
`$2^{M-1}-1$` split candidates. For binary classification and regression,
we can reduce the number of split candidates to `$M-1$` by ordering the
categorical feature values by the proportion of labels falling in one of the two classes (see
This explanation is specific to binary classification, though I think it's supposed to explain a strategy that's applicable to both binary classification and regression.
Correct.
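For readers following the doc text under review, the counting argument for the binary classification case can be sketched in plain Java. This is an illustrative sketch only, not the MLlib implementation: the class and method names are hypothetical, and it assumes every category has at least one example so the proportions are well defined.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CategoricalSplits {

    /** Number of ways to split M unordered categories into two non-empty groups: 2^(M-1) - 1. */
    static long countUnorderedSplits(int m) {
        return (1L << (m - 1)) - 1;
    }

    /**
     * Orders categories by the proportion of examples in the positive class, then
     * returns the M-1 candidate splits; each inner list is the "left" side of one split.
     * positives[c] and totals[c] are per-category counts (totals[c] assumed > 0).
     */
    static List<List<Integer>> orderedSplits(long[] positives, long[] totals) {
        int m = totals.length;
        List<Integer> order = new ArrayList<>();
        for (int c = 0; c < m; c++) {
            order.add(c);
        }
        order.sort(Comparator.comparingDouble(c -> (double) positives[c] / totals[c]));

        // Only prefixes of the ordered list are considered, giving M-1 candidates.
        List<List<Integer>> splits = new ArrayList<>();
        for (int k = 1; k < m; k++) {
            splits.add(new ArrayList<>(order.subList(0, k)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] positives = {8, 1, 5, 3};   // positive-class count per category
        long[] totals = {10, 10, 10, 10};  // total examples per category
        System.out.println(countUnorderedSplits(4));           // 7
        System.out.println(orderedSplits(positives, totals));  // [[1], [1, 3], [1, 3, 2]]
    }
}
```

With four categories, the unordered count is 7 while the ordered strategy yields 3 candidates. The regression case mentioned in the doc text would order by a different per-category statistic, which this sketch does not cover.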
@atalwalkar Thanks for the comments! I believe I've fixed the issues.
QA tests have started for PR 2063 at commit
LGTM
QA tests have finished for PR 2063 at commit
I've merged this into master and branch-1.1. Thanks!!
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar
Author: Joseph K. Bradley
Closes #2063 from jkbradley/dt-docs and squashes the following commits:
2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added Java, Python examples.
(cherry picked from commit 050f8d0)
Signed-off-by: Xiangrui Meng
…pache#2063) This PR moves `loadS3Authz` to `SparkContext` so that the unified auth can be effective for `SparkContext` use cases
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar