[SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) #2063

jkbradley · 2014-08-20T18:36:12Z

Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar

…doc example as needed.

SparkQA · 2014-08-20T18:40:43Z

QA tests have started for PR 2063 at commit b9bee04.

This patch merges cleanly.

mengxr · 2014-08-20T18:57:39Z

docs/mllib-decision-tree.md

+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();


Cache the data before computation?

It is cached by tree training, but should we cache it here too since it is used again for testing?

We cached the binned features in training. But in this example, we visit the raw features twice. Since it is reading from disk, it should help if we cache the data.

We are only calculating numClasses in the Java example. Should we eliminate it, since it makes the already verbose Java code even more verbose? Otherwise, we need to make the same change to the Scala and Python examples.

If we decide to cache, let's note why we are doing it via a comment. Otherwise, some users might get confused and decide to always cache before calling the tree algorithm.
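
To make the caching argument above concrete, here is a minimal pure-Python sketch. It does not use Spark: `LazyDataset`, `fake_disk_read`, and `train_then_evaluate` are hypothetical names invented for illustration, and the "model" is a trivial threshold rather than a decision tree. The only point is that a lazily evaluated dataset visited once for training and once for testing gets re-read unless it is cached.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, loader):
        self._loader = loader    # function that (re)reads the raw records
        self._cached = None      # records kept in memory after the first pass
        self._use_cache = False
        self.loads = 0           # how many times the loader actually ran

    def cache(self):
        """Ask the dataset to keep its records after the first read."""
        self._use_cache = True
        return self

    def records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._loader())
        if self._use_cache:
            self._cached = data
        return data


def fake_disk_read():
    # Hypothetical (label, feature) pairs standing in for a LIBSVM file.
    return [(0.0, 1.5), (1.0, 3.2), (1.0, 2.9), (0.0, 0.7)]


def train_then_evaluate(dataset):
    # Pass 1: "training" derives a threshold from the raw features.
    rows = dataset.records()
    threshold = sum(x for _, x in rows) / len(rows)
    # Pass 2: "testing" visits the same raw features again.
    rows = dataset.records()
    errors = sum(1 for y, x in rows if (1.0 if x > threshold else 0.0) != y)
    return errors / len(rows)


uncached = LazyDataset(fake_disk_read)
cached = LazyDataset(fake_disk_read).cache()
err_uncached = train_then_evaluate(uncached)
err_cached = train_then_evaluate(cached)
print("reads without cache:", uncached.loads)
print("reads with cache:", cached.loads)
```

The result is the same either way; only the number of reads differs. That is also why a comment explaining the cache call is useful: the benefit comes from the second pass over the raw data, not from tree training itself.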

SparkQA · 2014-08-20T19:35:26Z

QA tests have finished for PR 2063 at commit b9bee04.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- is used for ordering. In multiclass classification, all$2^`
- public final class JavaDecisionTree

manishamde · 2014-08-20T21:00:24Z

docs/mllib-decision-tree.md

 feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble
-algorithms such as decision forest and boosting are among the top performers for classification and
+algorithms such as decision forests and boosting are among the top performers for classification and


should we just call them random forests instead? :-)

SparkQA · 2014-08-20T21:20:37Z

QA tests have started for PR 2063 at commit d802369.

This patch merges cleanly.

manishamde · 2014-08-20T21:24:48Z

Thanks @jkbradley

I had some minor comments that I have noted above. LGTM!

jkbradley · 2014-08-20T21:55:25Z

@mengxr @manishamde Thanks for the feedback! I believe I've addressed all of the comments.

SparkQA · 2014-08-20T22:00:53Z

QA tests have started for PR 2063 at commit 9dd1b6b.

This patch merges cleanly.

SparkQA · 2014-08-20T22:13:49Z

QA tests have finished for PR 2063 at commit d802369.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-20T22:55:37Z

QA tests have finished for PR 2063 at commit 9dd1b6b.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- In multiclass classification, all$2^`
- public final class JavaDecisionTree

atalwalkar · 2014-08-20T23:01:01Z

docs/mllib-decision-tree.md

-binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
+For a categorical feature with `$M$` possible values (categories), one could come up with
+`$2^{M-1}-1$` split candidates. For binary classification and regression,
+we can reduce the number of split candidates to `$M-1$` by ordering the
 categorical feature values by the proportion of labels falling in one of the two classes (see


This explanation is specific to binary classification, though I think it's supposed to explain a strategy that's applicable to both binary classification and regression.
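
As a rough illustration of the split counts in the diff above, the following Python sketch compares the `2^{M-1}-1` unordered split candidates with the `M-1` candidates obtained after ordering the categories. It assumes categories are ordered by their mean label, which for 0/1 labels equals the proportion falling in class 1 and for regression is the average target. The function names and the sample rows are made up for this example and are not taken from the MLlib code.

```python
from collections import defaultdict


def unordered_split_count(m):
    """Number of distinct two-way partitions of m categories."""
    return 2 ** (m - 1) - 1


def ordered_split_count(m):
    """Number of split points once the m categories are put in a fixed order."""
    return m - 1


def ordered_splits(rows):
    """Order categories by mean label and return the M-1 (left, right) partitions.

    rows: iterable of (category, label) pairs; labels are 0/1 for binary
    classification or real-valued for regression.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in rows:
        sums[category] += label
        counts[category] += 1
    # Ties on the mean are broken by category name to keep the order deterministic.
    ordered = sorted(counts, key=lambda c: (sums[c] / counts[c], c))
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


# Hypothetical binary-labelled data with M = 4 categories.
rows = [
    ("a", 1), ("a", 1),
    ("b", 0), ("b", 0), ("b", 1),
    ("c", 0),
    ("d", 1), ("d", 0),
]
splits = ordered_splits(rows)
print(unordered_split_count(4), ordered_split_count(4))
for left, right in splits:
    print(left, "|", right)
```

Because the same ordering key (mean label) works for both 0/1 labels and real-valued targets, one description can cover binary classification and regression, which is the point raised in the comment above.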

jkbradley · 2014-08-21T00:07:08Z

@atalwalkar Thanks for the comments! I believe I've fixed the issues.

SparkQA · 2014-08-21T00:11:58Z

QA tests have started for PR 2063 at commit 2dd2c19.

This patch merges cleanly.

atalwalkar · 2014-08-21T00:34:17Z

LGTM

SparkQA · 2014-08-21T01:06:53Z

QA tests have finished for PR 2063 at commit 2dd2c19.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-21T07:19:05Z

I've merged this into master and branch-1.1. Thanks!!

@mengxr

Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar

Author: Joseph K. Bradley <[email protected]>

Closes #2063 from jkbradley/dt-docs and squashes the following commits:

2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added Java, Python examples.

(cherry picked from commit 050f8d0)
Signed-off-by: Xiangrui Meng <[email protected]>


jkbradley added 3 commits August 19, 2014 21:02

d939a92 Updated DecisionTree documentation. Added Java, Python examples.
57eee9f Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
b9bee04 Updated DT examples

mengxr reviewed Aug 20, 2014

manishamde reviewed Aug 20, 2014

Updates based on comments: cache data, corrected doc text.

d802369

Updated decision tree doc.

9dd1b6b

atalwalkar reviewed Aug 20, 2014

Last updates based on github review.

2dd2c19

asfgit closed this in 050f8d0 Aug 21, 2014

jkbradley deleted the dt-docs branch August 26, 2014 17:41
