Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) #2063

Closed
wants to merge 6 commits into from

Conversation

jkbradley
Copy link
Member

Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have started for PR 2063 at commit b9bee04.

  • This patch merges cleanly.

JavaSparkContext sc = new JavaSparkContext(sparkConf);

String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache the data before computation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is cached by tree training, but should we cache it here too since it used again for testing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cached the binned features in training. But in this example, we visit the raw features twice. Since it is reading from disk, it should help if we cache the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only calculating numClasses in the Java example. Should we eliminate it since it makes the already verbose Java code even more verbose? Else, we need to make the same change to the Scala and Python examples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to cache, let's note why we are doing it via a comment. Else, some users might get confused and decide to cache always before calling the tree algorithm.

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2063 at commit b9bee04.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • is used for ordering. In multiclass classification, all$2^`
    • public final class JavaDecisionTree

feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble
algorithms such as decision forest and boosting are among the top performers for classification and
algorithms such as decision forests and boosting are among the top performers for classification and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just call them random forests instead? :-)

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have started for PR 2063 at commit d802369.

  • This patch merges cleanly.

@manishamde
Copy link
Contributor

Thanks @jkbradley

I had some minor comments that I have noted above. LGTM!

@jkbradley
Copy link
Member Author

@mengxr @manishamde Thanks for the feedback! I believe I've addressed all of the comments.

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have started for PR 2063 at commit 9dd1b6b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2063 at commit d802369.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2063 at commit 9dd1b6b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • In multiclass classification, all$2^`
    • public final class JavaDecisionTree

binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
For a categorical feature with `$M$` possible values (categories), one could come up with
`$2^{M-1}-1$` split candidates. For binary classification and regression,
we can reduce the number of split candidates to `$M-1$` by ordering the
categorical feature values by the proportion of labels falling in one of the two classes (see
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explanation is specific to binary classification, though I think it's supposed to explain a strategy that's applicable to both binary classification and regression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@jkbradley
Copy link
Member Author

@atalwalkar Thanks for the comments! I believe I've fixed the issues.

@SparkQA
Copy link

SparkQA commented Aug 21, 2014

QA tests have started for PR 2063 at commit 2dd2c19.

  • This patch merges cleanly.

@atalwalkar
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Aug 21, 2014

QA tests have finished for PR 2063 at commit 2dd2c19.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 21, 2014

I've merged this into master and branch-1.1. Thanks!!

asfgit pushed a commit that referenced this pull request Aug 21, 2014
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr  @manishamde @atalwalkar

Author: Joseph K. Bradley <[email protected]>

Closes #2063 from jkbradley/dt-docs and squashes the following commits:

2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.

(cherry picked from commit 050f8d0)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in 050f8d0 Aug 21, 2014
@jkbradley jkbradley deleted the dt-docs branch August 26, 2014 17:41
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr  @manishamde @atalwalkar

Author: Joseph K. Bradley <[email protected]>

Closes apache#2063 from jkbradley/dt-docs and squashes the following commits:

2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
…pache#2063)

This PR moves `loadS3Authz` to `SparkContext` so that the unified auth
can be effective for `SparkContext` use cases
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants