SPARK-1544 Add support for deep decision trees. #475

manishamde · 2014-04-22T04:08:33Z

@etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.

To summarize:

1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things: the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.

cc: @atalwalkar, @hirakendu, @mengxr
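
The memory-based calculation summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the object `DeepTreeMemorySketch` and its method names are hypothetical, and it assumes 8 bytes per aggregate element and a full binary tree with `2^level` nodes at each level (root at level 0).

```scala
// Hypothetical helper for illustration only; names are not taken from the PR.
object DeepTreeMemorySketch {

  /** Approximate bytes for one node's bin-aggregate array, assuming 8-byte Doubles. */
  def arraySizePerNode(numElementsPerNode: Long): Long = 8L * numElementsPerNode

  /** Number of nodes whose aggregates fit in the memory budget (never less than 1). */
  def maxNodesPerGroup(maxMemoryBytes: Long, numElementsPerNode: Long): Long =
    math.max(maxMemoryBytes / arraySizePerNode(numElementsPerNode), 1L)

  /** Deepest level whose 2^level nodes can all be trained in a single pass. */
  def maxSinglePassLevel(maxNodes: Long): Int =
    63 - java.lang.Long.numberOfLeadingZeros(maxNodes) // floor(log2(maxNodes)), maxNodes >= 1

  /** Number of groups (and therefore passes over the data) needed at a given level. */
  def numGroups(level: Int, maxNodes: Long): Long = {
    val nodesAtLevel = 1L << level
    (nodesAtLevel + maxNodes - 1) / maxNodes // ceiling division
  }
}
```

For example, with an assumed budget of 128 MiB (134217728 bytes) and an assumed 30000 aggregate elements per node, this sketch allows 559 nodes per group, trains levels 0 through 9 in one pass each, and splits level 12 into 8 groups.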

Parameterizing max memory.

AmplabJenkins · 2014-04-22T04:12:55Z

Merged build triggered.

AmplabJenkins · 2014-04-22T04:13:01Z

Merged build started.

AmplabJenkins · 2014-04-22T04:14:32Z

Merged build finished.

AmplabJenkins · 2014-04-22T04:14:32Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14316/

Fixing scalastyle issue.

AmplabJenkins · 2014-04-22T18:17:55Z

Merged build triggered.

AmplabJenkins · 2014-04-22T18:18:02Z

Merged build started.

AmplabJenkins · 2014-04-22T19:29:03Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-22T19:29:03Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14330/

etrain · 2014-04-23T18:59:03Z

Can one of the admins take a look at this? The Travis CI error seems to be in StreamingContext tests, which have nothing to do with this change.

mengxr · 2014-04-23T19:26:37Z

@etrain We are testing Travis CI. You can simply ignore the build results from it.

AmplabJenkins · 2014-04-24T00:07:55Z

Build triggered.

AmplabJenkins · 2014-04-24T00:08:35Z

Build started.

AmplabJenkins · 2014-04-24T01:39:40Z

Build finished.

AmplabJenkins · 2014-04-24T01:39:41Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14413/

manishamde · 2014-04-28T21:16:35Z

Can somebody please take a look at the PR?

techaddict · 2014-04-29T06:16:02Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

+    }
+    logDebug("numElementsPerNode = " + numElementsPerNode)
+    val arraySizePerNode = 8 * numElementsPerNode // approx. memory usage for bin aggregate array
+    val maxNumberOfNodesPerGroup = scala.math.max(maxMemoryUsage / arraySizePerNode, 1)


Why not just use math.max?

@techaddict Happy to change it. Is it cosmetic or is there something more to it?

@manishamde just cleanliness.

mengxr · 2014-04-29T07:56:29Z

@manishamde Could you try to merge the latest master?

mengxr · 2014-04-29T07:57:07Z

docs/mllib-classification-regression.md

-The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)* in memory for aggregating histograms over partitions. The current implementation might not scale to very deep trees since the memory requirement grows exponentially with tree depth. 
-
-Please drop us a line if you encounter any issues. We are planning to solve this problem in the near future and real-world examples will be great.
+### Implementation Details


FYI, the decision tree guide is now in mllib-decision-tree.md.
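
To make the exponential growth described in the removed guide text concrete, here is a rough sketch that evaluates the quoted `#features * #splits * 2^maxDepth` bound. The object and method names are hypothetical, constant factors hidden by the O-notation are ignored, and 8 bytes per `Double` is assumed.

```scala
// Hypothetical illustration of the quoted memory bound; not code from the PR.
object HistogramGrowthSketch {

  /** Element count following #features * #splits * 2^maxDepth (constants ignored). */
  def elements(numFeatures: Long, numSplits: Long, maxDepth: Int): Long =
    numFeatures * numSplits * (1L << maxDepth)

  /** Approximate bytes for the aggregation array, assuming 8 bytes per Double. */
  def bytes(numFeatures: Long, numSplits: Long, maxDepth: Int): Long =
    8L * elements(numFeatures, numSplits, maxDepth)
}
```

With an assumed 100 features and 100 splits, the estimate is about 78 MiB at depth 10 and about 78 GiB at depth 20, which is why a single-pass aggregation over all nodes of a level does not scale to deep trees.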

AmplabJenkins · 2014-04-29T21:47:57Z

Merged build triggered.

AmplabJenkins · 2014-04-29T21:48:07Z

Merged build started.

manishamde · 2014-05-06T17:30:10Z

@mengxr Thanks! I fixed the two code style issues.

AmplabJenkins · 2014-05-06T17:32:58Z

Merged build triggered.

AmplabJenkins · 2014-05-06T17:33:06Z

Merged build started.

AmplabJenkins · 2014-05-06T19:03:48Z

Merged build finished.

AmplabJenkins · 2014-05-06T19:03:48Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14724/

mengxr · 2014-05-07T16:56:15Z

@manishamde Could you add docs for numGroups and groupIndex to findBestSplitsPerGroup?

manishamde · 2014-05-07T17:29:38Z

@mengxr Sorry, escaped my attention. I ended up adding more documentation
in the vicinity. :-) Will fix shortly.

AmplabJenkins · 2014-05-07T17:47:58Z

Build triggered.

AmplabJenkins · 2014-05-07T17:48:04Z

Build started.

manishamde · 2014-05-07T17:49:58Z

@mengxr done!

AmplabJenkins · 2014-05-07T19:04:18Z

Build finished. All automated tests passed.

AmplabJenkins · 2014-05-07T19:04:19Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14778/

mengxr · 2014-05-07T20:58:34Z

LGTM. Thanks!

pwendell · 2014-05-07T23:02:33Z

@manishamde this needs to be brought up to master - would you mind merging it?

manishamde · 2014-05-07T23:13:39Z

Sure!

AmplabJenkins · 2014-05-07T23:22:59Z

Merged build triggered.

AmplabJenkins · 2014-05-07T23:23:06Z

Merged build started.

AmplabJenkins · 2014-05-07T23:58:19Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-07T23:58:20Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14792/

@etrain

@etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.

To summarize:
1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.

cc: @atalwalkar, @hirakendu, @mengxr

Author: Manish Amde <[email protected]>
Author: manishamde <[email protected]>
Author: Evan Sparks <[email protected]>

Closes #475 from manishamde/deep_tree and squashes the following commits:

968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees

(cherry picked from commit f269b01)
Signed-off-by: Patrick Wendell <[email protected]>

manishamde and others added 4 commits April 20, 2014 13:33

adding support for very deep trees

50b143a

Parameterizing max memory.

abc5a23

Merge pull request #5 from etrain/deep_tree

2f6072c

Parameterizing max memory.

minor: added doc for maxMemory parameter

2f1e093

etrain and others added 2 commits April 22, 2014 11:13

Fixing scalastyle issue.

0287772

Merge pull request #6 from etrain/deep_tree

fecf89a

Fixing scalastyle issue.

updating user documentation

719d009

techaddict reviewed Apr 29, 2014

mengxr reviewed Apr 29, 2014

manishamde added 2 commits April 29, 2014 14:43

merge from master

9dbdabe

updated documentation

1517155

minor formatting

ce004a1

added docs

7fc9545

merged master

968ca9d

asfgit closed this in f269b01 May 8, 2014
