
[SPARK-10064] [ML] Parallelize decision tree bin split calculations #8246

Closed

Conversation

@NathanHowell

Reimplement `DecisionTree.findSplitsBins` via RDD to parallelize bin calculation.

With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
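A minimal sketch of the approach, assuming an `input: RDD[LabeledPoint]`; the helper name `findContinuousSplits`, the partitioning choice, and the placeholder split selection are illustrative assumptions, not the exact code in this patch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: ship only the continuous feature values to the cluster and compute
// each feature's split candidates in parallel, one feature per group.
def findContinuousSplits(
    input: RDD[LabeledPoint],
    continuousFeatures: Seq[Int]): Map[Int, Array[Double]] = {
  input
    .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    .groupByKey(numPartitions = continuousFeatures.length)
    .mapValues { values =>
      // Placeholder split selection: sorted distinct values stand in for the
      // real quantile-based candidate computation.
      values.toArray.sorted.distinct
    }
    .collectAsMap()
    .toMap
}
```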

@NathanHowell (Author)

cc/ @jkbradley @manishamde @mengxr

@jkbradley (Member)

This sounds great, thanks! I'll need to finish up with QA for 1.5 before taking a look, but please ping me if I don't return to review before long.

@NathanHowell (Author)

Will do, thanks.

@manishamde (Contributor)

Thanks @NathanHowell. Sorry for not responding earlier; will try to review soon.

val bins = {
  val lowSplit = new DummyLowSplit(featureIndex, Continuous)
  val highSplit = new DummyHighSplit(featureIndex, Continuous)
  // slide across the boundary splits pairwise to allocate one bin per pair
  (lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
    case Seq(left, right) => new Bin(left, right, Continuous, Double.MinValue)
  }.toArray
}
Contributor:

A comment here will be helpful.

Author:

Done.

@jkbradley (Member)

@NathanHowell I just looked this over, and it seems fine. A few questions:

Did you encounter this problem in practice? I would have thought that choosing splits would take much less time than training itself. Was training itself fast for you?

Note: There are currently 2 tree implementations, one in spark.mllib (which you're modifying) and one in spark.ml. I eventually want to remove the spark.mllib one and wrap the spark.ml one (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala). We can commit this here and then copy it over to spark.ml in another PR.

Thanks!

@NathanHowell (Author)

@jkbradley Yes, this fixed a real problem. The problem stemmed from attempting to use categorical features with high arity (250+), which has the side effect of increasing the number of bins used for continuous features and massively increasing the sample size. The sampling and bin calculation would take multiple hours; after the patch it completes in ~2 minutes.

Training itself would run okay but also wasn't terribly fast. I've had to switch to a different RF implementation (https://github.com/stripe/brushfire) in order to have training complete in a reasonable amount of time. The MLlib implementation would train a single tree in 3-4 days on 2000 executors, whereas Brushfire does the same in about 6 hours with per-split bin calculation and in about 45 minutes with log bucketing.

@manishamde (Contributor)

@NathanHowell @jkbradley We should consider making the number of bins per feature and the sample size configurable to avoid the side effects mentioned above.

Did the brushfire implementation also have a similar number of splits and tree depth? Also, could you describe the training data a little more -- number of samples, number of categorical/continuous features, etc.?

@NathanHowell (Author)

@manishamde Yes, same parameters. This dataset is about 100M examples; I'm not sure offhand of the exact number of features, but probably about 5k categorical features (typically with single-digit arity) and 100-200k continuous features.

The binning slowdown can also be avoided by using a one-hot encoding, but that isn't convenient. Having per-feature bin counts would be nice, and binning can (probably?) be ignored for categorical features altogether.
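For reference, a minimal spark.ml sketch of the one-hot workaround mentioned above, with a hypothetical high-arity string column named `category`:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Index the string category, then expand it into a sparse indicator vector,
// turning one high-arity categorical feature into many binary features.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
```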

@jkbradley (Member)

@NathanHowell The real problem for MLlib is having a large number of features; I'd say a few thousand is a reasonable limit right now. I'm working on a new implementation which should be much faster for more features, scheduled for 1.6.

Curious: When you ran MLlib, did you try useNodeIdCache, with checkpointing turned on? That should help a little.
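That suggestion looks roughly like the following; `trainingData` and the checkpoint path are illustrative assumptions:

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

sc.setCheckpointDir("hdfs:///tmp/tree-checkpoints")
val strategy = Strategy.defaultStrategy("Classification")
strategy.useNodeIdCache = true    // cache node IDs on workers instead of shipping trees
strategy.checkpointInterval = 10  // checkpoint the node ID cache periodically
val model = RandomForest.trainClassifier(
  trainingData, strategy, numTrees = 100,
  featureSubsetStrategy = "sqrt", seed = 42)
```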

Also, do you know if the 2 implementations learned similarly sized trees?

@NathanHowell (Author)

@jkbradley Yeah, I figured out as much. A lot of the internals also assume a dense feature space, which isn't very common with a large number of features.

I tried it with useNodeIdCache on and off and it didn't make much of a difference. Checkpointing was also enabled... the dataset fits in memory though (8TB was allocated to RDD storage); I don't know how much that changes things.

The lower levels of the trees (near the roots) look similar, but I haven't done a thorough comparison of the two.

@jkbradley (Member)

Considering sparse representations would be nice, though I think that's a less common use case for trees (vs GLMs). The other (bigger) issue is communication (which is what my new implementation will address). The main difference between MLlib and PLANET right now is that MLlib does not switch to local learning for deep layers of the tree.

For comparing, I'm mainly curious about the total number of nodes in the tree, just to make sure they are fairly similar.

@NathanHowell (Author)

@jkbradley It's not that uncommon to use sparse features for time series classification (sparse interval features) with RF.

Training a linear model with spark.ml's pipeline is (was?) prohibitively expensive for this dataset. Normalizing features using spark.ml's transforms generates two stages per feature rather than two stages total (the first for density estimation and a second for the bucket transform). I can do this myself, but that sort of defeats the purpose of using ml instead of mllib.

@jkbradley (Member)

You should be able to assemble all of the features into one feature vector (using VectorAssembler) and then apply Normalizer, StandardScaler, etc.
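A sketch of that two-stage shape, with illustrative column names and an assumed DataFrame `df`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Two pipeline stages in total, rather than two stages per feature.
val assembler = new VectorAssembler()
  .setInputCols(Array("f0", "f1", "f2"))
  .setOutputCol("features")
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val model = new Pipeline().setStages(Array(assembler, scaler)).fit(df)
```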

@NathanHowell (Author)

@jkbradley I was looking at Bucketizer, which seemed to be closest to what I wanted (a cumulative density transform), and I certainly could have been doing something wrong... it's not relevant to this ticket though.
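For reference, Bucketizer only applies user-supplied split points, so the quantiles for a CDF-style transform would have to be computed beforehand; a minimal sketch with illustrative splits and an assumed DataFrame `df`:

```scala
import org.apache.spark.ml.feature.Bucketizer

// Map a continuous column into buckets at fixed split points.
val bucketizer = new Bucketizer()
  .setInputCol("rawFeature")
  .setOutputCol("bucketedFeature")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
val bucketed = bucketizer.transform(df)
```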

@jkbradley (Member)

I see. I guess this is the issue: https://issues.apache.org/jira/browse/SPARK-8418. We can discuss more there if it's helpful.

@jkbradley (Member)

ok to test

@jkbradley (Member)

Did you have to rebase because of merge conflicts?

@SparkQA

SparkQA commented Sep 11, 2015

Test build #1743 has finished for PR 8246 at commit 93c9e71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell (Author)

I tend to rebase out of habit to prevent merge-build failures. I'll look at the test failures on Monday; they were all passing at one point.

@jkbradley (Member)

I think it's because I asked you to remove the metadata update from that method, which now runs on the workers. The test could be modified to look at the length of the returned splits.
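A hypothetical shape for that test change (identifiers are illustrative, not the exact suite code):

```scala
// Check the lengths of the returned splits/bins rather than mutated metadata.
val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
assert(splits(0).length === metadata.numSplits(0))
assert(bins(0).length === metadata.numBins(0))
```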

@jkbradley (Member)

Would you mind updating this? If you don't have time, let me know and I'll send a PR to your PR. Thanks!

@NathanHowell (Author)

I'll have time tomorrow.

@jkbradley (Member)

OK thanks!

@NathanHowell (Author)

There were already tests for the returned split lengths, so I just removed the metadata checks.

@jkbradley (Member)

LGTM pending tests

@SparkQA

SparkQA commented Oct 7, 2015

Test build #1855 has finished for PR 8246 at commit abdcc47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Merging with master. Would you be able to make a similar change for the spark.ml implementation? Thank you!

asfgit closed this in 1bc435a on Oct 8, 2015.
jkbradley pushed a commit to jkbradley/spark that referenced this pull request Mar 20, 2016
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](apache#8246) which added the same feature for MLlib.

Author: sethah <[email protected]>

Closes apache#10231 from sethah/SPARK-12182.
roygao94 pushed a commit to roygao94/spark that referenced this pull request on Mar 22, 2016, with the same commit message (apache#10231, SPARK-12182).
NathanHowell deleted the SPARK-10064 branch on December 8, 2016.