
[SPARK-10064] [ML] Parallelize decision tree bin split calculations #8246

Closed

Conversation

@NathanHowell

Reimplement `DecisionTree.findSplitsBins` via RDD to parallelize bin calculation.

With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
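A minimal sketch of the approach, assuming an `input: RDD[LabeledPoint]`; the helper name `findContinuousSplits`, the partitioning choice, and the placeholder split selection are illustrative assumptions, not the exact code in this patch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: ship only the continuous feature values to the cluster and compute
// each feature's split candidates in parallel, one feature per group.
def findContinuousSplits(
    input: RDD[LabeledPoint],
    continuousFeatures: Seq[Int]): Map[Int, Array[Double]] = {
  input
    .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    .groupByKey(numPartitions = continuousFeatures.length)
    .mapValues { values =>
      // Placeholder split selection: sorted distinct values stand in for the
      // real quantile-based candidate computation.
      values.toArray.sorted.distinct
    }
    .collectAsMap()
    .toMap
}
```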

@NathanHowell (Author)

cc/ @jkbradley @manishamde @mengxr

@jkbradley (Member)

This sounds great, thanks! I'll need to finish up with QA for 1.5 before taking a look, but please ping me if I don't return to review before long.

@NathanHowell (Author)

Will do, thanks.

@manishamde (Contributor)

Thanks @NathanHowell. Sorry for not responding earlier; will try to review soon.

val bins = {
  val lowSplit = new DummyLowSplit(featureIndex, Continuous)
  val highSplit = new DummyHighSplit(featureIndex, Continuous)
  // slide across the boundary splits pairwise to allocate one bin per pair
  (lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
    case Seq(left, right) => new Bin(left, right, Continuous, Double.MinValue)
  }.toArray
}
Contributor:

A comment here will be helpful.

Author:

Done.

@jkbradley (Member)

@NathanHowell I just looked this over, and it seems fine. A few questions:

Did you encounter this problem in practice? I would have thought that choosing splits would take much less time than training itself. Was training itself fast for you?

Note: There are currently 2 tree implementations, one in spark.mllib (which you're modifying) and one in spark.ml. I eventually want to remove the spark.mllib one and wrap the spark.ml one (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala). We can commit this here and then copy it over to spark.ml in another PR.

Thanks!

@NathanHowell (Author)

@jkbradley Yes, this fixed a real problem. The problem stemmed from attempting to use categorical features with high arity (250+), which has the side effect of increasing the number of bins used for continuous features and massively increasing the sample size. The sampling and bin calculation would take multiple hours; after the patch it completes in ~2 minutes.

Training itself would run okay but also wasn't terribly fast. I've had to switch to a different RF implementation (https://github.com/stripe/brushfire) in order to have training complete in a reasonable amount of time. The MLlib implementation would train a single tree in 3-4 days on 2000 executors, whereas Brushfire does the same in about 6 hours with per-split bin calculation and in about 45 minutes with log bucketing.

@manishamde (Contributor)

@NathanHowell @jkbradley We should consider making the number of bins per feature and the sample size configurable to avoid the side effects mentioned above.

Did the brushfire implementation also have a similar number of splits and tree depth? Also, could you describe the training data a little more -- number of samples, number of categorical/continuous features, etc.?

@NathanHowell (Author)

@manishamde Yes, same parameters. This dataset is about 100M examples; I'm not sure offhand of the exact number of features, but probably about 5k categorical features (typically with single-digit arity) and 100-200k continuous features.

The binning slowdown can also be avoided by using a one-hot encoding, but that isn't convenient. Having per-feature bin counts would be nice, and binning can (probably?) be ignored for categorical features altogether.
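For reference, a minimal spark.ml sketch of the one-hot workaround mentioned above, with a hypothetical high-arity string column named `category`:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Index the string category, then expand it into a sparse indicator vector,
// turning one high-arity categorical feature into many binary features.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
```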

@jkbradley (Member)

@NathanHowell The real problem for MLlib is having a large number of features; I'd say a few thousand is a reasonable limit right now. I'm working on a new implementation which should be much faster for more features, scheduled for 1.6.

Curious: When you ran MLlib, did you try useNodeIdCache, with checkpointing turned on? That should help a little.
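That suggestion looks roughly like the following; `trainingData` and the checkpoint path are illustrative assumptions:

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

sc.setCheckpointDir("hdfs:///tmp/tree-checkpoints")
val strategy = Strategy.defaultStrategy("Classification")
strategy.useNodeIdCache = true    // cache node IDs on workers instead of shipping trees
strategy.checkpointInterval = 10  // checkpoint the node ID cache periodically
val model = RandomForest.trainClassifier(
  trainingData, strategy, numTrees = 100,
  featureSubsetStrategy = "sqrt", seed = 42)
```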

Also, do you know if the 2 implementations learned similarly sized trees?

@NathanHowell (Author)

@jkbradley Yeah, I figured out as much. A lot of the internals also assume a dense feature space, which isn't very common with a large number of features.

I tried it with useNodeIdCache on and off and it didn't make much of a difference. Checkpointing was also enabled... the dataset fits in memory though (8TB was allocated to RDD storage); I don't know how much that changes things.

The lower levels of the trees (near the roots) look similar, but I haven't done a thorough comparison of the two.

@jkbradley (Member)

Considering sparse representations would be nice, though I think that's a less common use case for trees (vs GLMs). The other (bigger) issue is communication (which is what my new implementation will address). The main difference between MLlib and PLANET right now is that MLlib does not switch to local learning for deep layers of the tree.

For comparing, I'm mainly curious about the total number of nodes in the tree, just to make sure they are fairly similar.

@NathanHowell (Author)

@jkbradley It's not that uncommon to use sparse features for time series classification (sparse interval features) with RF.

Training a linear model with spark.ml's pipeline is (was?) prohibitively expensive for this dataset. Normalizing features using spark.ml's transforms generates two stages per feature rather than two stages total (the first for density estimation and a second for the bucket transform). I can do this myself, but that sort of defeats the purpose of using ml instead of mllib.

@jkbradley (Member)

You should be able to assemble all of the features into one feature vector (using VectorAssembler) and then apply Normalizer, StandardScaler, etc.
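A sketch of that two-stage shape, with illustrative column names and an assumed DataFrame `df`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Two pipeline stages in total, rather than two stages per feature.
val assembler = new VectorAssembler()
  .setInputCols(Array("f0", "f1", "f2"))
  .setOutputCol("features")
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val model = new Pipeline().setStages(Array(assembler, scaler)).fit(df)
```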

@NathanHowell (Author)

@jkbradley I was looking at Bucketizer, which seemed to be closest to what I wanted (a cumulative density transform), and I certainly could have been doing something wrong... it's not relevant to this ticket though.
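For reference, Bucketizer only applies user-supplied split points, so the quantiles for a CDF-style transform would have to be computed beforehand; a minimal sketch with illustrative splits and an assumed DataFrame `df`:

```scala
import org.apache.spark.ml.feature.Bucketizer

// Map a continuous column into buckets at fixed split points.
val bucketizer = new Bucketizer()
  .setInputCol("rawFeature")
  .setOutputCol("bucketedFeature")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
val bucketed = bucketizer.transform(df)
```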

@jkbradley (Member)

I see. I guess this is the issue: https://issues.apache.org/jira/browse/SPARK-8418. We can discuss more there if it's helpful.

@jkbradley (Member)

ok to test

@jkbradley (Member)

Did you have to rebase because of merge conflicts?

@SparkQA

SparkQA commented Sep 11, 2015

Test build #1743 has finished for PR 8246 at commit 93c9e71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell (Author)

I tend to rebase out of habit to prevent merge-build failures. I'll look at the test failures on Monday; they were all passing at one point.

@jkbradley (Member)

I think it's because I asked you to remove the metadata update from that method, which now runs on the workers. The test could be modified to look at the length of the returned splits.
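A hypothetical shape for that test change (identifiers are illustrative, not the exact suite code):

```scala
// Check the lengths of the returned splits/bins rather than mutated metadata.
val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
assert(splits(0).length === metadata.numSplits(0))
assert(bins(0).length === metadata.numBins(0))
```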

@jkbradley (Member)

Would you mind updating this? If you don't have time, let me know and I'll send a PR to your PR. Thanks!

@NathanHowell (Author)

I'll have time tomorrow.

@jkbradley (Member)

OK thanks!

@NathanHowell (Author)

There were already tests for the returned split lengths, so I just removed the metadata checks.

@jkbradley (Member)

LGTM pending tests

@SparkQA

SparkQA commented Oct 7, 2015

Test build #1855 has finished for PR 8246 at commit abdcc47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Merging with master. Would you be able to make a similar change for the spark.ml implementation? Thank you!

asfgit closed this in 1bc435a on Oct 8, 2015.
jkbradley pushed a commit to jkbradley/spark that referenced this pull request Mar 20, 2016
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](apache#8246) which added the same feature for MLlib.

Author: sethah <[email protected]>

Closes apache#10231 from sethah/SPARK-12182.
roygao94 pushed a commit to roygao94/spark that referenced this pull request on Mar 22, 2016, with the same commit message (apache#10231, SPARK-12182).
NathanHowell deleted the SPARK-10064 branch on December 8, 2016.