[SPARK-10064] [ML] Parallelize decision tree bin split calculations #8246
Conversation
This sounds great, thanks! I'll need to finish up with QA for 1.5 before taking a look, but please ping me if I don't return to review before long.
Will do, thanks.
Thanks @NathanHowell. Sorry for not responding earlier. Will try to review soon.
```scala
val bins = {
  val lowSplit = new DummyLowSplit(featureIndex, Continuous)
  val highSplit = new DummyHighSplit(featureIndex, Continuous)
  (lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
```
A comment here will be helpful.
Done.
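For readers without the full diff, here is a simplified, self-contained sketch of the sliding-window bin construction quoted above. `Split` and `Bin` are hypothetical stand-ins for the mllib internals; `DummyLowSplit`/`DummyHighSplit` just pin the open ends of the range.

```scala
// Hypothetical stand-ins for the mllib-internal Split/Bin types.
case class Split(threshold: Double)
case class Bin(low: Split, high: Split)

val splits = Seq(Split(1.0), Split(2.5), Split(4.0))
val lowSplit = Split(Double.MinValue)  // plays the role of DummyLowSplit
val highSplit = Split(Double.MaxValue) // plays the role of DummyHighSplit

// Every adjacent pair of splits becomes one bin:
// (MinValue, 1.0), (1.0, 2.5), (2.5, 4.0), (4.0, MaxValue)
val bins = (lowSplit +: splits :+ highSplit).sliding(2).map {
  case Seq(low, high) => Bin(low, high)
}.toArray
```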
@NathanHowell I just looked this over, and it seems fine. A few questions: Did you encounter this problem in practice? I would have thought that choosing splits would take much less time than training itself. Was training itself fast for you? Note: There are currently 2 tree implementations, one in spark.mllib (which you're modifying) and one in spark.ml. I eventually want to remove the spark.mllib one and wrap the spark.ml one (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala). We can commit this here and then copy it over to spark.ml in another PR. Thanks!
@jkbradley Yes, this fixed a real problem. The problem stemmed from attempting to use categorical features with high arity (250+), which has a side effect of increasing the number of bins used for continuous features and massively increasing the sample size. The sampling and bin calculation would take multiple hours; after the patch it completes in ~2 minutes. Training itself would run okay but also wasn't terribly fast. I've had to switch to a different RF implementation (https://github.com/stripe/brushfire) in order to have training complete in a reasonable amount of time. The mllib implementation would train a single tree in 3-4 days on 2000 executors, whereas brushfire does the same in about 6 hours with per-split bin calculation and about 45 minutes with log bucketing.
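Brushfire's "log bucketing" isn't spelled out in this thread; purely as an illustrative sketch (not brushfire's actual code), one cheap alternative to quantile-based splits is logarithmically spaced thresholds, which need no sampling pass at all:

```scala
// Illustrative only: numBins - 1 interior thresholds, evenly spaced in log space.
def logBuckets(min: Double, max: Double, numBins: Int): Array[Double] = {
  require(min > 0.0 && max > min && numBins > 1)
  val (logMin, logMax) = (math.log(min), math.log(max))
  val step = (logMax - logMin) / numBins
  Array.tabulate(numBins - 1)(i => math.exp(logMin + step * (i + 1)))
}

logBuckets(0.001, 1000.0, 6) // thresholds at roughly 0.01, 0.1, 1, 10, 100
```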
@NathanHowell @jkbradley We should consider making bins per feature and sample sizes configurable to avoid the side effects mentioned above. Did the brushfire implementation also have a similar number of splits and tree depth? Also, could you describe the training data a little more -- number of samples, number of categorical/continuous features, etc.
@manishamde Yes, same parameters. This dataset is about 100M examples; not sure offhand on the exact number of features, but probably about 5k categorical features (typically with single-digit arity) and 100-200k continuous features. The binning slowdown can also be avoided by using a one-hot encoding, but it's not convenient. Having per-feature bin counts would be nice, and it can (probably?) be ignored for categorical features altogether.
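For reference, the one-hot workaround might look like this with the spark.ml feature transformers (column names are placeholders and `df` is an assumed DataFrame):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Index the string column, then expand the index into a sparse indicator vector,
// so the tree no longer sees one high-arity categorical feature.
val indexer = new StringIndexer()
  .setInputCol("category")       // placeholder column name
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexer.fit(df).transform(df))
```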
@NathanHowell The real problem for MLlib is having a large number of features; I'd say a few thousand is a reasonable limit right now. I'm working on a new implementation which should be much faster for more features, scheduled for 1.6. Curious: When you ran MLlib, did you try useNodeIdCache, with checkpointing turned on? That should help a little. Also, do you know if the 2 implementations learned similarly sized trees?
@jkbradley Yeah, I figured as much. A lot of the internals also assume a dense feature space, which isn't very common with a large number of features. I tried it with useNodeIdCache; the lower levels of the trees (near the roots) look similar, but I haven't done a thorough comparison of the two.
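For reference, turning on the node-ID cache with checkpointing looks roughly like this against the spark.mllib API; the checkpoint path and sizes are placeholders, `sc` is an assumed SparkContext, and `labeledPoints` an assumed `RDD[LabeledPoint]`.

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

sc.setCheckpointDir("hdfs:///tmp/checkpoints") // needed for periodic checkpointing
val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 10,
  numClasses = 2,
  maxBins = 32,
  useNodeIdCache = true,   // cache each instance's node ID instead of re-traversing
  checkpointInterval = 10) // checkpoint the cache every 10 iterations
val model = RandomForest.trainClassifier(
  labeledPoints, strategy, numTrees = 100,
  featureSubsetStrategy = "sqrt", seed = 42)
```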
Considering sparse representations would be nice, though I think that's a less common use case for trees (vs GLMs). The other (bigger) issue is communication (which is what my new implementation will address). The main difference between MLlib and PLANET right now is that MLlib does not switch to local learning for deep layers of the tree. For comparing, I'm mainly curious about the total number of nodes in the tree, just to make sure they are fairly similar.
@jkbradley It's not that uncommon to use sparse features for time series classification (sparse interval features) with RF. Training a linear model with spark.ml's pipeline is (was?) prohibitively expensive for this dataset. Normalizing features using spark.ml's transforms generates two stages per feature... rather than two stages total (the first for density estimation and a second for the bucket transform). I can do this myself but that sort of defeats the purpose of using pipelines.
You should be able to assemble all of the features into one feature vector (using VectorAssembler) and then apply Normalizer, StandardScaler, etc.
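That is, something like the following sketch, where the column names and `df` are placeholders; note it is two pipeline stages total, regardless of the number of feature columns:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Assemble all raw feature columns into a single vector, then scale it:
// two stages total, no matter how many feature columns there are.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3")) // placeholder column names
  .setOutputCol("features")
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val model = new Pipeline().setStages(Array(assembler, scaler)).fit(df)
val scaled = model.transform(df)
```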
@jkbradley I was looking at ...
I see. I guess this is the issue: https://issues.apache.org/jira/browse/SPARK-8418. We can discuss more there if it's helpful.
Force-pushed from 93c9e71 to 77a8600.
ok to test
Did you have to rebase b/c of merge conflicts?
Test build #1743 has finished for PR 8246 at commit ...
I tend to rebase out of habit to prevent merge-build failures. I'll look at the test failures on Monday; they were all passing at one point.
I think it's because I asked you to remove the metadata update from that method, which now runs on the worker. The test could be modified to look at the length of the returned splits.
Would you mind updating this? If you don't have time, then let me know, and I'll send a PR to your PR. Thanks!
I'll have time tomorrow.
OK, thanks!
There were already tests for the returned split lengths, so I just removed the metadata checks.
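A hedged sketch of the test shape being discussed (the helper and metadata names are illustrative, not the actual suite; the real method is package-private in mllib):

```scala
// Illustrative only: assert on the length of the splits returned for a single
// continuous feature, rather than on metadata mutated as a side effect.
// `metadata` is an assumed DecisionTreeMetadata configured for this feature.
val featureSamples = Array.fill(10000)(math.random) // synthetic feature values
val splits = DecisionTree.findSplitsForContinuousFeature(featureSamples, metadata, 0)
assert(splits.length == metadata.numSplits(0))
```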
LGTM pending tests.
Test build #1855 has finished for PR 8246 at commit ...
Merging with master. Would you be able to make a similar change for the spark.ml implementation? Thank you!
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](apache#8246), which added the same feature for MLlib. Author: sethah <[email protected]> Closes apache#10231 from sethah/SPARK-12182.
Reimplement `DecisionTree.findSplitsBins` via RDD to parallelize bin calculation.

With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
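A simplified sketch of the approach (names follow the mllib internals but are abbreviated here; `input` is the assumed sampled `RDD[LabeledPoint]` and `metadata` an assumed `DecisionTreeMetadata`):

```scala
// Keep only continuous features, shuffle each feature's sampled values to its own
// task, and compute that feature's split thresholds there instead of on the driver.
val continuousFeatures = (0 until metadata.numFeatures).filter(metadata.isContinuous)
val splitsByFeature = input
  .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
  .groupByKey(continuousFeatures.length)
  .map { case (idx, samples) =>
    (idx, findSplitsForContinuousFeature(samples.toArray, metadata, idx))
  }
  .collectAsMap()
```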