[MLLIB] SPARK-2311: Added additional GLMs (Poisson and Gamma) into MLlib #1237
Would you mind creating a JIRA for this and formatting the title correctly? See the green box here - thanks! https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Oops! I didn't notice this one. Created #1243 just now. We actually implemented exactly the same idea of Poisson regression, with only some tiny differences in how the gradient of the negative log-likelihood is calculated and in the test suites. I commented inline in the code. Please check it.
val brzWeights = weights.toBreeze
val dotProd = brzWeights.dot(brzData)
val diff = math.exp(dotProd) - label
val loss = -dotProd * label + math.exp(dotProd) + fact(label.toInt)
We can safely remove the fact(.) part, because it has no effect on the resulting weights.
Right. Removed it
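For reference, the reasoning behind dropping that term can be written out as a short derivation. This is a sketch assuming the standard log-link Poisson model, with weights $w$, features $x$, label $y$, and rate $\lambda = \exp(w^\top x)$; these symbols are introduced here for illustration and do not appear in the patch itself.

```latex
\begin{aligned}
\ell(w) &= -\log P(y \mid x, w)
         = \exp(w^\top x) - y\, w^\top x + \log(y!) \\
\nabla_w \ell(w) &= \bigl(\exp(w^\top x) - y\bigr)\, x
\end{aligned}
```

The last term of $\ell(w)$ depends only on the label, so it vanishes from the gradient. Whichever form that term takes (the diff adds `fact(label.toInt)`, presumably a factorial helper), removing it shifts the reported loss by a constant but leaves the minimizing weights unchanged. The factor $\exp(w^\top x) - y$ corresponds to `diff` in the snippet above.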
…est cases. Added a Poisson regression data generator for generating multi-dimensional test data.
Merging some of the features in #1243 into this PR via xwei-datageek#2. Please take a look.
LBFGS optimizer and new test cases for Poisson and Gamma regression
Could one of the admins verify this patch?
One more thing. Per our discussion in the line note, let's change SimpleUpdater to SquaredL2Updater.
@mengxr Please review this.
@xwei-datageek @BaiGang The current naming scheme Problem+Algorithm doesn't scale. I'm working on some standardized interfaces so that we can decouple them. Do you mind me doing the review after that is done? Thanks!
@mengxr Sure. Never mind. It will be great to have standard and decoupled interfaces. BTW, do we have a JIRA or pull request for tracking these changes?
@mengxr I was just wondering when (approximately) the standardized interfaces to decouple Problem+Algorithm will be finished?
Sorry, I'm still working on it and will post the design doc on JIRA soon. Unfortunately, it may not make the v1.1 release.
Can one of the admins verify this patch? |
I imagine this is too far out of date, and perhaps obsolete given the new ML API coming. Mind closing this PR? |
@srowen This work was originally for version 1.0.x and is pretty outdated. @xwei-datageek Xiaokai, I think it's OK to close this PR. As for modeling count data using regression models, I think SparkR with the glm package would be a good solution, though I have not looked into it deeply.
SPARK-2311 - Added additional GLMs (Poisson and Gamma) into MLlib
implemented PoissonRegressionSGD and GammaRegressionSGD.