[SPARK-6259][MLlib] Python API for LDA #6791
Conversation
Test build #34792 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #34788 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #34794 has finished for PR 6791 at commit
@mengxr Could you review this PR at your earliest convenience? Thanks and happy Friday!
```scala
/**
 * Java stub for Python mllib LDA.run()
 */
def trainLDAModel(
    data: JavaRDD[LabeledPoint],
```
Note from discussion: Let's avoid use of LabeledPoint. We can use fromTuple2RDD to handle the pair of values.
Hmm, it seems that we can't use `JavaRDD[(Long, Vector)]` as the parameter type of `trainLDAModel`. When I ran the test, the Python tuple was recognized as `java.lang.Object` in Java, and I got the following error:
```
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to scala.Tuple2
	at org.apache.spark.mllib.api.python.PythonMLLibAPI$$anonfun$6.apply(PythonMLLibAPI.scala:489)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	...
```
Following the pattern of `trainFPGrowthModel` and `trainWord2Vec`, it would be better to handle the input data as an array. What do you think about this implementation?
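To make the array-based convention concrete, here is a minimal Python sketch (hypothetical helper name, not code from this PR): each document is shipped as a `[doc_id, term_counts]` list instead of a tuple, since Python tuples arrive on the JVM as plain `Object` arrays and trigger the `Tuple2` cast error above.

```python
# Hypothetical sketch: package (doc_id, term_counts) pairs as lists,
# mirroring the array-based convention of trainFPGrowthModel and
# trainWord2Vec. Lists deserialize cleanly on the Java side, avoiding
# the scala.Tuple2 ClassCastException.
def to_array_rows(corpus):
    return [[doc_id, counts] for doc_id, counts in corpus]

corpus = [(0, [1.0, 2.0, 6.0]), (1, [1.0, 3.0, 0.0])]
rows = to_array_rows(corpus)
# rows == [[0, [1.0, 2.0, 6.0]], [1, [1.0, 3.0, 0.0]]]
```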
I just sent a PR against this one, which should get around this issue. Let me know what you think.
Hmm, it's a little difficult to decide which is better. From the user's point of view, the differences between yours and mine are:
- Each row is an array vs. a tuple
- The feature data type is an array vs. DenseVector/SparseVector

Does yours support sparse vectors?
I haven't tested mine with tuples yet; you're right, we should try that.
Mine should support arrays, Vectors (dense and sparse), NumPy, and SciPy types, since it passes everything through `_convert_to_vector`.
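As a rough illustration of the kind of dispatch being described, here is a simplified stand-in (not the actual pyspark `_convert_to_vector` implementation, which also handles NumPy arrays and SciPy sparse matrices) that accepts either a dense sequence or a sparse `{index: value}` mapping:

```python
# Simplified, hypothetical stand-in for a vector-conversion helper:
# a list/tuple is treated as a dense vector; a dict of {index: value}
# is treated as a sparse vector and expanded to dense form.
def convert_to_dense(v, size=None):
    if isinstance(v, (list, tuple)):
        return [float(x) for x in v]
    if isinstance(v, dict):
        if size is None:
            size = max(v) + 1 if v else 0
        dense = [0.0] * size
        for i, x in v.items():
            dense[i] = float(x)
        return dense
    raise TypeError("cannot convert %r to a vector" % (v,))
```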
Thank you for letting me know about converting those types with `_convert_to_vector`.
And sorry that our discussion is scattered around GitHub :( Could you please read my survey? yu-iskw#3 (comment)
I still wonder how we should treat a Java `Long` from Python.
Test build #35556 has finished for PR 6791 at commit
Test build #35559 has finished for PR 6791 at commit
```scala
if (seed != null) algo.setSeed(seed)

val documents = data.rdd.map(_.asScala.toArray).map { r =>
  r(0).getClass.getSimpleName match {
```
```scala
r(0) match {
  case i: java.lang.Integer => i.toLong
  case l: java.lang.Long => l
}
```
Test build #35611 has finished for PR 6791 at commit
```scala
val documents = data.rdd.map(_.asScala.toArray).map { r =>
  r(0) match {
    case i: java.lang.Integer =>
      (r(0).asInstanceOf[java.lang.Integer].toLong, r(1).asInstanceOf[Vector])
```
Simplify: `r(0).asInstanceOf[java.lang.Integer].toLong` -> `i.toLong`
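The same id-widening concern exists on the Python side before the data ever reaches the JVM. A small sketch (hypothetical helper, assuming Python 3 semantics where `int` covers the range of Java's `long`) of validating document ids up front:

```python
# Hypothetical helper: coerce a document id to a plain int so the JVM
# side can treat it uniformly as a long. Rejects bools (a bool is an
# int subclass in Python) and anything non-integral.
def normalize_doc_id(doc_id):
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        raise TypeError("document id must be an integer, got %r" % (doc_id,))
    return int(doc_id)
```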
@yu-iskw Would you mind making those 2 updates and fixing the merge issues? Other than that, this looks ready.
Test build #36377 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #36380 has finished for PR 6791 at commit
```python
:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
    (default = EMLDAOptimizer)
```
Sorry, I should have noticed this earlier: this should say "em", since that is the actual value specified. Can it also list the 2 supported values ("em" and "online")?
Just that one item...
@jkbradley Thank you for your feedback. I added the comment.
Test build #36443 has finished for PR 6791 at commit
```python
:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
    (default = EMLDAOptimizer). Currently "em", "online" are supported. Default to "em".
```
I think it would make more sense to write only 1 default (remove "(default = EMLDAOptimizer)").
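Applying that suggestion, the parameter description could read something like the following (illustrative wording only; the function signature here is a minimal hypothetical stub, not the real `LDA.train`):

```python
def train(rdd, k=10, optimizer="em"):
    """Hypothetical excerpt of the revised docstring.

    :param optimizer: LDAOptimizer used to perform the actual
        calculation. Currently "em" and "online" are supported.
        Default to "em".
    """
    pass
```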
LGTM pending tests. Thanks very much!
Test build #36878 has finished for PR 6791 at commit
Wait, I just noticed a thing or two. Sorry I missed them before!
```
@@ -562,5 +564,67 @@ def _test():
    exit(-1)


class LDAModel(JavaModelWrapper):
```
@davies In Scala, LDAModel is abstract. LocalLDAModel and DistributedLDAModel inherit from it. We should eventually have this same setup in Python. What is needed to maintain backwards compatibility? If we add this API in Spark 1.5, can we later make LDAModel abstract, and have LocalLDAModel and DistributedLDAModel inherit from it?
@davies What do you think about that? Should we also support the local one in Python?
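A minimal sketch of the hierarchy being discussed (class names follow the Scala side; the method shown is hypothetical). The backwards-compatibility point is that if `LDAModel` ships first as a concrete class and is later made abstract with `LocalLDAModel` and `DistributedLDAModel` inheriting from it, existing `isinstance(model, LDAModel)` checks keep working:

```python
# Sketch of mirroring the Scala-side class hierarchy in Python.
class LDAModel:
    """Base model; could later become abstract."""
    def topics_matrix(self):
        raise NotImplementedError

class LocalLDAModel(LDAModel):
    """Model stored locally on the driver."""
    def __init__(self, topics):
        self._topics = topics

    def topics_matrix(self):
        return self._topics

class DistributedLDAModel(LDAModel):
    """Model whose state stays distributed on the cluster."""
    def __init__(self, topics):
        self._topics = topics

    def topics_matrix(self):
        return self._topics
```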
@jkbradley I got it. Thank you for letting me know.
Jenkins, test this please.
Test build #37284 has finished for PR 6791 at commit
Test build #17 has finished for PR 6791 at commit
Test build #37286 has finished for PR 6791 at commit
LGTM, merging with master
@jkbradley thank you for merging it!
I implemented the Python API for LDA. However, I didn't implement `LDAModel.describeTopics()`, because it's a little hard to implement right now. Adding documentation and an example for it would also fit better in another issue.

TODO: `LDAModel.describeTopics()` in Python must also be implemented, but it would be better to handle it in another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.
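For reference, a pure-Python sketch of the shape `describeTopics` could eventually take (hypothetical signature; in Scala it returns, per topic, a pair of term indices and term weights sorted by weight):

```python
# Hypothetical sketch of describeTopics: given a topics matrix where
# topics[t][w] is the weight of term w in topic t, return one
# (term_indices, term_weights) pair per topic, with the top max_terms
# terms sorted by descending weight.
def describe_topics(topics, max_terms):
    result = []
    for weights in topics:
        order = sorted(range(len(weights)), key=lambda w: -weights[w])[:max_terms]
        result.append((order, [weights[w] for w in order]))
    return result

topics = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
# describe_topics(topics, 2) == [([1, 2], [0.6, 0.3]), ([0, 2], [0.5, 0.3])]
```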