Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6259][MLlib] Python API for LDA #6791

Closed
wants to merge 15 commits into from
Closed

Conversation

yu-iskw
Copy link
Contributor

@yu-iskw yu-iskw commented Jun 12, 2015

I implemented the Python API for LDA. But I didn't implemented a method for LDAModel.describeTopics(), beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue.

TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of describeTopics in Scala consists of Tuple classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34792 has finished for PR 6791 at commit c90fcfb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jun 12, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34788 has finished for PR 6791 at commit 44a3a03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jun 12, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34794 has finished for PR 6791 at commit c90fcfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jun 12, 2015

@mengxr Could you review this PR at at your earliest convenience? Thanks and happy Friday!

* Java stub for Python mllib LDA.run()
*/
def trainLDAModel(
data: JavaRDD[LabeledPoint],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note from discussion: Let's avoid use of LabeledPoint. We can use fromTuple2RDD to handle the pair of values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, it seems that we can't deal with JavaRDD[(Long, Vector)] in the parameter of trainLDAModel. After all, I got the error message as follows, when I run the test. Tuple of Python was recognized as java.lang.Object in Java.

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to scala.Tuple2
        at org.apache.spark.mllib.api.python.PythonMLLibAPI$$anonfun$6.apply(PythonMLLibAPI.scala:489)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
...

According to trainFPGrowthModel and trainWord2Vec, dealing with input data as array would be better. What do you think about this implementation?

yu-iskw@cdb2cf1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just sent a PR against this one, which should get around this issue. Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, it is a little difficult to decide which is the better. The different point between yours and mine from the users point of view are :

  • Each row type is an array or a tuple
  • Feature data type is array or DenceVector/SparceVector
  • Does yours support sparse vector?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested mine with tuples yet; you're right we should try that.

Mine should support arrays, Vectors (dense & sparse), numpy, and scipy types since it passes everything through _convert_to_vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for letting me know about converting those types with _convert_to_vector.
And sorry about that our discussion is here and there on github :( Could you please read my survey?
yu-iskw#3 (comment)

I still wonder how we should treat Long in Java from python.

@SparkQA
Copy link

SparkQA commented Jun 23, 2015

Test build #35556 has finished for PR 6791 at commit c3aa6ef.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@SparkQA
Copy link

SparkQA commented Jun 23, 2015

Test build #35559 has finished for PR 6791 at commit 4a559ad.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

if (seed != null) algo.setSeed(seed)

val documents = data.rdd.map(_.asScala.toArray).map { r =>
r(0).getClass.getSimpleName match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r(0) match {
case i: java.lang.Integer => i.toLong
case l: java.lang.Long => l
}

@SparkQA
Copy link

SparkQA commented Jun 24, 2015

Test build #35611 has finished for PR 6791 at commit 0cd6e04.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

val documents = data.rdd.map(_.asScala.toArray).map { r =>
r(0) match {
case i: java.lang.Integer =>
(r(0).asInstanceOf[java.lang.Integer].toLong, r(1).asInstanceOf[Vector])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify: r(0).asInstanceOf[java.lang.Integer].toLong -> i.toLong

@jkbradley
Copy link
Member

@yu-iskw Would you mind making those 2 updates & fixing the merge issues. Other than that, this looks ready.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36377 has finished for PR 6791 at commit 116dd61.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jul 2, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36380 has finished for PR 6791 at commit 116dd61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
(default = EMLDAOptimizer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have noticed this earlier: This should say "em" since that is the actual value specified. Can it also say the 2 supported values (em and online)?

@jkbradley
Copy link
Member

Just that one item...

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jul 2, 2015

@jkbradley Thank you for your feedback. I added the comment.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36443 has finished for PR 6791 at commit 30eb8e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
(default = EMLDAOptimizer). Currently "em", "online" are supported. Default to "em".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense to write only 1 default (remove "(default = EMLDAOptimizer)").

@jkbradley
Copy link
Member

LGTM pending tests. Thanks very much!

@SparkQA
Copy link

SparkQA commented Jul 9, 2015

Test build #36878 has finished for PR 6791 at commit d8b1dc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA():

@jkbradley
Copy link
Member

Wait, I just noticed a thing or two. Sorry I missed them before!

@@ -562,5 +564,67 @@ def _test():
exit(-1)


class LDAModel(JavaModelWrapper):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies In Scala, LDAModel is abstract. LocalLDAModel and DistributedLDAModel inherit from it. We should eventually have this same setup in Python. What is needed to maintain backwards compatibility? If we add this API in Spark 1.5, can we later make LDAModel abstract, and have LocalLDAModel and DistributedLDAModel inherit from it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies What do you think about that? Should we also support the local one in Python?

@jkbradley
Copy link
Member

@yu-iskw It's possible we can address these issues in a follow-up PR, but I wanted to ask @davies about it before merging.

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jul 9, 2015

@jkbradley I got it. Thank you for letting me know.

@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jul 15, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37284 has finished for PR 6791 at commit 788db8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA(object):

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #17 has finished for PR 6791 at commit 6855f59.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37286 has finished for PR 6791 at commit 6855f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA(object):

@jkbradley
Copy link
Member

LGTM, merging with master
Thank you for the PR (and for rebasing)!

@asfgit asfgit closed this in 4692769 Jul 15, 2015
@yu-iskw
Copy link
Contributor Author

yu-iskw commented Jul 15, 2015

@jkbradley thank you for merging it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants