[SPARK-6259][MLlib] Python API for LDA #6791
Conversation
Test build #34792 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #34788 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #34794 has finished for PR 6791 at commit
@mengxr Could you review this PR at your earliest convenience? Thanks and happy Friday!
```scala
/**
 * Java stub for Python mllib LDA.run()
 */
def trainLDAModel(
    data: JavaRDD[LabeledPoint],
```
Note from discussion: Let's avoid use of LabeledPoint. We can use fromTuple2RDD to handle the pair of values.
Hmm, it seems that we can't use `JavaRDD[(Long, Vector)]` as the parameter type of `trainLDAModel`. When I ran the test, the Python tuple was recognized as `java.lang.Object` in Java, and I got the following error:
```
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to scala.Tuple2
	at org.apache.spark.mllib.api.python.PythonMLLibAPI$$anonfun$6.apply(PythonMLLibAPI.scala:489)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	...
```
Following the pattern of `trainFPGrowthModel` and `trainWord2Vec`, it would be better to handle the input data as an array. What do you think about this implementation?
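To make the array-based convention concrete, here is a minimal Python sketch (hypothetical helper name, not code from this PR): each document is shipped as a `[doc_id, term_counts]` list instead of a tuple, since Python tuples arrive on the JVM as plain `Object` arrays and trigger the `Tuple2` cast error above.

```python
# Hypothetical sketch: package (doc_id, term_counts) pairs as lists,
# mirroring the array-based convention of trainFPGrowthModel and
# trainWord2Vec. Lists deserialize cleanly on the Java side, avoiding
# the scala.Tuple2 ClassCastException.
def to_array_rows(corpus):
    return [[doc_id, counts] for doc_id, counts in corpus]

corpus = [(0, [1.0, 2.0, 6.0]), (1, [1.0, 3.0, 0.0])]
rows = to_array_rows(corpus)
# rows == [[0, [1.0, 2.0, 6.0]], [1, [1.0, 3.0, 0.0]]]
```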
I just sent a PR against this one, which should get around this issue. Let me know what you think.
Hmm, it's a little difficult to decide which is better. From the user's point of view, the differences between yours and mine are:
- Each row is an array vs. a tuple
- The feature data type is an array vs. DenseVector/SparseVector

Does yours support sparse vectors?
I haven't tested mine with tuples yet; you're right, we should try that.
Mine should support arrays, Vectors (dense and sparse), NumPy, and SciPy types, since it passes everything through `_convert_to_vector`.
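As a rough illustration of the kind of dispatch being described, here is a simplified stand-in (not the actual pyspark `_convert_to_vector` implementation, which also handles NumPy arrays and SciPy sparse matrices) that accepts either a dense sequence or a sparse `{index: value}` mapping:

```python
# Simplified, hypothetical stand-in for a vector-conversion helper:
# a list/tuple is treated as a dense vector; a dict of {index: value}
# is treated as a sparse vector and expanded to dense form.
def convert_to_dense(v, size=None):
    if isinstance(v, (list, tuple)):
        return [float(x) for x in v]
    if isinstance(v, dict):
        if size is None:
            size = max(v) + 1 if v else 0
        dense = [0.0] * size
        for i, x in v.items():
            dense[i] = float(x)
        return dense
    raise TypeError("cannot convert %r to a vector" % (v,))
```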
Thank you for letting me know about converting those types with `_convert_to_vector`.
And sorry that our discussion is scattered around GitHub :( Could you please read my survey? yu-iskw#3 (comment)
I still wonder how we should treat a Java `Long` from Python.
Test build #35556 has finished for PR 6791 at commit
Test build #35559 has finished for PR 6791 at commit
```scala
if (seed != null) algo.setSeed(seed)

val documents = data.rdd.map(_.asScala.toArray).map { r =>
  r(0).getClass.getSimpleName match {
```
```scala
r(0) match {
  case i: java.lang.Integer => i.toLong
  case l: java.lang.Long => l
}
```
Test build #35611 has finished for PR 6791 at commit
```scala
val documents = data.rdd.map(_.asScala.toArray).map { r =>
  r(0) match {
    case i: java.lang.Integer =>
      (r(0).asInstanceOf[java.lang.Integer].toLong, r(1).asInstanceOf[Vector])
```
Simplify: `r(0).asInstanceOf[java.lang.Integer].toLong` -> `i.toLong`
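The same id-widening concern exists on the Python side before the data ever reaches the JVM. A small sketch (hypothetical helper, assuming Python 3 semantics where `int` covers the range of Java's `long`) of validating document ids up front:

```python
# Hypothetical helper: coerce a document id to a plain int so the JVM
# side can treat it uniformly as a long. Rejects bools (a bool is an
# int subclass in Python) and anything non-integral.
def normalize_doc_id(doc_id):
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        raise TypeError("document id must be an integer, got %r" % (doc_id,))
    return int(doc_id)
```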
@yu-iskw Would you mind making those 2 updates and fixing the merge issues? Other than that, this looks ready.
Test build #36377 has finished for PR 6791 at commit
Jenkins, test this please.
Test build #36380 has finished for PR 6791 at commit
```python
:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
    (default = EMLDAOptimizer)
```
Sorry, I should have noticed this earlier: this should say "em", since that is the actual value specified. Can it also list the 2 supported values ("em" and "online")?
Just that one item...
@jkbradley Thank you for your feedback. I added the comment.
Test build #36443 has finished for PR 6791 at commit
```python
:param seed: Random Seed
:param checkpointInterval: Period (in iterations) between checkpoints.
:param optimizer: LDAOptimizer used to perform the actual calculation
    (default = EMLDAOptimizer). Currently "em", "online" are supported. Default to "em".
```
I think it would make more sense to write only 1 default (remove "(default = EMLDAOptimizer)").
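Applying that suggestion, the parameter description could read something like the following (illustrative wording only; the function signature here is a minimal hypothetical stub, not the real `LDA.train`):

```python
def train(rdd, k=10, optimizer="em"):
    """Hypothetical excerpt of the revised docstring.

    :param optimizer: LDAOptimizer used to perform the actual
        calculation. Currently "em" and "online" are supported.
        Default to "em".
    """
    pass
```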
LGTM pending tests. Thanks very much!
Test build #36878 has finished for PR 6791 at commit
Wait, I just noticed a thing or two. Sorry I missed them before!
```
@@ -562,5 +564,67 @@ def _test():
    exit(-1)


class LDAModel(JavaModelWrapper):
```
@davies In Scala, LDAModel is abstract. LocalLDAModel and DistributedLDAModel inherit from it. We should eventually have this same setup in Python. What is needed to maintain backwards compatibility? If we add this API in Spark 1.5, can we later make LDAModel abstract, and have LocalLDAModel and DistributedLDAModel inherit from it?
@davies What do you think about that? Should we also support the local one in Python?
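A minimal sketch of the hierarchy being discussed (class names follow the Scala side; the method shown is hypothetical). The backwards-compatibility point is that if `LDAModel` ships first as a concrete class and is later made abstract with `LocalLDAModel` and `DistributedLDAModel` inheriting from it, existing `isinstance(model, LDAModel)` checks keep working:

```python
# Sketch of mirroring the Scala-side class hierarchy in Python.
class LDAModel:
    """Base model; could later become abstract."""
    def topics_matrix(self):
        raise NotImplementedError

class LocalLDAModel(LDAModel):
    """Model stored locally on the driver."""
    def __init__(self, topics):
        self._topics = topics

    def topics_matrix(self):
        return self._topics

class DistributedLDAModel(LDAModel):
    """Model whose state stays distributed on the cluster."""
    def __init__(self, topics):
        self._topics = topics

    def topics_matrix(self):
        return self._topics
```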
@jkbradley I got it. Thank you for letting me know.
Jenkins, test this please.
Test build #37284 has finished for PR 6791 at commit
Test build #17 has finished for PR 6791 at commit
Test build #37286 has finished for PR 6791 at commit
LGTM, merging with master
@jkbradley thank you for merging it!
I implemented the Python API for LDA. However, I didn't implement `LDAModel.describeTopics()`, because it's a little hard to implement right now. Adding documentation and an example for it would also fit better in another issue.

TODO: `LDAModel.describeTopics()` in Python must also be implemented, but it would be better to handle it in another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.
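For reference, a pure-Python sketch of the shape `describeTopics` could eventually take (hypothetical signature; in Scala it returns, per topic, a pair of term indices and term weights sorted by weight):

```python
# Hypothetical sketch of describeTopics: given a topics matrix where
# topics[t][w] is the weight of term w in topic t, return one
# (term_indices, term_weights) pair per topic, with the top max_terms
# terms sorted by descending weight.
def describe_topics(topics, max_terms):
    result = []
    for weights in topics:
        order = sorted(range(len(weights)), key=lambda w: -weights[w])[:max_terms]
        result.append((order, [weights[w] for w in order]))
    return result

topics = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
# describe_topics(topics, 2) == [([1, 2], [0.6, 0.3]), ([0, 2], [0.5, 0.3])]
```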