JIRA issue: [SPARK-1405] Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib #476
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14318/
```scala
docCounts: Vector,               // tokens per document
topicCounts: Vector,             // tokens per topic
docTopicCounts: Array[Vector],   // per-document topic counts
topicTermCounts: Array[Vector])  // per-topic term counts
```
I expect that this will be really big; maybe the last two variables should be RDDs, similar to what we do with ALS.
That makes sense. I think `docTopicCounts` can be sliced easily with respect to document partitions, but `topicTermCounts` is hard to slice. I'll find a way to handle it.
Before I get too deep into this review, I want to step back and think about whether we expect the model in this case to be on the order of the size of the data. I think it is, and if so, we may want to consider representing the model as RDD[DocumentTopicFeatures] and RDD[TopicWordFeatures], similar to what we do with ALS. This may change the algorithm substantially. Separately, maybe it makes sense to have a concrete use case to work with (the Reuters dataset or something) so that we can evaluate how much memory actually gets used for a reasonably sized corpus.
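For illustration only, such a distributed representation might look like the sketch below. The type names `DocumentTopicFeatures` and `TopicWordFeatures` come from the comment above; the field layout is a hypothetical guess, not code from this PR.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: keep the two large count tables as RDDs,
// analogous to how ALS distributes its user and product factors.
case class DocumentTopicFeatures(docId: Long, topicCounts: Array[Int])
case class TopicWordFeatures(topicId: Int, termCounts: Array[Int])

class DistributedLDAModel(
    val docTopics: RDD[DocumentTopicFeatures],  // O(numDocs x numTopics) overall
    val topicWords: RDD[TopicWordFeatures])     // O(numTopics x vocabSize) overall
```

Presumably only the small global per-topic count vector would need to stay on the driver.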
Also, speaking of @jegonzal, maybe this is a natural first point of integration between MLlib and GraphX. I know the GraphX folks have an implementation of LDA, and maybe this is a chance for us to leverage that work.
Yep, I know @jegonzal from his paper on parallel Gibbs sampling. But I only know the GraphLab implementation and couldn't find one in GraphX. It would be great to have the chance to talk with Joseph offline. Besides, I will add a use case for the Reuters dataset and try to fix the issues raised above.
I would be happy to talk more about this after the OSDI deadline. As far as storing the model (or more precisely the counts and samples) as an RDD, I think this really is necessary; the model in this case should be on the order of the size of the data. Essentially, what you want is the ability to join the term-topic counts with the document-topic counts for each token in a given document. Given these two count tables (along with the background distribution of topics in the entire corpus), you can compute the new topic assignment. Here is an implementation of the collapsed Gibbs sampler for LDA using GraphX: amplab/graphx#113
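As a rough sketch of the per-token sampling step Joseph describes (method and parameter names are illustrative, not taken from the GraphX implementation):

```scala
import scala.util.Random

// Hypothetical sketch: given the joined document-topic counts, term-topic
// counts, and global topic counts (all excluding the current token), draw
// a new topic assignment from the collapsed Gibbs conditional.
def sampleTopic(
    docTopicCounts: Array[Int],   // topic counts for this token's document
    termTopicCounts: Array[Int],  // topic counts for this token's term
    topicCounts: Array[Int],      // global topic counts over the corpus
    alpha: Double,
    beta: Double,
    vocabSize: Int,
    rng: Random): Int = {
  val numTopics = topicCounts.length
  val weights = Array.tabulate(numTopics) { k =>
    (docTopicCounts(k) + alpha) * (termTopicCounts(k) + beta) /
      (topicCounts(k) + vocabSize * beta)
  }
  // Inverse-CDF sampling over the unnormalized weights.
  val u = rng.nextDouble() * weights.sum
  var k = 0
  var cumulative = weights(0)
  while (cumulative < u && k < numTopics - 1) {
    k += 1
    cumulative += weights(k)
  }
  k
}
```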
```scala
// Tokenize and filter terms
val almostData = sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
  val tokens = JavaWordTokenizer(content)
```
We should allow users to customize the tokenizer here. We could add a parameter `tokenizer: (String) => Iterable[String]` to `loadCorpus`, so that `dirStopWords` is no longer required.
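A minimal sketch of what that signature could look like (assuming the existing `JavaWordTokenizer` and a return type pairing file names with tokens; both are guesses, not this PR's actual API):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch: the caller supplies the tokenizer, so a stop-word
// directory is no longer a required argument.
def loadCorpus(
    sc: SparkContext,
    dir: String,
    minSplits: Int,
    tokenizer: String => Iterable[String]): RDD[(String, Iterable[String])] = {
  sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
    (fileName, tokenizer(content))
  }
}
```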
@yinxusen Per the discussion on https://issues.apache.org/jira/browse/SPARK-1405, we want to have a GraphX-based implementation and a distributed representation of the topic model. Do you mind closing this PR? Thanks for your contribution, and thanks to @etrain, @jegonzal, and @witgo for the code review!
(This PR is based on joint work with @liancheng four months ago.)
Overview
LDA is a classical topic model in machine learning that extracts topics from a corpus. Gibbs sampling (GS for short) is a common way to train an LDA model.
The LDA model consists of four count tables, two 1-dimensional:

- `docCounts`: tokens per document
- `topicCounts`: tokens per topic

plus two 2-dimensional:

- `docTopicCounts`: per-document topic counts
- `topicTermCounts`: per-topic term counts
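For reference (standard collapsed-Gibbs notation, not copied from this PR), these four tables are exactly what the sampler needs: the conditional probability of reassigning a token of term $w$ in document $d$ to topic $k$, with counts excluding the current token, is

$$
P(z_i = k \mid z_{\neg i}, w) \;\propto\; \frac{(n_{d,k} + \alpha)\,(n_{k,w} + \beta)}{n_k + V\beta},
$$

where $n_{d,k}$ comes from `docTopicCounts`, $n_{k,w}$ from `topicTermCounts`, $n_k$ from `topicCounts`, $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet priors.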
Implementation details
- `SparkContext.wholeTextFiles()` is convenient for offline experimentation, while `SparkContext.textFile()` is better for online applications.
- Terms are mapped to compact `Int` IDs.
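As a minimal sketch of that indexing step (assuming `tokenized: RDD[(String, Iterable[String])]` as produced by the hypothetical `loadCorpus` above):

```scala
// Hypothetical sketch: build a term -> Int dictionary on the driver and
// re-encode each document as an array of compact term IDs.
val termToId: Map[String, Int] =
  tokenized.flatMap { case (_, terms) => terms }
    .distinct()
    .collect()
    .zipWithIndex
    .toMap

val docsAsIds: RDD[(String, Array[Int])] =
  tokenized.mapValues(terms => terms.map(termToId).toArray)
```

If the vocabulary were too large to collect on the driver, `RDD.zipWithIndex` could assign the IDs distributively instead.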