Ideas Scrapyard
This page was created only as a "history artifact"; it contains many ideas that are no longer relevant.
Note: Consider integration with existing Python sLDA
Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It can be used, for example, to predict the number of "Likes" a post receives or the number of stars in a movie review.
In vanilla LDA, we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add a target variable to the LDA model, for example the number of stars assigned in a movie review or the number of "Likes" of a post.
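The generative story above can be sketched in a few lines of NumPy. This is only an illustration, not a proposed implementation: it assumes a Gaussian response whose mean is a linear function of the document's empirical topic frequencies, and the names `generate_slda_document`, `eta` and `sigma` are hypothetical.

```python
import numpy as np

def generate_slda_document(alpha, topics, eta, sigma, n_words, rng):
    """Sample one document and its response under the sLDA generative story.

    alpha  : (K,) Dirichlet prior over topic proportions
    topics : (K, V) array, each row a distribution over the vocabulary
    eta    : (K,) regression coefficients (assumed Gaussian response)
    sigma  : standard deviation of the response noise
    """
    n_topics, vocab_size = topics.shape
    # Topic proportions for this document: a draw from a Dirichlet.
    theta = rng.dirichlet(alpha)
    # One topic assignment per word position, chosen from those proportions.
    z = rng.choice(n_topics, size=n_words, p=theta)
    # Each word is drawn from the topic it was assigned to.
    words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
    # sLDA addition: the target depends on the empirical topic frequencies.
    z_bar = np.bincount(z, minlength=n_topics) / n_words
    y = rng.normal(eta @ z_bar, sigma)
    return words, z, y

rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(6), size=3)  # K=3 topics over a V=6 vocabulary
words, z, y = generate_slda_document(
    alpha=np.full(3, 0.5), topics=topics,
    eta=np.array([1.0, 3.0, 5.0]), sigma=0.1, n_words=20, rng=rng)
```

Dropping the last two lines of the function body (`z_bar` and `y`) leaves the plain LDA process; the target variable is the only addition.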
While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.
- Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
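The mini-batch, constant-memory requirement in the goals above could look roughly like the sketch below. It is not sLDA inference: the per-batch statistic is a placeholder (mean word counts), and the decaying step size is borrowed from online variational Bayes for LDA as an assumption. All names (`iter_minibatches`, `online_update`, `document_stream`, `offset`, `decay`) are hypothetical.

```python
from itertools import islice
import numpy as np

def iter_minibatches(stream, batch_size):
    """Lazily yield lists of at most batch_size items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def online_update(state, batch_stat, step, offset=1.0, decay=0.5):
    """Blend a mini-batch statistic into the running state.

    The weight rho = (offset + step) ** -decay shrinks over time, so later
    batches refine the estimate instead of overwriting it.
    """
    rho = (offset + step) ** (-decay)
    return (1.0 - rho) * state + rho * batch_stat

def document_stream(n_docs, vocab_size, seed=0):
    """Hypothetical corpus: yields one bag-of-words count vector at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_docs):
        yield rng.poisson(1.0, size=vocab_size)

vocab_size = 5
state = np.zeros(vocab_size)
n_batches = 0
batches = iter_minibatches(document_stream(1000, vocab_size), batch_size=64)
for step, batch in enumerate(batches):
    # Placeholder statistic; a real model would run its inference step here.
    batch_stat = np.mean(batch, axis=0)
    state = online_update(state, batch_stat, step)
    n_batches += 1
```

Only one mini-batch and the fixed-size `state` are held in memory at any time, so memory use does not grow with the number of documents in the stream.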
Deliverables
- Code: a pull request against gensim [5, 6] on GitHub [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8], following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.
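A minimal harness for collecting the three numbers the report asks for might look like the sketch below. It assumes accuracy is summarized as predictive R², a common choice for regression targets; whether this matches the methodology of [1] exactly should be checked against the paper. `measure`, `predictive_r2` and `predict_mean` are hypothetical helpers, and `predict_mean` is only a stand-in for a trained model.

```python
import time
import tracemalloc
import numpy as np

def predictive_r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - residual / total

def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_bytes) for a single call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Peak size of memory blocks traced since tracemalloc.start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak

def predict_mean(y_train, n_test):
    """Placeholder predictor: always returns the training mean."""
    return np.full(n_test, np.mean(y_train))

y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([2.0, 3.0])
y_pred, seconds, peak_bytes = measure(predict_mean, y_train, len(y_test))
score = predictive_r2(y_test, y_pred)
```

Note that `tracemalloc` only sees allocations made through Python's memory allocators, so the peak it reports may understate the footprint of native code.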
Resources:
[3] sLDA implementation in C++
[4] Implementation of sLDA in R
[7] Gensim on GitHub