
Ideas & Feature proposals


A list of ideas for new functionality and projects in Gensim, "topic modelling for humans", a scientific Python package for efficient, large-scale topic modelling.

Gensim's design philosophy builds on data streaming to process very large datasets (larger than RAM; potentially infinite). Data points are processed one at a time, in constant RAM.

This places stringent requirements on the internal algorithms used (online learning, single pass methods) as well as their implementation, to achieve top performance and robustness.

If you'd like to work on any of the topics below, or have your own ideas, get in touch on the gensim mailing list.


Integrate a fast k-NN library

Background: Gensim's scope also comprises search for topically related documents. A corpus of documents is analyzed for topics and then indexed, allowing fast queries for "most similar documents".

The current implementation has linear complexity in the number of indexed documents, although with extremely low constant factors. It can retrieve "top-100 most similar" among millions of indexed documents in hundreds of milliseconds on a common laptop. The retrieved results are exact, which is an overkill in many applications: approximate results retrieved in sub-linear time may be enough.

To do: Evaluate and integrate fast libraries for approximate k-NN search. Create a concise API that abstracts away from any particular library, so that algorithms that need k-NN can accept any implementation following this API. Create gensim wrappers for selected k-NN libraries that follow this API.

The desired API should follow the existing Similarity API, while allowing existing libraries like NearPy, kgraph, Annoy or similar to be "plugged in" as the k-NN execution engine.

Ideally, the solution will simply wrap an existing dedicated, fast, robust library; writing a good and scalable k-NN library from scratch is a non-trivial task. Most academic implementations don't fit the bill; see the exploration of the Python landscape and summary at http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/#survivors
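
To make the discussion concrete, here is a minimal sketch of what such a pluggable wrapper could look like, using Annoy as the execution engine. The class and method names (AnnoyIndexer, add_documents, most_similar) are hypothetical, chosen to mirror gensim's Similarity API; this is a sketch, not a finished design:

```python
from annoy import AnnoyIndex

from gensim import matutils


class AnnoyIndexer(object):
    """Hypothetical adapter: index dense document vectors with Annoy."""

    def __init__(self, num_features, num_trees=50):
        self.index = AnnoyIndex(num_features, 'angular')
        self.num_features = num_features
        self.num_trees = num_trees

    def add_documents(self, corpus):
        # corpus: streamed gensim corpus of sparse (feature_id, weight) vectors
        for docno, doc in enumerate(corpus):
            self.index.add_item(docno, matutils.sparse2full(doc, self.num_features))
        self.index.build(self.num_trees)

    def most_similar(self, query, topn=10):
        vec = matutils.sparse2full(query, self.num_features)
        ids, dists = self.index.get_nns_by_vector(vec, topn, include_distances=True)
        # convert Annoy's angular distance back to cosine similarity,
        # to match the output of gensim's exact Similarity classes
        return [(docno, 1.0 - dist ** 2 / 2.0) for docno, dist in zip(ids, dists)]
```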

Resources: Developers of all the packages mentioned above are responsive, aware of gensim and open to improvements/feedback. Gensim contains a super fast, linear scan for similarity search, which can serve as a baseline for any benchmarking and as API reference.

Github ticket #51 has some theoretical pointers; this benchmark of kNN libraries on Wikipedia data (~3.5 million documents) has practical code and an ecosystem summary.

Distributed computing

Background: Gensim contains distributed implementations of several algorithms. The implementations use Pyro4 for network communication and are fairly low-level.

To do: Investigate + integrate one of the higher level frameworks for distributed computation, so that gensim can plug into them without reinventing the wheel. Implement one of the algorithms in gensim in this framework: for example, adapt the distributed online Latent Semantic Analysis or online LDA that are already in gensim.

The solution must support online computation (data coming as a non-repeatable stream of documents); batch processing is not enough.
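
To make that requirement concrete, here is a minimal sketch of the kind of incremental update any candidate framework would need to support. LsiModel.add_documents() is gensim's existing online API; `dictionary` and `one_pass_stream` are assumed to come from elsewhere:

```python
from itertools import islice

from gensim import models

def iter_chunks(stream, chunksize):
    """Yield successive `chunksize`-document chunks from a one-pass stream."""
    stream = iter(stream)
    while True:
        chunk = list(islice(stream, chunksize))
        if not chunk:
            break
        yield chunk

# `dictionary` and `one_pass_stream` are assumed built elsewhere; the stream
# cannot be rewound, so each document is seen exactly once.
lsi = models.LsiModel(num_topics=200, id2word=dictionary)
for chunk in iter_chunks(one_pass_stream, 10000):
    lsi.add_documents(chunk)  # incremental, constant-memory update
```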

Integration with a framework that plays well with Python (i.e. avoids Spark's serialisation to disk) would be preferable, so Ibis is a good candidate.

Resources: Ibis, Celery, Spark, Disco, Storm, Samza. Get in touch on the mailing list, or via @radimrehurek / @ogrisel.

Visualizations

Background: Gensim's motto is "topic modelling for humans". Humans like visualizations, and topic modelling lends itself to pie charts and document graphs very naturally.

To do: Survey the ways people visualize topic models. Integrate a visualization package, or implement our own. Build on top of modern technologies like d3.js or Continuum's Bokeh.

The scope of this project can range from a few visualization routines to a full HTTP client/server framework, à la Termite.

Example plots: display how topics are composed of words, how documents are composed of topics, and how a corpus is composed of topics. Make the visualization interactive -- go from words to topics, explore the model.

Resources: Jason Chuang's Termite package. Allison Chaney's TMVE, the Topic Model Visualization Engine. pyLDAvis, which has been ported from the R package of the same name.
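
As a concrete starting point, pyLDAvis already renders a trained gensim LDA model interactively. A short usage sketch, assuming `corpus` and `dictionary` were built during earlier preprocessing:

```python
import pyLDAvis
import pyLDAvis.gensim  # the gensim adapter shipped with pyLDAvis

from gensim.models import LdaModel

# `corpus` and `dictionary` are assumed to exist from earlier preprocessing
lda = LdaModel(corpus, id2word=dictionary, num_topics=50)

vis_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.show(vis_data)  # opens an interactive topic/term browser
```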

Sanity checks

Background: Gensim newcomers sometimes mistakenly pass a single document where a whole corpus is expected, or pass a string where a list of tokens is expected, etc.

This results in runtime errors, which confuse novices. Even worse, where gensim expects a sequence (~list of tokens) and the user passes a string, there is no error (a string is also an iterable!) but the individual characters are silently treated as whole tokens, leading to unexpected results...

To do: Collect/survey common newbie errors, then implement sanity checks that catch them early. This could be a method, or a set of methods, that accepts an object and checks whether it's really a streamed corpus, whether it's empty, whether it contains strange ids, ids in the right order, strange feature weights... In case anything's fishy, log a warning.
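
A hedged sketch of one such early check (the function name is hypothetical), catching the string-instead-of-tokens trap described above:

```python
import logging

logger = logging.getLogger(__name__)

def check_tokenized(document):
    """Warn when `document` looks like a raw string, not a list of tokens."""
    if isinstance(document, str):  # on Python 2, check `basestring` instead
        logger.warning(
            "expected a list of tokens but received a string; its %i "
            "characters would be silently treated as individual tokens",
            len(document)
        )

check_tokenized("human computer interaction")          # logs a warning
check_tokenized("human computer interaction".split())  # OK: a list of tokens
```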

Resources: utils.is_corpus() checks whether the input is a corpus without destroying the data, in case the content is streamed and non-repeatable. It can serve as a starting template.

Distributed similarity queries

Background: gensim implements fast routines for similarity retrieval ("give me documents similar to this one, using Latent Semantic Analysis"). The routines can make use of multiple cores (using BLAS), but not multiple machines. For large datasets, it is desirable to store shards in a distributed manner, across a cluster of computers. During querying, collect and merge results from all shards.

To do: Extend the sharding already present in gensim, so that different shards can reside on different computers. Design an API to make the concept of "shards" flexible, so that similarity classes that use different implementations (see k-NN above) can plug into it easily.

The network communication must use a fast protocol (Pyro? ØMQ?), so as to not increase query latency too much.
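
A hedged sketch of the query/merge step (all names hypothetical): each remote shard returns its local top-k, and the client keeps the globally best k.

```python
import heapq

def query_distributed(shards, query, topn=100):
    """`shards` are remote proxies (e.g. Pyro objects) exposing most_similar()."""
    hits = []
    for shard in shards:
        # each shard returns [(global_document_id, similarity), ...] for its slice
        hits.extend(shard.most_similar(query, topn))
    # keep the globally best `topn`, sorted by similarity
    return heapq.nlargest(topn, hits, key=lambda hit: hit[1])
```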

Resources: gensim mailing list.

Online NNMF

Background: Non-negative matrix factorization is an algorithm similar to Latent Semantic Analysis/Latent Dirichlet Allocation. It belongs to the family of matrix factorization methods and can be phrased as an online learning algorithm.

To do: Implement NNMF in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.
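
A deliberately simplified sketch of what the streamed interface implies: plain mini-batch NMF with Lee & Seung multiplicative updates, not the full online algorithm from the paper below. All shapes and names are illustrative only:

```python
import numpy as np

def infer_coefficients(W, V, inner_iters=30, eps=1e-9):
    """Infer non-negative coefficients H for chunk V (terms x docs), W fixed."""
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(inner_iters):
        # Lee & Seung multiplicative update for H
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
    return H

def update_topics(W, V, H, learn_rate=0.1, eps=1e-9):
    """Nudge the global topic matrix W toward this chunk's solution."""
    W_chunk = W * (V.dot(H.T) / (W.dot(H).dot(H.T) + eps))
    return (1.0 - learn_rate) * W + learn_rate * W_chunk

num_terms, num_topics = 10000, 100
W = np.abs(np.random.rand(num_terms, num_topics))
for V in chunk_stream:  # dense term-document chunks, assumed produced upstream
    H = infer_coefficients(W, V)
    W = update_topics(W, V, H)
```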

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Online NMF. Gensim github issue #132.

Supervised LDA

Background: Users have requested "supervised Latent Dirichlet Allocation, LDA" in gensim.

To do: Implement sLDA in a scalable manner in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Wang, Blei & Fei-Fei's sLDA paper.

Ramage et al.'s Labeled LDA.

Jagarlamudi's Seeded LDA.

Implementation of sLDA on David Blei's page.

Gensim github issue #121.

Explicit Semantic Analysis, ESA

Background: ESA is a different method of unsupervised document analysis, using Wikipedia as a resource. It uses a different type of analysis than the Bayesian / linear-algebraic models already in gensim, making for a refreshing addition.

To do: Implement ESA in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.
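
To make the method concrete, a hedged sketch of ESA's core representation in gensim terms, assuming `concept_corpus` and `dictionary` were built from a Wikipedia dump (e.g. via gensim.corpora.WikiCorpus):

```python
from gensim import models, similarities

# each Wikipedia article acts as one explicit "concept"
tfidf = models.TfidfModel(concept_corpus)
index = similarities.Similarity('/tmp/esa_shard', tfidf[concept_corpus],
                                num_features=len(dictionary))

def esa_vector(doc_bow):
    """Represent a document by its TF-IDF similarity to every concept."""
    return index[tfidf[doc_bow]]  # dense vector, one weight per Wikipedia concept
```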

Resources: Explicit Semantic Analysis.

There's a tentative pull request on github #107, which is not up to gensim standards and needs further polishing/engineering.

Dynamic Topic Models

Background: Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents.

To do: Implement DTM in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Gensim doesn't include any support for "timed streams", or time tags, at the moment. So part of this project will be engineering a clean API for this new functionality.
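
One possible shape for that API, purely as a discussion sketch (the class and parameter names are hypothetical): keep the corpus a single, chronologically ordered stream and pass a parallel list of time-slice sizes.

```python
class DtmModel(object):
    """Hypothetical signature, for API discussion only."""

    def __init__(self, corpus, time_slices, id2word=None, num_topics=100):
        # `corpus`: streamed corpus with documents ordered by time
        # `time_slices`: e.g. [4000, 5200, 6100] means the first 4000
        # documents form epoch 1, the next 5200 epoch 2, and so on
        raise NotImplementedError("design sketch only")

# usage sketch, over a chronologically sorted stream:
# dtm = DtmModel(sorted_corpus, time_slices=[4000, 5200, 6100], num_topics=50)
```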

Resources: Dynamic Topic Models.

Original Blei&Lafferty article PDF.

Wang&Blei&Heckerman article on Continuous Time Dynamic Topic Model PDF.

Wang&McCallum: "Topics over time" PDF.

Academic implementation of DTM on David Blei's page.

Author-Topic Models

Background: A generative model for documents that extends LDA to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words.

Applications include computing similarity between authors and the entropy of author output.

To do: Implement the author-topic model in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.
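
Since plain document vectors aren't enough here, the streamed input needs to be accompanied by authorship metadata. A hypothetical API sketch for discussion (names not final):

```python
# mapping from each author to the documents they (co-)authored
author2doc = {
    'alice': [0, 1, 4],  # 'alice' wrote (or co-wrote) documents 0, 1 and 4
    'bob': [1, 2, 3],
}

# hypothetical constructor and query, mirroring gensim's LdaModel style:
# model = AuthorTopicModel(corpus, author2doc=author2doc,
#                          id2word=dictionary, num_topics=100)
# model.get_author_topics('alice')  # -> alice's distribution over topics
```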

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "The Author-Topic Model for Authors and Documents" by Rosen-Zvi&al PDF.

Christopher Grainger offered help.

Model selection

Background: Gensim strives for reasonable defaults/automatically tuned parameters, but there are usually a few knobs that can be turned for each model (such as the number of topics). Currently there is no support for automatically evaluating model quality while trying different values for these parameters: users must write an explicit for-loop, going through the desired values and recording the best performing combination.

To do: Design an API that automates going through desired parameter values and records the best combination. Ideally, the evaluation happens in parallel, to speed things up and make better use of multicore/cluster setups.

The API must support streamed input (iterables) and piping several transformations one after another, each with a different set of available parameters (e.g. Dictionary's filter_extremes, followed by TF-IDF's normalization, followed by LSI's num_topics).
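
For reference, the explicit loop that the new API should automate, as a hedged baseline sketch. `stream_texts()` (returning a fresh iterator on each call) and `evaluate()` (any model quality score on held-out data) are assumed helpers:

```python
from itertools import product

from gensim import corpora, models

best_score, best_params = float('-inf'), None
for no_below, num_topics in product([5, 20], [100, 200, 400]):
    dictionary = corpora.Dictionary(stream_texts())
    dictionary.filter_extremes(no_below=no_below)
    bow_stream = (dictionary.doc2bow(text) for text in stream_texts())
    tfidf = models.TfidfModel(dictionary=dictionary)
    lsi = models.LsiModel(tfidf[bow_stream], id2word=dictionary,
                          num_topics=num_topics)
    score = evaluate(lsi, heldout_corpus)  # assumed scoring helper
    if score > best_score:
        best_score, best_params = score, (no_below, num_topics)
```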

Resources: Scikit-learn has a good API for parameter grid search, including parallelization. Maybe integrate the scikit-learn search methods directly, if they're powerful enough to accept streamed data; otherwise reimplement the loops directly.

Parallelization resources: multiprocessing, joblib, celery.

Nested Hierarchical Dirichlet Processes

Background: Paisley, Wang, Blei & Jordan recently developed a stochastic variational version of the nested HDP. It reportedly performs better than HDP etc. (of course!).

To do: Implement this model (probably extending / replacing the existing online HDP implementation in gensim) and evaluate it. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors), to allow large inputs.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "Nested Hierarchical Dirichlet Processes" by John Paisley, Chong Wang, David M. Blei and Michael I. Jordan PDF.

Pachinko Allocation Model

Background: Li & McCallum developed a hierarchical LDA-like model for document classification. They report 2-5% accuracy improvements over an LDA model on a test corpus. (http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf)

An implementation of this model would give users an additional choice of model, which may be helpful in some cases.

An implementation must be heavily unit tested and production-ready. It would reuse many of the same classes and methods as LDA, which is a bonus for a first implementation pass.

Resources: Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society B, 46, 193–227.

Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML.

Minka, T. (2000). Estimating a Dirichlet distribution.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. UAI.

Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. ICML.

Word Mover's Distance for word2vec

https://github.com/piskvorky/gensim/issues/482
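
Word Mover's Distance (Kusner et al., "From Word Embeddings to Document Distances") measures document distance as the minimum cost of moving one document's words onto the other's in word2vec space. A hedged sketch on top of pyemd follows; a production version would live inside gensim and handle out-of-vocabulary words, normalization and caching:

```python
import numpy as np
from pyemd import emd

def wmdistance(w2v, doc1, doc2):
    """WMD between two token lists, under the word embeddings in `w2v`."""
    vocab = sorted(set(doc1) | set(doc2))

    def nbow(doc):
        # normalized bag-of-words histogram over the joint vocabulary
        counts = np.array([float(doc.count(word)) for word in vocab])
        return counts / counts.sum()

    # ground distances: euclidean distance between word embedding pairs
    distances = np.array([[np.linalg.norm(w2v[w1] - w2v[w2]) for w2 in vocab]
                          for w1 in vocab])
    return emd(nbow(doc1), nbow(doc2), distances)

# e.g. with a trained gensim model `w2v = models.Word2Vec(sentences)`:
# wmdistance(w2v, "obama speaks media".split(), "president greets press".split())
```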

SparseTools package

A package for working with sparse matrices, built on top of scipy, optimized with Cython, memory-efficient and fast. An improvement on, and replacement for, scipy's recently deprecated sparsetools package.

It should also include faster (Cythonized) conversions between the "gensim streamed corpus" format and various other formats (scipy.sparse, numpy...), similar to matutils (https://radimrehurek.com/gensim/matutils.html#gensim.matutils.corpus2csc).
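
For illustration, the existing pure-Python conversions that this package would speed up:

```python
from gensim import matutils

# a streamed gensim corpus: two documents as sparse (feature_id, weight) lists
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]

csc = matutils.corpus2csc(corpus, num_terms=3)      # -> scipy.sparse.csc_matrix
dense = matutils.corpus2dense(corpus, num_terms=3)  # -> 2D numpy array
streamed = matutils.Sparse2Corpus(csc)              # back to a streamed corpus
```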

Glove word-embedding integration

Integrate, or rewrite in an optimized way, the GloVe word-embedding code by Maciej Kula (https://github.com/maciejkula/glove-python). A next step would be adding support for the Swivel algorithm.

Automatic topic labeling

Implement the algorithm from the paper "Automatic Labeling of Multinomial Topic Models" by Qiaozhu Mei et al. Suggested by Jason Liu.

Joint embedding

A slight modification of word2vec for the purpose of sponsored advertising. See the paper "Joint Embedding of Query and Ad by Leveraging Implicit Feedback".

Implement OHDOCLUS – Online and Hierarchical Document Clustering

See https://github.com/ruiEnca/ohDoclus