Integration of Text Mining and Topic Modeling Tools

Background

Recent decades have witnessed an exponential growth of on-line resources, the lion’s share of which take the form of (unstructured) text. It is no wonder, then, that text mining and related fields of automated text analysis have become some of the most active areas in information technology and in other disciplines, including the humanities, that use IT methods in textual research. The “digital turn” in text analysis means access to previously unheard-of amounts of data; at the same time, however, it presents non-trivial challenges. To name but a few, these include information search and retrieval, data analysis, text categorization, authorship attribution, plagiarism detection, sentiment analysis, forensics, and many others.

A prominent place in text analysis is occupied by a plethora of machine-learning solutions, ranging from unsupervised exploratory methods (principal components analysis, multidimensional scaling, hierarchical cluster analysis) to sophisticated supervised methods such as support vector machines, nearest shrunken centroids, the k-nearest neighbor classifier, and so forth. One of the methods that enjoys ever-growing popularity in the field is topic modeling, a statistical model that aims at discovering abstract “topics” (co-occurring cohorts of words) that repeatedly appear in a collection of documents. The most common way of extracting topics from documents is latent Dirichlet allocation (LDA), an algorithm available in several programming languages, including Python, Java, and of course R (via the packages ‘mallet’, ‘topicmodels’, or ‘lda’). Even if extracting topic models can be accomplished more or less flawlessly in R (some issues will be discussed below), there is no easy way to perform a supervised topic modeling analysis using the existing R packages.
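
As a point of reference, fitting such a model with the package ‘topicmodels’ takes only a few lines. The sketch below is illustrative only: it uses the AssociatedPress document-term matrix shipped with the package, and the number of topics (and the seed) are arbitrary choices, not part of the proposed project.

    library(topicmodels)

    # a document-term matrix of Associated Press articles shipped with the package
    data("AssociatedPress", package = "topicmodels")

    # fit an LDA model on a subset of documents; k = 10 topics is arbitrary here
    lda_model <- LDA(AssociatedPress[1:100, ], k = 10, control = list(seed = 1234))

    terms(lda_model, 5)   # top 5 words of each topic
    topics(lda_model)     # most likely topic of each document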

Related work

Computer-assisted text analysis (including text mining and topic modelling) is significantly underrepresented among R packages, especially when compared to the number of packages aimed at physicists, chemists and biologists. Even so, one has to mention the excellent package ‘tm’, which handles basic text mining routines and supports parallel computation (Feinerer et al., 2008), and the package ‘stylo’ (https://sites.google.com/site/computationalstylistics), which reads raw texts from the hard drive, applies a number of pre-processing routines, and performs various classification tasks (Eder et al., 2015). Moreover, the package ‘stylo’ is supplemented by a simple GUI, which makes its functionality easily accessible to novice users.
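
For illustration, a typical pre-processing pipeline with ‘tm’ might look roughly as follows; the directory name "corpus/" and the particular cleaning steps are placeholders, and a real analysis would adjust them to the language and the task at hand.

    library(tm)

    # read all plain-text files from a directory (the path is a placeholder)
    corpus <- VCorpus(DirSource("corpus/", encoding = "UTF-8"))

    # standard cleaning steps; the exact choice depends on the language and task
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)

    # document-term matrix, the usual input for further analyses
    dtm <- DocumentTermMatrix(corpus)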

Fitting topic models is possible via the package ‘topicmodels’ (Grün and Hornik, 2011), as well as with the classic software Mallet (McCallum, 2002), written in Java and available in R via the package ‘mallet’ by David Mimno. A valuable extension of (or rather a wrapper around) this software is the package ‘dfrtopics’ (http://agoldst.github.io/dfrtopics/introduction.html), which was designed to work with the metadata and pre-aggregated text data supplied by JSTOR’s Data for Research service (http://dfr.jstor.org/).
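
A rough sketch of the ‘mallet’ route is given below; the toy documents, the stoplist, the number of topics, and the number of iterations are all placeholders, and the package requires a working Java installation via ‘rJava’.

    library(mallet)   # requires Java and the rJava package

    # toy data; real ids and texts would come from a corpus read in beforehand
    ids   <- c("doc1", "doc2", "doc3")
    texts <- c("romeo and juliet meet at the ball",
               "stock prices fell sharply on the market",
               "the lovers exchange vows in the moonlight")

    # mallet expects a plain-text stoplist file; a tiny placeholder is written here
    writeLines(c("the", "and", "at", "on", "in"), "stoplist.txt")

    instances <- mallet.import(ids, texts, "stoplist.txt",
                               token.regexp = "[\\p{L}]+")

    topic_model <- MalletLDA(num.topics = 2)   # number of topics is arbitrary
    topic_model$loadDocuments(instances)
    topic_model$train(200)                     # number of Gibbs sampling iterations

    # word-topic and document-topic distributions
    topic_words <- mallet.topic.words(topic_model, smoothed = TRUE, normalized = TRUE)
    doc_topics  <- mallet.doc.topics(topic_model, smoothed = TRUE, normalized = TRUE)

    # top 5 words of the first topic
    mallet.top.words(topic_model, topic_words[1, ], 5)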

The aforementioned packages have their obvious advantages; at the same time, however, none of them offers a complete workflow for analyzing documents harvested from a corpus or, say, a webpage. In particular, there is no tool that would combine easy-to-accomplish text pre-processing, also for non-English languages (as provided by the package ‘stylo’), and high performance on parallel computing infrastructure (as in the package ‘tm’) with an algorithm for fitting topic models (e.g. from the package ‘mallet’), supplemented by an attractive visualization of the obtained topics (as in the package ‘dfrtopics’).

Last but definitely not least, there is no out-of-the-box way of performing supervised topic modeling, in which the procedure is divided into two stages: (i) co-occurring words, or topics, are extracted from a training subset of documents, and then (ii) the trained model is applied to classify “new” data. This problem has not yet been tackled in R, even if some attempts have been made, see e.g.: http://stackoverflow.com/questions/34150612/classifying-new-text-using-mallet-package and https://gist.github.com/agoldst/edcfd45b5ac371296b76
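
One conceivable flavour of such a two-stage analysis (topic proportions inferred by a trained LDA model used as features for an ordinary classifier) is sketched below; the data subsets, the random labels, the number of topics, and the choice of a classifier from ‘e1071’ are all illustrative assumptions rather than a prescribed design.

    library(topicmodels)
    library(e1071)   # just one of many possible classifiers

    # toy setup using the AssociatedPress document-term matrix shipped with
    # 'topicmodels'; the labels are random placeholders, real ones would come
    # from the metadata of the corpus
    data("AssociatedPress", package = "topicmodels")
    train_dtm    <- AssociatedPress[1:100, ]
    test_dtm     <- AssociatedPress[101:120, ]
    train_labels <- factor(sample(c("tabloid", "broadsheet"), 100, replace = TRUE))

    # stage (i): fit a topic model on the training documents only
    lda_train <- LDA(train_dtm, k = 10, control = list(seed = 1234))
    train_features <- posterior(lda_train)$topics   # per-document topic proportions

    # stage (ii): infer topic proportions for "new" documents with the trained
    # model (the test matrix must share the training vocabulary), then classify
    test_features <- posterior(lda_train, newdata = test_dtm)$topics

    classifier <- svm(x = train_features, y = train_labels)
    predict(classifier, test_features)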

Details of your coding project

The tasks to be undertaken include:

Expected impact

The expected impact in linguistics, literary studies, digital history and similar disciplines seems to be self-evident, as reflected by the rapid growth of the method’s presence in publications and at scholarly events, such as the annual Digital Humanities conference. Beyond these basic research fields, however, the supervised version of topic modeling seems to be a very promising addition to the text mining toolbox. A few introductory tests have shown that it can be used to automatically distinguish high-brow from low-brow texts (or tabloid from non-tabloid texts) in large collections of newspaper articles. It can be safely assumed that the method might be leveraged to address other classification problems as well.

Mentors

Prof. Maciej Eder (Institute of Polish Language, Polish Academy of Sciences), maciejeder [at] gmail.com.

Tomasz Melcer (Department of Biomedical Engineering, Wroclaw University of Technology), liori [at] exroot.org.

Tests

Applicants should be familiar with the following languages and routines:

  • A good working knowledge of programming in R.
  • Familiarity with text processing in R.
  • Familiarity with the construction of R packages.
  • Familiarity with topic modelling (LDA algorithm).
  • Familiarity with machine-learning approaches.
  • Experience with Roxygen for documentation.
  • Experience with knitr/LaTeX for vignettes.

Applicants should be able to solve the following tests:

  • Write a function to scrape texts from Project Gutenberg that is able to automatically exclude the legal disclaimers (they are added at the end of each document);
  • Write a function to invoke one of the external taggers, e.g. TreeTagger or the Stanford NLP Tagger (please make sure that your code is portable across operating systems);
  • Write a function to extract only nouns (or: only verbs) from a tagged (see above) text, in order to analyze it further using topic modeling approaches, e.g. via the package ‘mallet’ (a rough sketch of such a filter is given below this list).
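
A rough sketch of the third test, assuming TreeTagger-style output (one token per line with tab-separated columns token, POS tag, lemma); the file name, the column layout, and the English tagset are assumptions for the sake of illustration.

    # extract the lemmas of all nouns from a TreeTagger-style tagged file
    extract_nouns <- function(tagged_file) {
      tagged <- read.delim(tagged_file, header = FALSE, sep = "\t",
                           col.names = c("token", "pos", "lemma"),
                           stringsAsFactors = FALSE, quote = "")
      # in the English TreeTagger tagset, noun tags start with "N" (NN, NNS, NP, ...)
      nouns <- tagged$lemma[grepl("^N", tagged$pos)]
      # return a "bag of nouns" ready for topic modeling, e.g. via 'mallet'
      paste(nouns, collapse = " ")
    }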

References

Eder, M., Kestemont, M. and Rybicki, J. (2015). Stylometry with R: a package for computational text analysis. R Journal, 8(1), https://journal.r-project.org/archive/accepted/eder-rybicki-kestemont.pdf

Feinerer, I., Hornik, K. and Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5): 1–54.

Grün, B. and Hornik, K. (2011). topicmodels: an R package for fitting topic models. Journal of Statistical Software, 40(13): 1–30, doi:10.18637/jss.v040.i13.

McCallum, A. K. (2002). MALLET: A machine learning for language toolkit, http://www.cs.umass.edu/~mccallum/mallet
