This repo holds the prediction algorithms, code, and supporting data for real-time prediction of the 2012 Congressional elections using the Twitter feed. The project is based on a working paper by Mark Huberty, entitled "Voting with your tweet: forecasting elections with social media data".
Thanks to the support of the UC Berkeley Graduate School of Journalism, we published real-time predictions throughout the 2012 election cycle. See Voting with your Tweet for more detail.
A range of papers have attempted to predict elections based on the content of election-related messages in the Twitter feed. These include papers on the U.S. Presidential Election (O'Connor et al 2010), the German Bundestag (Tumasjan et al 2010), and the British Parliament (Tweetminster 2010). More recently, Daniel Gayo-Avello pointed out various problems with most of these papers. One of the more significant problems he identifies is the lack of /a priori/ prediction of future elections. This project is an experiment in doing just that.
The algorithms in the folder were trained on the results of the 2010 United States Congressional elections. They use a bigram bag-of-words language model and the SuperLearner machine learning algorithm proposed by van der Laan et al (2007). Both binary (win/loss) and continuous (vote share) algorithms are provided. More detail can be found in the working paper cited above.
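As a rough illustration of what a bigram bag-of-words representation looks like, the sketch below turns two made-up messages into a matrix of bigram counts using base R only. The helper `bigrams`, the tokenization rule, and the sample messages are assumptions for illustration; the repo's own feature extraction may differ.

```r
# Hypothetical helper: split one message into lowercase word bigrams.
bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9#@']+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Made-up sample messages, not data from this project.
tweets <- c("Vote for Smith today", "Smith wins the vote")

# Vocabulary: every distinct bigram seen across the messages.
vocab <- sort(unique(unlist(lapply(tweets, bigrams))))

# One row per message, one column per bigram, cells hold counts.
dtm <- t(vapply(
  tweets,
  function(x) as.integer(table(factor(bigrams(x), levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab
print(dtm)
```

A matrix of this shape, aggregated per candidate or district, is the kind of input a learner such as SuperLearner would be given alongside the 2010 outcomes.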
People wanting to experiment with the algorithms should note the following issues:
- The algorithms depend on the SuperLearner package and only work with the 1.x series. This work used SuperLearner v.1.1-18.
- The voteshare predictor uses the arm library. More recent versions changed some underlying function names. The algorithms require arm version 1.3-07.
- The code will run on the latest version of R (2.15.1), but a bug in 2.15.0 will prevent arm_1.3-07 from installing.
- Everything here was tested on R 2.15.1, running on Ubuntu Linux 10.04 LTS.
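The version requirements above can be checked before running anything. The following is a minimal sketch, assuming the packages are looked up in the default library path; `version_ok` and `check_pins` are hypothetical helpers, not part of this repo. It pins SuperLearner to the exact version used here, which is stricter than the "1.x series" requirement.

```r
# Versions stated in the notes above.
required <- c(SuperLearner = "1.1-18", arm = "1.3-07")

# Hypothetical helper: TRUE if the installed version equals the pinned one.
# package_version() treats "1.3-07" and "1.3.7" as the same version.
version_ok <- function(installed, wanted) {
  package_version(installed) == package_version(wanted)
}

# Hypothetical helper: print the status of each pinned package.
check_pins <- function(pins) {
  for (pkg in names(pins)) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      message(pkg, ": not installed (want ", pins[[pkg]], ")")
      next
    }
    have <- as.character(utils::packageVersion(pkg))
    status <- if (version_ok(have, pins[[pkg]])) "ok" else "version mismatch"
    message(pkg, ": ", have, " (want ", pins[[pkg]], ") ", status)
  }
}

check_pins(required)

# The notes above report an installation problem for arm under R 2.15.0.
if (getRversion() == "2.15.0") {
  message("R 2.15.0 detected: arm 1.3-07 may fail to install")
}
```

If a package reports a mismatch, the pinned source tarball would need to be installed by hand, since `install.packages` fetches the current release by default.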