As part of the seminar "Machine learning in practice", taught by Lucas Bechberger in the fall of 2021 at the Institute of Cognitive Science, University of Osnabrück, we implement an exemplary machine learning pipeline, from preprocessing to the deployment of the application, testing and comparing different preprocessing, feature extraction, dimensionality reduction, classification, and evaluation methods.
The goal of our example project is to predict, given a tweet, whether it will go viral, i.e., receive many likes and retweets. The virality criterion is defined by the sum of likes and retweets, where the threshold is specifiable by the user but defaults to
As a data source, we use the "Data Science Tweets 2010-2021" data set (version 3) by Ruchi Bhatia from Kaggle, which contains tweets resulting from searches for "data science", "data analysis", and "data visualization", authored between 2010 and 2021.
Returns a dataset containing only tweets marked as being of the target language.
Assuming the tweet text itself, and not just its metadata, is supposed to serve as a feature, language consistency is important. Removing tweets in a language other than the target language (here: English) reduces the complexity of the preprocessing steps, as these can now be performed on usable tweets only.
The dataset contains a column language, tagging each tweet with its (presumed) language. The LanguagePruner class drops rows from the dataset whose language tag differs from the target language.
In addition, we implemented a helper function within that class, which retrieves the tweet count for each language in the dataset and stores the output in data/preprocessing/language_counts.csv. This output shows the ratio of languages in the set and can help decide whether enough data is available in the target language.
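A minimal sketch of this idea, assuming a pandas DataFrame with a language column; the class and method names mirror the description above, but the actual implementation may differ:

```python
import pandas as pd

class LanguagePruner:
    """Keeps only tweets tagged with the target language."""

    def __init__(self, target_language: str = "en"):
        self.target_language = target_language

    def prune(self, df: pd.DataFrame) -> pd.DataFrame:
        # drop all rows whose language tag differs from the target language
        return df[df["language"] == self.target_language].copy()

    def language_counts(self, df: pd.DataFrame) -> pd.Series:
        # helper: number of tweets per language tag (written to CSV in our pipeline)
        return df["language"].value_counts()
```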
While discarding data in a foreign language speeds up the preprocessing step by only focusing on usable data, some issues may arise:
- There are a few inconsistencies between the actual language of a tweet and the tag in the language column. The tag might be retrieved from a user's language setting, e.g. Italian, while the same user has incidentally tweeted in English. So removing non-target-language tweets may result in some foreign-language tweets being overlooked by LanguagePruner, while some target-language tweets are removed as well.
- If there are not enough tweets in the target language, removal of foreign-language tweets results in too little available data. In this case, textual features such as TF-IDF can't be used and the removal of foreign-language tweets should be skipped, while metadata features should be prioritized.
Returns a column containing the tweet's content without URLs.
If a tweet in our dataset contains URLs, they have to be removed so that they do not influence the later feature extraction, which is based on the English language. In particular, features like named-entity recognition and sentiment analysis work more reliably if the tweet does not contain URLs.
We used a regular expression in run_preprocessing.py to filter for URLs in the tweet and remove them. This uses the class RegexReplacer in regex_replacer.py.
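The pattern below is only a rough approximation of the regular expression we use and is meant to illustrate the idea:

```python
import re

# matches http(s) links as well as bare www. links
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text: str) -> str:
    # strip URLs and collapse the whitespace they leave behind
    return re.sub(r"\s+", " ", URL_PATTERN.sub("", text)).strip()

print(remove_urls("Great read on #datascience https://t.co/abc123"))
# -> "Great read on #datascience"
```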
As argued above, removing URLs for the sake of better feature extraction seems perfectly reasonable. But does the feature extraction actually get better? If we strictly remove all URLs, we potentially lose important information: Does the tweet link to a site? If yes, which site is it, and does that contribute significantly to the virality of the tweet? Luckily, the dataset already provides a urls column containing the exact URLs to which a tweet refers. If we want to implement features that screen URLs or their contents, we are still able to do so!
We lowercase all tweet texts to be able to reliably match different capitalizations of the same word in downstream preprocessors as well as the classifier.
Sometimes, however, capitalization is used to distinguish different concepts:
"We saw a worsening in US relations to China." -> "we saw a worsening in us relations to china."
We implemented a Lowercase class that uses Python's built-in lowercasing functionality.
This step is generally useful for natural language processing and does not usually result in the loss of task-relevant information; the main exception is acronyms that spell out an existing normal word, such as "US" in the example above.
Expands contracted words into their long forms.
Contractions add redundancy, as they technically are separate tokens, even though their components overlap with other tokens. They also usually don't carry much semantic value, since their components are usually stop words. Expanding them to their respective long-form removes this redundancy and assists the stopword removal process that occurs at a later point in the pipeline.
- "isn't" --> "is not" --> tokens "is", "not" (instead of "isn't" --> token "isn't")
The contractions are expanded by the Expander class; the contraction mapping can be found here. The implementation stems from towardsdatascience.com and uses a mapping of contractions to their respective long forms.
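A minimal sketch of the expansion mechanism with a tiny excerpt of the mapping (the full list is linked above); the actual Expander may differ in detail:

```python
import re

# tiny excerpt of the contraction mapping
CONTRACTION_MAP = {"isn't": "is not", "don't": "do not", "we're": "we are"}

_pattern = re.compile("|".join(re.escape(c) for c in CONTRACTION_MAP),
                      flags=re.IGNORECASE)

def expand_contractions(text: str) -> str:
    # replace each matched contraction with its long form
    return _pattern.sub(lambda match: CONTRACTION_MAP[match.group(0).lower()], text)

print(expand_contractions("this isn't what we're after"))
# -> "this is not what we are after"
```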
Adding this preprocessing step is not strictly crucial, and one might argue that removing it would speed up the pipeline. However, it is a simple way to minimize the vocabulary of our dataset by avoiding unnecessary duplicate tokens and keeping our model faithful to the semantics: tokens with the same meaning should be treated as one item in the vocabulary, no matter whether they are contracted or not.
Removes all kinds of punctuation from the tweet
We want to focus on word semantics in our classification and match words regardless of their position within a sentence. Punctuation of any form can interfere with this. Hashtag characters are also removed, conveniently allowing us to treat hashtags as normal words.
We implemented PunctuationRemover which uses a string replacement function to remove punctuation.
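A minimal sketch of the idea using Python's built-in translation table; the actual PunctuationRemover may use a different replacement mechanism:

```python
import string

# translation table that deletes every ASCII punctuation character,
# including '#', so hashtags are kept as plain words
_punctuation_table = str.maketrans("", "", string.punctuation)

def remove_punctuation(text: str) -> str:
    return text.translate(_punctuation_table)

print(remove_punctuation("wow!! is #datascience the future?"))
# -> "wow is datascience the future"
```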
Removing punctuation may result in a loss of context in cases where punctuation is used to convey emotional content (as with exclamation marks) or vital sentence meaning (as with question marks and commas). However, we believe that punctuation in the context of tweets only marginally influences meaning, as many tweeters omit punctuation anyway and the character limit of tweets encourages terse writing.
Standardizes UK spelling variations to their US equivalent.
Spelling variations arise due to misspellings or location-based differences. Different spellings for the same word add redundancy to our features, as they are counted as different vocabulary, even though their semantics are the same. Changing variations of words, in our case location-based differences, to a standard spelling ensures that semantic information for the words is kept and that they can be further dealt with as the same word.
The tweets are standardized by the Standardizer class, while the spellings mapping can be found here. The implementation is in line with the implementation for the expansion of contractions, as seen above, and uses a mapping of UK spellings to their respective US spellings. We treat the US spellings as the standard vocabulary and change any UK variations to the standard US spelling.
As with the expansion of contractions, this preprocessing step is not strictly crucial, and one might argue that removing it would speed up the pipeline. However, it is a simple way to minimize the vocabulary of our dataset by avoiding unnecessary duplicate tokens and keeping our model faithful to the semantics: tokens with the same meaning should be treated as one item in the vocabulary, no matter how they are spelled.
Splits the tweet string into word tokens.
To ease feature extraction from tweet texts, we split them into lists of tokens at word boundaries, which lets us count words and run statistical feature extraction more easily.
The tweet texts are tokenized using NLTK (Natural Language Toolkit) in the Tokenizer class.
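A sketch of NLTK-based tokenization; whether the Tokenizer class uses word_tokenize or another NLTK tokenizer is an implementation detail:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # model required by word_tokenize

def tokenize(text: str) -> list:
    return word_tokenize(text)

print(tokenize("data visualization is not just decoration"))
# -> ['data', 'visualization', 'is', 'not', 'just', 'decoration']
```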
This step is generally necessary to process natural language and aids in the classification of the tweets.
Replaces any numerical values with a generic token
Number expressions have a high variance without much meaningful semantic difference. To improve classification, we decided to replace number expressions in tweets with a generic token.
Numbers are replaced using a regular expression in the RegexReplacer class.
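A sketch of the idea; the exact pattern and placeholder token used by RegexReplacer may differ:

```python
import re

# digit sequences, optionally with decimal points or thousands separators
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")

def replace_numbers(text: str, token: str = "<number>") -> str:
    return NUMBER_PATTERN.sub(token, text)

print(replace_numbers("we analyzed 1,200 tweets from 2021"))
# -> "we analyzed <number> tweets from <number>"
```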
Replacing numbers with a generic token has the advantage of removing unnecessary noise from the dataset to aid in classification, assuming that individual number expressions are irrelevant to the task. Since the dataset specifically encompasses tweets related to data science, there is a chance that tweeters will use numbers more frequently and that numbers have a higher significance to the tweet message, but we believe that the specific value of the number expression does not influence virality.
Reduces a word to its underlying lemma
Our goal was to generalize the form of words as far as possible while retaining their meaning. This enhances comparability between tweets and is a useful precondition for stopword removal.
We created the class Lemmatizer, which accesses the WordNetLemmatizer from NLTK and uses part-of-speech tags to replace the words in a tweet with their correct lemmas.
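A minimal sketch of POS-aware lemmatization with NLTK, assuming tokenized input; the mapping of Penn Treebank tags to WordNet POS constants is the usual idiom and may differ from our exact implementation:

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for resource in ("wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

_lemmatizer = WordNetLemmatizer()
# map Penn Treebank tag prefixes to WordNet POS constants
_TAG_MAP = {"J": wordnet.ADJ, "V": wordnet.VERB, "N": wordnet.NOUN, "R": wordnet.ADV}

def lemmatize(tokens):
    tagged = nltk.pos_tag(tokens)
    return [_lemmatizer.lemmatize(word, _TAG_MAP.get(tag[0], wordnet.NOUN))
            for word, tag in tagged]

print(lemmatize(["the", "models", "were", "running"]))
# -> ['the', 'model', 'be', 'run']
```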
We also considered using stemming instead of lemmatization. NLTK stemming is more straightforward to implement and probably has a slightly better runtime. However, we quickly decided on the lemmatizer because its accuracy is significantly better.
Removes very common words from the tweet
Our goal was to get rid of very common English words which cannot be used for meaningful features. If we left them in, they would probably not contribute to the quality of our classifier. In addition, this decreases runtime because it makes our dataset smaller, and it is the perfect preparation for our TF-IDF feature.
We created the class StopwordRemover which accesses a corpus from NLTK containing English stopwords. Every word contained in this corpus and some additional meaningless symbols (specified manually by us) are filtered out.
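A minimal sketch, assuming tokenized, lowercased input; the extra symbols we add manually are represented here by a small placeholder set:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# NLTK's English stopword list plus a few manually specified symbols
STOPWORDS = set(stopwords.words("english")) | {"``", "''", "..."}

def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]

print(remove_stopwords(["i", "really", "do", "not", "love", "barack", "obama"]))
# -> ['really', 'love', 'barack', 'obama']
```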
A significant downside of this preprocessing step is that it can negatively influence our sentiment analyzer by distorting context - depending on which words are filtered and which are not.
Original tweet:
"I really don't love Barack Obama."
Tweet after stopword removal:
"really love Barack Obama."
While being aware of this issue, we still left the preprocessing step in our pipeline because of the reasons mentioned in the "Goal"-section.
The class CharacterLength counts the number of characters used in a given tweet. Perhaps a tweet's virality somewhat depends on its length.
Extracts weekday, month, time of day and season features from the creation time of a tweet.
In the given dataset, the columns date and time contain information regarding the time the tweet was created. What they lack are meaningful categorical subsets of time. The time of tweet creation may be a valuable feature for our model, but exact dates and seconds probably don't carry too much relevant information. Instead, we have transformed the data to include the month, weekday, season, and time of day that the tweet was created. These features might hold more information as to the chance of a tweet becoming viral and are categorical, so they can be used by any classifier.
Let's say there are
The class OneHotTimeExtractor receives a feature name as input, retrieves the relevant column (either date or time) from the dataset, converts it into datetime format, and transforms the times into categorical data in the following ways:
- Weekday: Using the datetime.weekday() function, the feature is returned as an int between $0-6$.
- Month: The month of the posting is retrieved and the feature is returned as an int between $1-12$.
- Season: According to the month, the feature is returned as an int between $0-3$, denoting one of four seasons.
- Daytime: According to the hour of the posting, the feature is returned as an int between $0-3$, denoting the time of day.
Finally, the categorical features are converted into one-hot vectors using sklearn's OneHotEncoder to avoid the model learning an ordinal relation between the integer categories.
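A minimal sketch of the weekday feature; month, season, and daytime are handled analogously (the toy column and fixed category range below are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"date": ["2021-01-04", "2021-01-05", "2021-01-10"]})

# weekday as int 0-6 (Monday = 0), shaped as a single-column frame
weekday = pd.to_datetime(df["date"]).dt.weekday.to_frame()

# fix the category range so all seven weekdays get a column even if absent
encoder = OneHotEncoder(categories=[list(range(7))])
one_hot = encoder.fit_transform(weekday).toarray()
print(one_hot.shape)  # (3, 7): one binary column per weekday
```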
Transforming the date and time attributes into categorical one-hot vectors untangles the metadata of the time of post into more meaningful units.
Counts the number of items used for a given attribute or returns a boolean indicating whether the attribute is present.
A tweet's virality may be influenced by how many instances of an attribute it contains, or simply by whether the tweet has a certain attribute at all. The attributes, or columns, in question here are photos, mentions, hashtags, reply_to, urls, video, and retweet. We assume that at least a few of these attributes influence the likelihood of a tweet going viral: whether they are included at all, how many of them are included, or whether the tweet is of a certain type, as is the case for the video and retweet columns.
The class BooleanCounter receives one of the above-mentioned columns as input, as well as the desired feature type, count or boolean. As the columns contain stringified lists, the BooleanCounter interprets them as the list datatype and determines either the length of the list or whether or not the list is empty. In the case that the input column is either video or retweet, where the goal is merely to output a boolean feature on whether the tweet is of that type, BooleanCounter standardizes the True/False values of these columns into a boolean feature.
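A minimal sketch of both feature types, assuming the stringified-list format described above; the real BooleanCounter additionally handles the video and retweet special cases:

```python
from ast import literal_eval

import pandas as pd

def count_items(column: pd.Series) -> pd.Series:
    # "['#datascience', '#ml']" -> 2
    return column.apply(lambda cell: len(literal_eval(cell)))

def has_items(column: pd.Series) -> pd.Series:
    # "[]" -> False, non-empty list -> True
    return column.apply(lambda cell: len(literal_eval(cell)) > 0)

hashtags = pd.Series(["['#datascience', '#ml']", "[]"])
print(count_items(hashtags).tolist())  # [2, 0]
print(has_items(hashtags).tolist())    # [True, False]
```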
Using counts or boolean values is a simple way to represent the above-mentioned attributes without conflating their information value. We assume that the number of hashtags used in a tweet is a better indicator of the likelihood of virality than the exact hashtags that were used, as we have no additional information on the popularity or reach of a certain hashtag, only that the hashtag was used. If we had access to more information about the above attributes, such as what an attached photo depicts, current events, etc., then there would be alternative feature options to the pure count or 'exists' boolean of an attribute. However, with our dataset and prediction goal, a count or boolean feature is likely to carry the most information with regard to our hypothesis.
Extracts negativity, positivity, and neutrality scores for the tweet texts.
Our hypothesis is that the emotionality and subjectivity of tweet content influence virality. We therefore chose to employ a sentiment analyzer to extract positive and negative sentiment from the tweets.
We use the VADER sentiment analyzer to extract positive, negative, and neutral sentiment scores that range from $0$ to $1$.
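A minimal sketch using the VADER implementation shipped with NLTK (whether our pipeline uses this wrapper or the standalone vaderSentiment package is an implementation detail):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I absolutely love this visualization!")
print(scores)  # dict with 'neg', 'neu', 'pos' proportions and a 'compound' score
# we use the 'pos', 'neg' and 'neu' values as features
```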
Sentiment is often cited as one of the driving forces of content in social networks. We thus believe that it also plays a role in predicting tweet virality. However, the method of sentiment analysis used by the VADER project does not take into account sentence-level semantics but merely word-level semantics. Specifically, it uses precalculated scores for the words it finds in the tweet to calculate average sentiment scores for the whole text. This is a rather naive approach, but we believe it to be a worthwhile tradeoff between added value and performance.
Extracts "novelty" scores for the
To provide the classifier with words that are relevant for classification, we use a TF-IDF approach, which weights a word's term frequency by its inverse document frequency.
We use the TfIdfVectorizer from scikit-learn to calculate TF-IDF scores for the top
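A minimal sketch on a toy corpus; max_features stands in for the configurable vocabulary size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data science is more than machine learning",
    "machine learning needs data",
]

# restrict the vocabulary to the highest-scoring terms
vectorizer = TfidfVectorizer(max_features=100)
tfidf = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(tfidf.shape)                     # (number of tweets, vocabulary size)
```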
By providing the classifier with novel words in tweets, we allow it to infer the influence of specific words on tweet virality. Removing stop words in the preprocessing step helps weed out filler and function words from closed classes that do not carry much semantic relevance. However, restricting our TF-IDF vocabulary to
Extract a boolean feature for whether tweets are part of a "thread".
Tweeters can group multiple tweets using an informal mechanism called threads. To help classify the virality of tweets we detect whether a tweet is part of a thread.
We use a simple regular expression to match the thread emoji, as well as number expressions like
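A rough sketch of such a pattern, matching the thread emoji (U+1F9F5) and enumerations such as the hypothetical "1/7"; the exact expression in our pipeline may differ:

```python
import re

THREAD_PATTERN = re.compile("\U0001F9F5" + r"|\b\d+\s*/\s*\d+\b")

def is_thread(text: str) -> bool:
    return bool(THREAD_PATTERN.search(text))

print(is_thread("Why feature scaling matters (1/7)"))  # True
print(is_thread("Just published a new blog post"))     # False
```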
We considered matching the word 'thread' as well but decided against it, since many tweets that merely respond to threads also mention the word thread. Threads may either affect virality positively, because tweeters can post more content in one go, or negatively, because the audience may not be patient enough to read the whole thread. One way or another, we believe threads are an important attribute in characterizing a tweet and may thus be equally important in influencing virality.
Counts the occurrences of SpaCy named entities
A critical factor of a tweet's virality could be the number of referrals to a person, organization, country, etc. within the tweet. Therefore we wanted to feed our classifier this information.
We implemented the class NER, which computes the entity type of each word in a tweet using SpaCy's Entity Recognizer and a trained English pipeline. In a preliminary implementation, we used NLTK for this, but it recognized entities far less reliably, so we switched to SpaCy. After recognition, the entities are counted, the counts are collected in an array, and all the arrays are fed to our classifier.
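A minimal sketch of the counting step, assuming the small English pipeline en_core_web_sm (the trained pipeline we actually load may differ):

```python
from collections import Counter

import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def entity_counts(text: str) -> Counter:
    doc = nlp(text)
    return Counter(ent.label_ for ent in doc.ents)

print(entity_counts("Google pays $10 million to researchers in Europe"))
# e.g. Counter({'ORG': 1, 'MONEY': 1, 'LOC': 1})
```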
During testing, we noticed a major shortcoming of this feature: Most of the time it only recognizes persons and money as entities. This means that the feature is unfortunately not as useful as we initially expected. We decided to use it nevertheless because the recognition of just these entities can also help with classification.
After feature extraction, we experimentally visualized our feature space using t-distributed stochastic neighbor embedding (t-SNE), which stochastically embeds the complete feature space in a two-dimensional space for visualization purposes (positive samples are green, negative ones are red). The result is not very encouraging in terms of being able to learn a classification from the feature space. However, t-SNE mainly accounts for variance and does not consider how informative the individual dimensions are with respect to the task. Unsurprisingly, our attempts to use dimensionality reduction techniques, which also mostly take variance into account, to improve the classifier were mostly futile.
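A minimal sketch of such a visualization; the toy data generated below merely stands in for our extracted feature matrix and virality labels:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# stand-in for the extracted feature space and labels
features, labels = make_classification(n_samples=500, n_features=20, random_state=0)

embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1],
            c=["green" if y == 1 else "red" for y in labels], s=4)
plt.title("t-SNE embedding of the feature space")
plt.show()
```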
Mutual Information measures the amount of information obtained about one variable when considering another.
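The text above does not fix a particular selector, but a common way to use mutual information for feature selection is scikit-learn's SelectKBest, sketched here on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# keep the k features sharing the most information with the (virality) label
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```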
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower-dimensional space.
We wanted to remove correlated features that do not help our classification while minimizing the risk of overfitting.
We used the PCA class to project the features onto k dimensions, where k can be specified by the user in the CLI. Additionally, we iterate through the components of this projected feature space: if --verbose is set, we print useful information indicating the composition of the feature space in percent.
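A minimal sketch of this projection and the verbose output, using scikit-learn's PCA on toy data; k is the user-selectable number of components:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# stand-in for the extracted feature matrix
X, _ = make_classification(n_samples=200, n_features=20, random_state=0)

k = 20  # selectable via the CLI in our pipeline
pca = PCA(n_components=min(k, X.shape[1]))
X_projected = pca.fit_transform(X)

# with --verbose we report the composition of the projected feature space
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"component {i}: {ratio:.2%} of the variance")
```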
A possible shortcoming of PCA is its sensitivity to unstandardized data. Luckily, we take care of that in our preprocessing (see standardize.py). In addition, we have to be careful about our selection of k so that we don't lose too much information. We found that k=20 led to the best results.
Truncated SVD (TSVD) as implemented by scikit-learn, while conceptually very close to principal component analysis, might be a faster algorithm than the one implemented by the PCA class, especially for large, sparse data sets. In our case, it affected neither runtime nor classification performance.
The machine learning model is the central piece of our pipeline that has to learn a concept from the features we extracted. We chose to implement and compare the following models.
The Majority Vote Classifier classifies an instance based on the majority class in the training set.
If a dataset has a class distribution of
Compared to the Label-Frequency Classifier, the Majority Vote Classifier performs much better in terms of raw accuracy, reaching around $0.88$ on both the training and validation sets:
Metric | Training | Validation |
---|---|---|
Acc | 0.8765 | 0.8768 |
Kappa | 0.0000 | 0.0000 |
$F_1$ | 0.0000 | 0.0000 |
Balanced Acc | 0.5000 | 0.5000 |
Classifies an instance according to the distribution of labels in the training set.
The label-frequency classifier serves as an alternative baseline to the majority vote classifier.
We use sklearn's DummyClassifier on the 'stratified' strategy in run_classifier.py.
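Both baselines boil down to a DummyClassifier with different strategies; the toy data below only stands in for our feature matrix and virality labels:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# imbalanced toy data standing in for our features and virality labels
X, y = make_classification(n_samples=1000, weights=[0.88], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# 'stratified' = label-frequency baseline; 'most_frequent' = majority vote baseline
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
print(baseline.score(X_valid, y_valid))  # roughly p^2 + (1 - p)^2 for majority share p
```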
Compared to the Majority Vote Classifier, the Label-Frequency Classifier performs poorly, reaching an accuracy of only around $0.79$:
Metric | Training | Validation |
---|---|---|
Acc | 0.7853 | 0.7856 |
Kappa | 0.0005 | -0.0053 |
$F_1$ | 0.1228 | 0.1166 |
Balanced Acc | 0.5001 | 0.5028 |
Support Vector Machines have historically been very successful in binary classification tasks and were widely perceived as the silver bullet until the recent rise of neural networks.
We ran our data set through an SVM classifier as we believed it would perform best, together with the Multi-Layer Perceptron.
We employed scikit-learn's LinearSVC implementation in run_classifier.py.
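A minimal sketch of this setup on toy data; the regularization strength C shown here is a placeholder, not our optimized value:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

svm = LinearSVC(C=1.0, max_iter=10000)  # C is a hyperparameter we optimized
svm.fit(X_train, y_train)
print(svm.score(X_valid, y_valid))
```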
Results after hyperparameter optimization reveal that training with this classifier went comparatively well and neither overfit nor underfit. Nonetheless, the balanced accuracy only reaches around $0.71$:
Metric | Training | Validation |
---|---|---|
Acc | 0.7181 | 0.7178 |
Kappa | 0.2042 | 0.2003 |
$F_1$ | 0.3200 | 0.3165 |
Balanced Acc | 0.7120 | 0.7071 |
Classifies an instance based on the majority label of the k data points closest in the feature space.
We chose KNN as an additional classifier out of interest, after it had been mentioned in prior courses.
We use sklearn's KNeighborsClassifier in run_classifier.py.
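A minimal sketch of this classifier on toy data; n_neighbors, weights, and algorithm are the hyperparameters varied below:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto")
knn.fit(X_train, y_train)
print(knn.score(X_valid, y_valid))
```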
Among the different algorithms to choose from (Auto, Ball Tree, KD Tree, and Brute), Auto achieved the overall best results on the training and validation sets. We tested both uniform and distance weights with different numbers of neighbors k:
Extract from Hyperparameter Optimization Evaluation Results*:
Algorithm | Metric | Training | Validation |
---|---|---|---|
BallTree | Acc | 0.8060 | 0.7982 |
| Kappa | 0.1544 | 0.1128 |
| $F_1$ | 0.2657 | 0.2284 |
KD Tree | Acc | 0.8064 | 0.7998 |
| Kappa | 0.1523 | 0.1151 |
| $F_1$ | 0.2653 | 0.2298 |
Auto uni, 5 | Acc | 0.8824 | 0.8684 |
| Kappa | 0.1623 | 0.0587 |
| $F_1$ | 0.1940 | 0.0939 |
Auto d, 10 | Acc | 0.9338 | 0.8553 |
| Kappa | 0.6077 | 0.0434 |
| $F_1$ | 0.6394 | 0.0980 |
* performed on a subset of the dataset for efficiency
Metric | Training | Validation |
---|---|---|
Acc | 0.9928 | 0.9020 |
Balanced Acc | 0.9661 | 0.5702 |
$F_1$ | 0.9605 | 0.2367 |
Kappa | 0.9565 | 0.1963 |
KNN yields different results depending on whether sparse or dense features are used. Thus, adding Laplace smoothing and/or configuring the OneHotEncoder to return dense output might have improved the performance.
Overall, KNN performs very poorly in comparison to the baseline classifiers and is highly overfitted to the training set, with the final hyperparameter configuration only achieving a balanced accuracy of $0.5702$ and a Cohen's Kappa of $0.1963$ on the validation set.
Optimizes the log-loss function using stochastic gradient descent or L-BFGS.
Due to the greater sophistication of the MLP compared to KNN or SVM, we expected the quality of classification to be significantly better.
We used the class MLPClassifier, which takes multiple hyperparameters. These can be specified in the CLI; if none are specified, the parameters that performed best during testing are selected. We optimized the hyperparameters using grid_search.sh and mlflow. Tested hyperparameters include hidden_layer_sizes, activation, solver, and max_fun; the last one had no impact on classification quality at all.
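A minimal sketch on toy data; the hyperparameter values shown are placeholders, not the defaults we ended up with:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_valid, y_valid))
```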
To our surprise, the MLP performed slightly worse than the SVM. In particular, the balanced accuracy is significantly lower (∆ 0.1390 on the validation set) than that of the SVM.
Best Results from Hyperparameter Optimization:
Metric | Training | Validation |
---|---|---|
Acc | 0.8910 | 0.8936 |
Kappa | 0.1757 | 0.1951 |
$F_1$ | 0.2273 | 0.2455 |
Balanced Acc | 0.5681 | 0.5756 |
We specified the hyperparameters leading to this result as default values for classification with MLP in run_classifier.py.
Evaluation metrics inform decisions about the quality of a classifier. We chose to implement the following metrics for this task.
Accuracy is a very basic measure of binary classification performance, which is not very reliable in the face of imbalanced data (a naive always-false classifier would already yield an accuracy of almost $0.88$ on our data set).
Balanced accuracy, while similar to normal accuracy, yields an informative metric even for imbalanced data. It results from taking the mean of the true positive rate and the true negative rate. As we are dealing with a heavily imbalanced data set, balanced accuracy is a useful metric for assessing model performance.
Informedness, also known as Youden's J statistic or Youden's index, measures the probability of an informed decision (as opposed to a random guess) and is not skewed by imbalanced data such as the current data set.
Cohen's Kappa measures inter-rater reliability between the ground truth and the classifier in a range from $-1$ to $1$, where $0$ corresponds to chance-level agreement.
The $F_1$ score combines precision and recall into a single value between $0$ and $1$ by taking their harmonic mean.
The Matthews Correlation Coefficient is mathematically similar to Cohen's Kappa and measures the correlation between ground truth and model predictions on a range between $-1$ and $1$.
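All of these metrics except informedness are available in scikit-learn; the sketch below computes them on a toy prediction and derives informedness from balanced accuracy:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score, matthews_corrcoef)

y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # toy ground truth (1 = viral)
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # toy predictions

print("accuracy:         ", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Cohen's kappa:    ", cohen_kappa_score(y_true, y_pred))
print("F1:               ", f1_score(y_true, y_pred))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
# informedness = TPR + TNR - 1 = 2 * balanced accuracy - 1
print("informedness:     ", 2 * balanced_accuracy_score(y_true, y_pred) - 1)
```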
Our best setup turns out to be Support Vector Machines without dimensionality reduction. This may be because PCA does not take labels into account when selecting the best dimensions. Additionally, the quality of classification in this task probably depends on a multitude of minor features.
Running our best classifier on the split-off test data set achieves the following evaluation scores:
Metric | Values |
---|---|
Acc | 0.7155 |
Kappa | 0.1984 |
$F_1$ | 0.3151 |
Balanced Acc | 0.7064 |
Informedness | 0.4128 |
MCC | 0.2582 |
This result indicates a good, even if not outstanding, performance of the classifier in generalizing the phenomenon of tweet virality from our training and validation sets to previously unseen data. While the raw accuracy score is lower than that of our baseline model ($0.7853$), Cohen's Kappa is markedly higher (Baseline:
Given the features we extracted from the data and the model we selected, it appears to be possible to classify tweets as viral to some extent, even if we would have hoped for better accuracy.
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: {inlineMath: [['$', '$']]}, messageStyle: "none" });</script>