\section{Extensions}
\label{sec:extensions}
\subsection{RNN Grid Search}
\label{sec:rnn_grid_search}
It was not clear \textit{a priori} which hyper-parameter choices were
appropriate for the RNN. To find suitable parameters, a grid search was
performed over the number of neurons in the recurrent cell, the type of
recurrent cell (LSTM or GRU), the dropout rate, and the learning rate.
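A minimal sketch of such a grid search in Keras is given below. The grid values, epoch count, and data preparation are illustrative assumptions; \texttt{X\_train} and \texttt{y\_train} are assumed to be padded word-index sequences and one-hot author labels.
\begin{verbatim}
# Sketch of the hyper-parameter grid search.
# X_train: padded word-index sequences;
# y_train: one-hot author labels (assumed).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (
    Input, Embedding, LSTM, GRU, Dense)
from tensorflow.keras.optimizers import Adam

VOCAB, SEQ_LEN, AUTHORS = 20000, 50, 3  # placeholders

def build(cell, n_neurons, dropout, lr):
    model = Sequential([
        Input(shape=(SEQ_LEN,)),
        Embedding(VOCAB, 128),
        cell(n_neurons, dropout=dropout),
        Dense(AUTHORS, activation="softmax"),
    ])
    model.compile(optimizer=Adam(lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

results = {}
for cell in (LSTM, GRU):
    for n in (10, 50, 100, 150):
        for drop in (0.0, 0.1, 0.2, 0.4):
            for lr in (1e-4, 1e-3):
                m = build(cell, n, drop, lr)
                h = m.fit(X_train, y_train, epochs=20,
                          validation_split=0.2,
                          verbose=0)
                results[(cell.__name__, n, drop, lr)] = \
                    max(h.history["val_accuracy"])
\end{verbatim}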
\begin{figure}[ht]
\includegraphics[width=0.45\textwidth]{Figures/lstm_gru_plot.png}
\caption{Comparison of LSTM and GRU cell performance across all grid-search
measurements.}
\label{fig:lstm_gru}
\end{figure}
Figure \ref{fig:lstm_gru} compares the performance of LSTM and GRU cells
over all measurements. The GRU cell achieved higher median, lower-quartile,
and upper-quartile accuracies than the LSTM cell. However, once outliers are
included, the GRU cell was less consistent, producing the single
lowest-accuracy result. The single highest accuracy came from an LSTM cell,
but only by a small margin.
\begin{figure}[ht]
\includegraphics[width=0.45\textwidth]{Figures/n_neurons_plot-0-0001.png}
\caption{Using a learning rate of 0.0001: the relationship between the number
of neurons in the recurrent cell and classification accuracy, for GRU and
LSTM at each dropout rate.}
\label{fig:learn_rate_0.0001}
\end{figure}
Figure \ref{fig:learn_rate_0.0001} shows the results for training with a
learning rate of 0.0001. At this learning rate the GRU saw slight
improvements in accuracy as the number of neurons increased up to 100, after
which accuracy declined, most likely due to mild overfitting of the training
set. Even the lowest non-zero dropout rates only approached the performance
of no dropout at the largest numbers of neurons tested; across most of the
data, increasing the dropout rate clearly reduced accuracy.
The LSTM cell performed much worse than the GRU at this learning rate, with
classification accuracy declining quickly beyond 110 neurons. At small
numbers of neurons, dropout was similarly detrimental as for the GRU. As
performance degraded at higher numbers of neurons, however, moderate dropout
rates (up to 0.2) out-performed the network without dropout. This suggests
more severe overfitting than for the GRU cell, perhaps because the LSTM is a
more complex construction with more parameters per unit.
Figure \ref{fig:learn_rate_0.001} shows results for a higher learning rate
of 0.001. At this learning rate there was a clear performance degradation as
the number of neurons increased, particularly up to 100 neurons. The results
were also far more chaotic, with measurements spiking up and down; further
tests confirmed a higher variance in the results at this learning rate than
at 0.0001. Results at a further increased learning rate of 0.01 varied so
much as to be unreproducible. This suggests that a learning rate of 0.001
was too high to capture the features of this optimisation problem, with the
issue worsening as neurons were added and the fitness landscape grew more
complex.
Nevertheless, this higher learning rate achieved the highest accuracies in
the test at small numbers of neurons. This may indicate that the 0.0001
grid search had not trained for sufficient epochs before stopping, giving
the higher learning rate the upper hand on the simpler learning problems.
\begin{figure}[ht]
\includegraphics[width=0.45\textwidth]{Figures/n_neurons_plot-0-001.png}
\caption{Using a learning rate of 0.001: the relationship between the number
of neurons in the recurrent cell and classification accuracy, for GRU and
LSTM at each dropout rate.}
\label{fig:learn_rate_0.001}
\end{figure}
Dropout rates had much less of an effect at this higher learning rate;
however, the highest results were still mostly obtained at the lower dropout
rates. The cell type also seemed to matter much less. Perhaps the difficulty
of training at this higher learning rate also reduced overfitting.
Following these observations from the grid search, recurrent neural
networks were trained using a learning rate of 0.0001 over ten times as many
epochs, with a variety of lower dropout rates. These models did eventually
overfit much more severely, achieving a training-set accuracy of around 90\%
but only 70\% test accuracy and a log-loss of 1.09. Small changes in the
dropout rate had very little effect on the performance of the RNN. The
confusion matrix is given in Table \ref{tab:rnn_res}.
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 721 & 261 & 120 & \\[5pt]
\cline{2-4}
HPL & 163 & 1218 & 219 & \\[5pt]
\cline{2-4}
MWS & 120 & 311 & 783 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Final results for the RNN.\\Loss: 1.09; Accuracy: 70\%}
\label{tab:rnn_res}
\end{table}
\subsection{Beta Distribution Bag-of-words}
\label{sec:bow_beta}
The bag-of-words model described in Section \ref{sec:bag_of_words} has some substantial issues regarding the probability estimates $p_w^a$. Most notable is its bias towards words which occur only a very few times, and only with a single author.
A good example is the word ``marker''. In the training set Lovecraft uses it exactly once, while neither Shelley nor Poe uses it at all. Is it fair for this to be considered a ``signature word'' for Lovecraft? It could simply be that each author uses it with a similar frequency and it just happens to have been observed only once.
A better estimate for the usage probabilities can be provided by beta distributions, which provide a probability distribution for the actual occurrence rate of an event. By making some assumptions and simplifications - such as the usage of words by an author being independent of one another and a uniform prior distribution - it is possible to easily form this revised estimate. While these simplifications are obviously not a complete reflection of the real world, the posterior probability given by this model is likely to still be an improvement over that in Section \ref{sec:bag_of_words}.
The beta distribution is denoted $Beta(\alpha, \beta)$ and has mean $\mathbb{E}(Beta(\alpha, \beta)) = \frac{\alpha}{\alpha + \beta}$. For $n$ random, independent trials with a uniform prior in which some outcome $x$ occurs $s$ times, the posterior distribution for the occurrence probability of $x$ is:
\begin{equation*}
p(x) = Beta(\alpha + s, \beta + n - s)
\end{equation*}
This posterior has expectation:
\begin{equation*}
\mathbb{E}(p(x)) = \frac{\alpha+s}{\alpha + \beta + n}
\end{equation*}
With a uniform prior ($\alpha = \beta = 1$) this is Laplace's rule of succession: a word seen $s$ times in $n$ words is assigned probability $\frac{s+1}{n+2}$.
This gives a new estimate of each author's word frequencies that does not unduly favour extremely rare words. Moreover, the formula still holds when a word has not been observed in an author's vocabulary at all. Because of this, the hierarchical scoring system presented in Section \ref{sec:bag_of_words} is no longer necessary, and probabilities may be compared directly between authors. This also allows a loss to be calculated.
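As a concrete sketch, this smoothed estimate can be computed directly from word counts; the count structures below are illustrative assumptions rather than the project's actual data format.
\begin{verbatim}
# Beta-smoothed estimate of one author's usage
# rate for a word, under a uniform prior
# (alpha = beta = 1). counts: word -> number of
# occurrences for this author; n: the author's
# total word count.
def smoothed_prob(word, counts, n,
                  alpha=1.0, beta=1.0):
    s = counts.get(word, 0)  # unseen words are fine
    return (alpha + s) / (alpha + beta + n)

# A word seen once in 100000 words is no longer a
# decisive "signature": ~2e-5 for its one user
# versus ~1e-5 for an author who never used it.
\end{verbatim}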
Utilising this new estimate, the confusion matrix, accuracy, and loss are given in Table \ref{tab:beta_res}.
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 4657 & 675 & 968 & \\[5pt]
\cline{2-4}
HPL & 286 & 3914 & 333 & \\[5pt]
\cline{2-4}
MWS & 285 & 316 & 4229 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Results for the bag-of-words classifier with beta distributions, stemming and lemmatisation enabled.\\Loss: 3.34; Accuracy: 82\%}
\label{tab:beta_res}
\end{table}
\subsection{Bag-of-words Feature Kernel Neural Net}
\label{sec:bow_nn}
The bag-of-words predictors described in Sections \ref{sec:bag_of_words} and \ref{sec:bow_beta} perform excellently in terms of accuracy, but their loss (observable only when utilising beta distributions) is very poor. Conversely, the windowed neural net predictor in Section \ref{sec:win_nn} has comparatively poor accuracy but performs vastly better with regard to loss.
This is because the neural network is trained directly against the loss function, whereas the bag-of-words models are not. Combining the two approaches -- the simplicity of a bag-of-words model with the loss-optimising nature of a neural net -- is likely to provide the best of both worlds with regard to loss and accuracy.
The beta-distribution bag-of-words predictor uses solely the product of word probabilities to form its prediction. While a classifier could be built on this alone, further metrics can be offered as inputs to improve performance. In this case the following are provided to the input layer, for each author:
\begin{itemize}
\item The mean log probability of all words\footnote{For any given sentence this is simply a length-scaled version of the log-likelihood; taking the mean keeps the magnitude of this variable independent of sentence length.}
\item The max and min log probability of all words
\item The standard deviation of log probabilities of all words
\item The number of ``missed'' words in the sentence\footnote{``Missed'' words are those never observed in an author's training-set vocabulary. While they are still assigned a probability as in Section \ref{sec:bow_beta}, the fact that they were missed is recorded.}
\end{itemize}
Additionally, the sentence length is included, based on the observations in Section \ref{sec:sentence_lengths}. This gives a 16-neuron input layer, assembled as sketched below.
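The following sketch assembles the 16-value input vector for a single tokenised sentence, reusing the (illustrative) \texttt{smoothed\_prob} helper from Section \ref{sec:bow_beta}; variable names are placeholders.
\begin{verbatim}
import numpy as np

# authors: name -> (counts, n) per the smoothing
# sketch above; words: a tokenised sentence.
def sentence_features(words, authors):
    feats = []
    for counts, n in authors.values():
        logps = np.log([smoothed_prob(w, counts, n)
                        for w in words])
        missed = sum(1 for w in words
                     if w not in counts)
        feats += [logps.mean(), logps.max(),
                  logps.min(), logps.std(), missed]
    feats.append(len(words))  # sentence length
    return np.array(feats)    # 3 * 5 + 1 = 16
\end{verbatim}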
The confusion matrix, accuracy, and loss for this model may be seen in Table \ref{tab:bow_nn_res}. The results presented use a perceptron with no hidden layers; adding hidden layers did not improve performance. Noticeable overfitting is observed with this architecture: its training loss is $<0.2$ and its training accuracy $>90\%$. Attempts to overcome this, such as compressing the model through a 2-neuron layer and increasing batch sizes, were not successful.
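The no-hidden-layer perceptron itself is very small; the following Keras sketch reflects the architecture described above, with the optimiser choice being an assumption.
\begin{verbatim}
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

# Softmax regression over the 16 features, trained
# directly against the multi-class log-loss.
model = Sequential([
    Input(shape=(16,)),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}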
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 3747 & 550 & 236 & \\[5pt]
\cline{2-4}
HPL & 419 & 5300 & 581 & \\[5pt]
\cline{2-4}
MWS & 286 & 582 & 3962 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Results for the bag-of-words feature kernel neural net, stemming and lemmatisation enabled.\\Loss: 0.46; Accuracy: 83\%}
\label{tab:bow_nn_res}
\end{table}
\subsection{Random Forest}
\label{sec:forest_extension}
Following the success of the bag-of-words neural net, the same features were
used with a random forest, achieving 78.8\% accuracy even with only four
estimators. From 80 estimators upwards\footnote{Tested up to 40,000
estimators.} the accuracy plateaus, never rising above 83\%.
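A scikit-learn sketch, assuming the same 16-column feature matrices as in Section \ref{sec:bow_nn} (variable names are placeholders):
\begin{verbatim}
from sklearn.ensemble import RandomForestClassifier

# X_train/X_test: 16-column feature matrices;
# y_train/y_test: author labels (assumed).
forest = RandomForestClassifier(n_estimators=80,
                                random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # ~0.83 above
\end{verbatim}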
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 3742 & 549 & 242 & \\[5pt]
\cline{2-4}
HPL & 403 & 5335 & 562 & \\[5pt]
\cline{2-4}
MWS & 288 & 600 & 3942 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Final results for Random Forest (80 trees)\\Accuracy: 83\% }
\label{tab:forest_res}
\end{table}
\subsection{SVM}
\label{sec:svm_extension}
The bag-of-words features were also used to train an SVM. RBF, polynomial,
and linear kernels all achieved 83\% accuracy; a sigmoid kernel managed only
67\%. The SVM, the random forest, and the probabilistic neural network all
top out at 83\% accuracy, so this is likely close to the highest accuracy
obtainable from the bag-of-words features.
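The corresponding scikit-learn sketch, again over the assumed bag-of-words feature matrices:
\begin{verbatim}
from sklearn.svm import SVC

# probability=True enables the class-probability
# estimates needed to compute a log-loss.
svm = SVC(kernel="rbf", probability=True)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))  # ~0.83 above
\end{verbatim}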
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 3768 & 517 & 248 & \\[5pt]
\cline{2-4}
HPL & 421 & 5288 & 591 & \\[5pt]
\cline{2-4}
MWS & 292 & 608 & 3930 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Final results for SVM (RBF Kernel)\\Accuracy: 83\% }
\label{tab:svm_res}
\end{table}
\subsection{Ensemble}
\label{sec:ensemble}
It was expected that the windowed neural network and the bag-of-words neural
network would be reasonably uncorrelated because they use such different
input features. To take advantage of this, they were combined into an
ensemble: a single densely connected perceptron layer with sigmoid
activation was used to weight and combine their outputs. However, when
tested, this ensemble performed the same as the bag-of-words neural network
alone, presumably having learnt to reproduce its output.
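A sketch of this combining layer in Keras follows; the base-model names are hypothetical, and the single sigmoid-activated dense layer reflects the description above (a softmax would be the more conventional choice for three-class probabilities).
\begin{verbatim}
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

# Per-author probabilities from the two trained
# base models (hypothetical names), stacked into
# a 6-value input per sentence.
p_win = window_net.predict(X_window)  # (n, 3)
p_bow = bow_net.predict(X_bow)        # (n, 3)
stacked = np.hstack([p_win, p_bow])

# Single dense combining layer with sigmoid
# activation, as described in the text.
ensemble = Sequential([
    Input(shape=(6,)),
    Dense(3, activation="sigmoid"),
])
ensemble.compile(optimizer="adam",
                 loss="categorical_crossentropy")
ensemble.fit(stacked, y_train, epochs=20)
\end{verbatim}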
\subsection{Translation}
\label{sec:translation}
The dataset used for training in this work is relatively small, so any way to meaningfully augment it may be beneficial.
In other natural language processing tasks this is often achieved by translating to and from other languages. This creates new sentences that preserve general structure and sentiment but differ in precise word choice.
If the key to differentiating authors were precise patterns of word usage -- such as hallmark turns of phrase -- then this augmentation would likely be detrimental, as it would not preserve that sort of detail. However, the success of the bag-of-words classifiers suggests that sentiment, subject, and tone may be better predictors of the author, in which case dataset augmentation may prove useful.
This was achieved using a function by Pavel Ostyakov \cite{Ostyakov2018}: \textit{TextBlob} is used to translate sentences to German, Spanish, and French, and then back to English.
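A sketch of the round-trip step is given below (this is not Ostyakov's exact code; TextBlob's \texttt{translate} API relied on an online translation service and has since been deprecated).
\begin{verbatim}
from textblob import TextBlob

def back_translate(sentence, via="de"):
    # Round-trip through another language to
    # paraphrase; may raise NotTranslated if the
    # text comes back unchanged.
    pivot = TextBlob(sentence).translate(
        from_lang="en", to=via)
    return str(pivot.translate(from_lang=via,
                               to="en"))

augmented = [back_translate(s, via=lang)
             for s in sentences  # training data
             for lang in ("de", "es", "fr")]
\end{verbatim}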
Using the augmented dataset did not improve classification accuracy, but it did slightly improve the loss of the bag-of-words feature kernel neural net. These results may be seen in Table \ref{tab:bow_nn_res_tran}.
\begin{table}[h]
\centering
\begin{tabular}{m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{0cm}}
\multicolumn{1}{m{1cm}}{} & \multicolumn{1}{m{1cm}}{EAP} & \multicolumn{1}{m{1cm}}{HPL} & \multicolumn{1}{m{1cm}}{MWS} &\\[5pt]
\cline{2-4}
EAP & 14952 & 2193 & 987 & \\[5pt]
\cline{2-4}
HPL & 1669 & 21372 & 2159 & \\[5pt]
\cline{2-4}
MWS & 1046 & 2330 & 15944 & \\[5pt]
\cline{2-4}
\end{tabular}
\caption{Results for the bag-of-words feature kernel neural net, stemming and lemmatisation enabled, using the translation-augmented dataset.\\Loss: 0.43; Accuracy: 83\%}
\label{tab:bow_nn_res_tran}
\end{table}