title | author | date | tags | abstract
---|---|---|---|---
Tutorial 4: Intent Recognition | | 2018-06-19 | | Lab 4: Intent Recognition
In this exercise, we consider a Human-Robot Interaction situation in which a human evaluates actions performed by the Kismet robot, by showing approval or expressing prohibition. The initial corpus contains a total of 1002 American English utterances of varying linguistic content, produced by three female speakers in three classes of affective communicative intents (approval, attention, prohibition (weak)). The affective intents sound acted and are expressed rather strongly. The speech recordings are of variable length, mostly in the range of 1.8 - 3.25 s. We extracted prosodic features such as the fundamental frequency (stored as time, value entries on each line of the feature files).

The aim of this exercise is to develop a human feedback classifier: positive (approval) / negative (prohibition). This classifier might be used to teach the robot and/or to guide the robot's learning.
Development of the human feedback classifier:

- Extraction of prosodic features ($f_0$ and energy)
- Extraction of functionals (statistics) that will be our features in the supervised learning task: mean, maximum, range, variance, median, first quartile, third quartile, mean absolute of the local derivative (defined right after this list)
- Check of the functionals for both voiced (i.e. $f_0 \neq 0$) and unvoiced segments, to determine which segments are suited for the approach
- Build two databases by randomly extracting examples: a learning database (60% of the data points) and a test one
- Train a classifier ($k$-nearest neighbors (k-NN) algorithm)
- Discuss the performance of the classifier
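For reference, by "mean absolute of the local derivative" of a feature we simply mean the average of the absolute differences between consecutive values, which matches the `abs(x.diff()).mean()` computation used in the code below:

$$\frac{1}{N-1}\sum_{i=2}^{N}\left|x_i - x_{i-1}\right|,$$

where $x_1, \dots, x_N$ are the successive values of the feature ($f_0$ or energy) in a file.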
1. Extraction of prosodic features ($f_0$ and energy)

We found that, in the given data, the file names contain either `at`, `pw` or `ap`, which likely represent the intention classes "Attention", "Prohibition (Weak)" and "Approval" respectively.

Since the aim of this exercise is to develop a human feedback classifier for positive ("Approval") / negative ("Prohibition") intentions, we keep the files labelled as `pw` and `ap` and exclude the files labelled as `at`.

For our project, we cooperated online using Google Colab. The code for extracting the files labelled as `pw` and `ap` and their prosodic features is shown below:
```python
import urllib.request
import numpy as np
import pandas as pd
from google.colab import files as google_files
import itertools
import matplotlib.pyplot as plt
import ggplot  # not used in the snippets below


def list_from_URL(file_URL, function_applied=None):
    # Fetch a text file and return its lines as a list of strings,
    # optionally applying `function_applied` to each line
    lines_bytes = urllib.request.urlopen(file_URL).readlines()
    lines = []
    for line in lines_bytes:
        line = line.decode("utf-8").rstrip()
        if function_applied is not None:
            line = function_applied(line)
        lines.append(line)
    return lines
```
```python
filenames = list_from_URL('https://raw.githubusercontent.com/youqad/Neurorobotics_Intent-Recognition/master/filenames.txt')
filenames = list(set(filenames))

files = []

for file in filenames:
    # Fundamental frequency (f0): one "time value" pair per line
    URL_f0 = 'https://raw.githubusercontent.com/youqad/Neurorobotics_Intent-Recognition/master/data_files/{}.f0'.format(file)
    file_dicts = [{key: val for key, val in zip(['time', 'f0'], map(float, l.split()))}
                  for l in list_from_URL(URL_f0)]
    # Energy (en): same time stamps, one "time value" pair per line
    URL_en = 'https://raw.githubusercontent.com/youqad/Neurorobotics_Intent-Recognition/master/data_files/{}.en'.format(file)
    for l, d in zip(list_from_URL(URL_en), file_dicts):
        d["file"] = file
        d["en"] = float(l.split()[1])
        d["label"] = file[-2:]   # 'at', 'pw' or 'ap'
    files.extend(file_dicts)

# What `files` looks like:
# files = [
#   {"file": "cy0001at", "time": 0.02, "f0": 0., "en": 0.},
#   {"file": "cy0001at", "time": 1.28, "f0": 0., "en": 0.},
#   ...
#   {"file": "li1450at", "time": 0.02, "f0": 0., "en": 0.},
#   {"file": "li1450at", "time": 1.56, "f0": 404., "en": 65.}
# ]

pd.DataFrame(files).to_csv('data.csv', encoding='utf-8', index=False)  # To reuse it next time
# google_files.download('data.csv')

# Loading the data, once it has been saved on the repository
df = pd.read_csv('https://raw.githubusercontent.com/youqad/Neurorobotics_Intent-Recognition/master/data.csv').set_index('file')
df1 = df.loc[df['label'] != 'at']   # keep only the 'ap' and 'pw' files
```
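As a quick sanity check (not in the original notebook), one could verify that only the `ap` and `pw` labels remain and count the corresponding files, for instance:

```python
# Hypothetical sanity check: number of files per remaining label
print(df1.groupby('file')['label'].first().value_counts())
```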
2. Extraction of functionals (statistics): mean, maximum, range, variance, median, first quartile, third quartile, mean absolute of the local derivative

We calculated the mean, maximum, range, variance, median, first quartile, third quartile and mean absolute of the local derivative for each `en` and `f0` file. Using a pandas DataFrame, the code to extract these functionals is the following:
```python
list_features = ['mean',
                 'max',
                 ('range', lambda x: max(x) - min(x)),
                 'var',
                 'median',
                 ('1st_quantile', lambda x: x.quantile(.25)),
                 ('3rd_quantile', lambda x: x.quantile(.75)),
                 ('mean_absolute_local_derivate', lambda x: abs(x.diff()).mean())
                 ]

df1.groupby('file')['f0','en'].agg(list_features).head()
```
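Note that the `(name, function)` tuples passed to `agg` and the `['f0','en']` tuple-style column selection rely on the pandas version available on Colab at the time. On a recent pandas, an equivalent formulation (an assumption on our part, not taken from the original notebook) would use plainly named functions, whose names become the column labels:

```python
# Possible modern-pandas equivalent of the aggregation above (hypothetical)
def value_range(x):                 # max - min
    return x.max() - x.min()
def q1(x):                          # first quartile
    return x.quantile(.25)
def q3(x):                          # third quartile
    return x.quantile(.75)
def mean_abs_local_derivate(x):     # mean absolute of the local derivative
    return x.diff().abs().mean()

functionals = ['mean', 'max', value_range, 'var', 'median', q1, q3, mean_abs_local_derivate]
df1.groupby('file')[['f0', 'en']].agg(functionals).head()
```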
Tables $1$ and $2$ show the first computed functionals for the `f0` and `en` files respectively:
Table $1$: Functionals of the first `f0` files

file | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---
cy0007pw | 92.3 | 257.0 | 257.0 | 10372.5 | 0.0 | 0.0 | 189.5 | 13.7
cy0008pw | 78.4 | 250.0 | 250.0 | 9930.1 | 0.0 | 0.0 | 192.0 | 26.4
cy0009pw | 69.1 | 243.0 | 243.0 | 8927.2 | 0.0 | 0.0 | 182.3 | 12.9
cy0010pw | 29.2 | 221.0 | 221.0 | 4696.2 | 0.0 | 0.0 | 0.0 | 15.27
cy0011pw | 110.7 | 230.0 | 230.0 | 9290.4 | 172.0 | 0.0 | 192.5 | 7.5
Table $2$: Functionals of the first `en` files

file | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---
cy0007pw | 52.3 | 71.0 | 71.0 | 228.5 | 52.0 | 41.0 | 66.0 | 2.9
cy0008pw | 47.7 | 70.0 | 70.0 | 321.9 | 43.0 | 41.0 | 64.5 | 3.9
cy0009pw | 49.5 | 74.0 | 74.0 | 260.8 | 42.0 | 40.8 | 66.0 | 3.5
cy0010pw | 46.1 | 77.0 | 77.0 | 165.8 | 42.0 | 41.0 | 50.8 | 3.3
cy0011pw | 53.7 | 71.0 | 71.0 | 258.1 | 62.0 | 41.3 | 66.0 | 2.3
3. Check functionals for both voiced (i.e. $f_0 \neq 0$) and unvoiced segments. Which segments are suited for the approach?

We extract voiced segments by only keeping the data points whose `f0` value is non-zero:
```python
voiced = df1.loc[df1['f0'] != 0].groupby('file')['f0','en'].agg(list_features)
voiced.head()  # to visualize the first rows of the dataframe
```
Similarly, we extract the unvoiced segments by looking for the data points whose `f0` value is zero (only the energy functionals are relevant there):

```python
unvoiced = df1.loc[df1['f0'] == 0].groupby('file')['en'].agg(list_features)
unvoiced.head()
```
The functionals of the first `f0` and `en` files for voiced segments are shown in Tables $3$ and $4$:
Table $3$: Functionals of the first `f0` files (voiced segments)

file | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---
cy0007pw | 200.3 | 257.0 | 90.0 | 675.9 | 191.0 | 182.5 | 213.0 | 5.9
cy0008pw | 200.0 | 250.0 | 83.0 | 538.4 | 198.5 | 179.5 | 210.0 | 10.4
cy0009pw | 194.4 | 243.0 | 77.0 | 446.9 | 190.0 | 180.0 | 209.0 | 7.2
cy0010pw | 186.1 | 221.0 | 67.0 | 465.3 | 178.5 | 171.3 | 204.3 | 6.5
cy0011pw | 191.9 | 230.0 | 66.0 | 314.8 | 190.0 | 179.0 | 204.0 | 4.1
Table $4$: Functionals of the first `en` files (voiced segments)

file | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---
cy0007pw | 65.9 | 71.0 | 16.0 | 17.8 | 66.0 | 63.5 | 70.0 | 1.7
cy0008pw | 61.0 | 70.0 | 70.0 | 242.7 | 66.0 | 61.5 | 68.0 | 5.8
cy0009pw | 67.3 | 74.0 | 20.0 | 17.9 | 68.0 | 66.0 | 70.0 | 2.9
cy0010pw | 65.8 | 77.0 | 25.0 | 50.5 | 64.0 | 62.0 | 70.8 | 4.0
cy0011pw | 65.3 | 71.0 | 19.0 | 14.7 | 65.0 | 63.0 | 68.0 | 0.9
The functionals of the first `en` files for unvoiced segments are shown in Table $5$:
Table $5$: Functionals of the first `en` files (unvoiced segments)

file | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---
cy0007pw | 40.7 | 58.0 | 58.0 | 113.6 | 41.0 | 40.5 | 43.5 | 3.7
cy0008pw | 39.2 | 58.0 | 58.0 | 189.6 | 42.0 | 41.0 | 43.0 | 5.2
cy0009pw | 39.6 | 56.0 | 56.0 | 119.6 | 41.0 | 40.0 | 42.0 | 3.6
cy0010pw | 42.4 | 68.0 | 68.0 | 101.4 | 41.0 | 40.0 | 43.0 | 3.1
cy0011pw | 37.8 | 51.0 | 51.0 | 150.9 | 41.0 | 40.0 | 42.0 | 4.1
To judge which segments are better suited for the approach, we should check how separable (with respect to the label, `ap` or `pw`) the data is in each case. We first look at the overall statistics of the two segment types for the two classes "Approval" and "Prohibition Weak". The results are shown in Table $6$.
Table $6$: Statistics of "Approval" (`ap`) and "Prohibition Weak" (`pw`) files for voiced and unvoiced segments
segments | file | class | mean | max | range | var | median | 1st_quantile | 3rd_quantile | mean_absolute_local_derivate
---|---|---|---|---|---|---|---|---|---|---
voiced | f0 | ap | 289.5 | 597.0 | 521.0 | 11013.0 | 272.0 | 199.0 | 370.5 | 24.9
voiced | f0 | pw | 192.4 | 597.0 | 522.0 | 2702.1 | 191.0 | 170.0 | 218.0 | 14.4
voiced | en | ap | 73.3 | 93.0 | 93.0 | 88.2 | 74.0 | 68.0 | 79.0 | 3.5
voiced | en | pw | 71.6 | 91.0 | 91.0 | 84.7 | 72.0 | 65.0 | 79.0 | 3.0
unvoiced | f0 | ap | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
unvoiced | f0 | pw | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
unvoiced | en | ap | 46.4 | 94.0 | 94.0 | 239.2 | 43.0 | 41.0 | 55.0 | 3.9
unvoiced | en | pw | 47.6 | 91.0 | 91.0 | 231.1 | 47.0 | 40.0 | 58.0 | 3.5
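The exact code producing Table 6 is not shown in the original notebook; a possible sketch (using the same `list_features` as above and aggregating the frame-level values per class rather than per file) could look like this:

```python
# Hypothetical reconstruction of Table 6: functionals per segment type,
# feature file (f0/en) and class, aggregated over all frames of the ap/pw files
def per_class_statistics(df):
    rows = []
    for segments, mask in [('voiced', df['f0'] != 0), ('unvoiced', df['f0'] == 0)]:
        for feature in ['f0', 'en']:
            stats = df.loc[mask].groupby('label')[feature].agg(list_features)
            for label, row in stats.iterrows():
                rows.append({'segments': segments, 'file': feature,
                             'class': label, **row.round(1).to_dict()})
    return pd.DataFrame(rows)

per_class_statistics(df1)
```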
Table 6 shows that, for the `ap` and `pw` files:

- the `en` functionals of voiced and unvoiced segments seem very close to one another between the two classes (`ap` and `pw`)
- and, of course, the `f0` functionals of unvoiced segments are all $0$.
Thus, so far, the results of Table 6 suggest that the `f0` functionals of voiced segments are the most promising for discriminating between the two classes. This intuition is backed up by what we do next: by plotting the mean absolute of the local derivative as a function of the variance for each type of segment (voiced and unvoiced), we see that the data points seem more separable (based on their class) in the voiced case than in the unvoiced one:
Figure 1 - Variance and mean absolute of the local derivative of $f_0$ for voiced segments

Figure 1 suggests that the approval and prohibition classes can be reasonably well separated using the variance and mean absolute of the local derivative of their $f_0$.

However, by doing the same for unvoiced segments, we see that the data is not as easily separable:

Figure 2 - Variance and mean absolute of the local derivative of the energy for unvoiced segments

Figure 2 shows that the data points cannot be separated well with respect to their classes using the variance and mean absolute of the local derivative of the energy for unvoiced segments.
So, on the whole, by plotting our randomly selected data points, we found that voiced segments seem better suited for the classification task at hand.
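The scatter plots of Figures 1 and 2 can be reproduced with a few lines of matplotlib. A sketch (our own, assuming the `voiced` and `unvoiced` DataFrames computed above, and reading the class off the two-letter suffix of each file name) could be:

```python
# Hypothetical sketch of Figures 1 and 2: variance vs. mean absolute of the
# local derivative, one point per file, colored by class (ap vs pw)
def scatter_by_class(stats, title):
    # stats: DataFrame indexed by file name, with the functionals of one feature as columns
    labels = np.array([f[-2:] for f in stats.index])   # 'ap' or 'pw'
    for label, color in [('ap', 'tab:blue'), ('pw', 'tab:red')]:
        subset = stats[labels == label]
        plt.scatter(subset['var'], subset['mean_absolute_local_derivate'],
                    color=color, label=label, alpha=.7)
    plt.xlabel('variance')
    plt.ylabel('mean absolute of the local derivative')
    plt.title(title)
    plt.legend()
    plt.show()

scatter_by_class(voiced['f0'], 'Voiced segments ($f_0$)')     # cf. Figure 1
scatter_by_class(unvoiced, 'Unvoiced segments (energy)')      # cf. Figure 2
```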
4. Build two databases by randomly extracting examples: a learning database ($60\%$) and a test database

We randomly extract $60\%$ of the examples to build the learning (training) database and keep the remaining ones for the test database:
```python
def train_test(df=df1, train_percentage=.6, seed=1):
    # Functionals per file, for voiced (f0 != 0) and unvoiced (f0 == 0) frames
    voiced = df.loc[df['f0'] != 0].groupby('file')['f0','en'].agg(list_features)
    unvoiced = df.loc[df['f0'] == 0].groupby('file')['en'].agg(list_features)
    X, Y = {}, {}
    X['voiced'], Y['voiced'] = {}, {}
    X['unvoiced'], Y['unvoiced'] = {}, {}
    X['voiced']['all'] = np.array(voiced)
    Y['voiced']['all'] = np.array(df.loc[df['f0'] != 0].groupby(['file']).min().label.values)
    X['unvoiced']['all'] = np.array(unvoiced)
    Y['unvoiced']['all'] = np.array(df.loc[df['f0'] == 0].groupby(['file']).min().label.values)
    np.random.seed(seed)
    for type in ['voiced', 'unvoiced']:
        n = len(X[type]['all'])
        # random training indices, sampled without replacement
        ind_rand = np.random.choice(n, size=int(train_percentage*n), replace=False)
        train_mask = np.zeros(n, dtype=bool)
        train_mask[ind_rand] = True
        X[type]['train'], X[type]['test'] = X[type]['all'][train_mask], X[type]['all'][~train_mask]
        Y[type]['train'], Y[type]['test'] = Y[type]['all'][train_mask], Y[type]['all'][~train_mask]
    return X, Y

X1, Y1 = train_test()
```
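A quick check of the resulting split sizes (not in the original notebook) could be, for instance:

```python
# Hypothetical check: how many files ended up in each subset
for segment_type in ['voiced', 'unvoiced']:
    print(segment_type, '-', len(X1[segment_type]['train']), 'training files /',
          len(X1[segment_type]['test']), 'test files')
```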
5. Train a classifier ($k$-nearest neighbors (k-NN) algorithm)

We used two implementations of a k-NN classifier, both applied to voiced and unvoiced data:

- scikit-learn's kNN classifier
- our own implementation

We first use scikit-learn's kNN classifier to get a first idea of the classification results. The code for scikit-learn's kNN classifier (with $k = 3$) is the following:
```python
# Scikit-learn's kNN classifier:
# just to test, but we will implement it ourselves of course!
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def sklearn_knn(k, X, Y):
    for type in ['voiced', 'unvoiced']:
        kNN = KNeighborsClassifier(n_neighbors=k)
        kNN.fit(X[type]['train'], Y[type]['train'])
        print("Accuracy score for {}: {:.2f}".format(type, accuracy_score(Y[type]['test'], kNN.predict(X[type]['test']))))

sklearn_knn(3, X1, Y1)
```
The classification result of scikit-learn's kNN classifier indicates that:

- Accuracy score of voiced data = $0.91$
- Accuracy score of unvoiced data = $0.61$
We can see that the accuracy of voiced data is 91%, much higher than the accuracy of unvoiced data which is only 61% (which confirms our previous intuition).
Then, we implement our own version of the algorithm, which we apply to both voiced and unvoiced data, using the following code:
```python
# Our own implementation!
from scipy.spatial.distance import cdist
from sklearn.metrics import confusion_matrix
from collections import Counter

def kNN(k, X, Y, labels=["pw", "ap"]):
    # auxiliary function: label prediction (by majority vote)
    # based on the nearest neighbors
    def predicted_label(ind_neighbors):
        label_neighbors = tuple(Y['train'][ind_neighbors])
        return Counter(label_neighbors).most_common(1)[0][0]

    # Pairwise distances between test and train data points
    dist_matrix = cdist(X['test'], X['train'], 'euclidean')
    y_predicted = []
    for i in range(len(X['test'])):
        # indices of the k nearest training neighbors of test point i
        ind_k_smallest = np.argpartition(dist_matrix[i, :], k)[:k]
        y_predicted.append(predicted_label(ind_k_smallest))
    # Confusion matrix: C[i, j] is the number of observations
    # known to be in group i but predicted to be in group j
    return confusion_matrix(Y['test'], np.array(y_predicted), labels=labels)

plt.figure()
cm = kNN(3, X1['voiced'], Y1['voiced'])
plot_confusion_matrix(cm, classes=["pw", "ap"],
                      title='Confusion matrix, with normalization')
plt.show()

cm2 = kNN(3, X1['unvoiced'], Y1['unvoiced'])
plot_confusion_matrix(cm2, classes=["pw", "ap"],
                      title='Confusion matrix, with normalization')
plt.show()
```
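The `plot_confusion_matrix` helper is not defined in the snippets above. A minimal sketch of what we assume it does (adapted from the classic scikit-learn documentation example, with row normalization so that the diagonal shows per-class accuracies) is given below; it would of course need to be defined before the plotting calls above.

```python
# Minimal (assumed) plot_confusion_matrix helper, in the spirit of the
# classic scikit-learn documentation example
def plot_confusion_matrix(cm, classes, normalize=True,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]  # row-normalize
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f'),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
```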
The result of our implementation is shown in Figure 3 and Figure 4:
Figure 3 - Confusion matrix for voiced segments

Figure 3 shows that, using voiced segments as training data, the accuracy of the classifier is very high:

- $92\%$ accuracy for the prohibition class
- and $91\%$ accuracy for the approval class
Figure 4 - Confusion matrix for unvoiced segments

Figure 4 demonstrates that, using unvoiced segments as training data, the accuracy of the classifier is worse:

- only $49\%$ for the prohibition class
- and $77\%$ for the approval class
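The per-class accuracies quoted above correspond to the diagonal of the row-normalized confusion matrix; as a small illustration (not from the original notebook):

```python
# Per-class accuracy = correctly classified / total, for each true class
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
for label, acc in zip(["pw", "ap"], per_class_accuracy):
    print("{}: {:.0%} correctly classified".format(label, acc))
```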
The results above, again, support that voiced segments are better for training the classifier, and that they enable us to recognize the two intents faithfully (with more than $90\%$ per-class accuracy).

6. Evaluate and discuss the performance of the classifier. You will discuss the relevance of the parameters ($f_0$ and energy), the role of the functionals, the role of $k$, the ratio of the learning/test databases, and the random design of the databases.
As shown in question 1.5, the classifier has relatively good results with voiced training data: more than $90\%$ accuracy for both classes.
This better accuracy for voiced training data (which encompasses the $f_0$ functionals in addition to the energy ones, whereas the unvoiced data only provides energy functionals) is consistent with the separability observed in Figures 1 and 2.
We have plotted some of the functionals against one another (as in Figures 1 and 2) to visualize this separability.
By running the k-NN classifier with different values of $k$, we found that with larger $k$ the `pw` data can be classified more accurately, but the `ap` data will be classified less accurately. This may be due to larger values of $k$ bringing more distant, and therefore less similar, examples into the majority vote.
By testing the accuracy of the classifier with different values of the learning/test ratio, we found that the classifier becomes less accurate for lower values of the ratio (due to a lack of training data) and for larger values of the ratio as well (due to overfitting of the data).
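To make this discussion concrete, one could sweep over a few values of $k$ and of the training ratio and record the test accuracy each time. A possible sketch (our own, reusing the functions defined above; the exact values depend on the random split) is:

```python
# Hypothetical parameter sweep: accuracy of the voiced-segment classifier
# as a function of k and of the training ratio
def knn_accuracy(k, X, Y):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X['train'], Y['train'])
    return accuracy_score(Y['test'], clf.predict(X['test']))

for k in [1, 3, 5, 7, 9]:
    print("k = {}: accuracy = {:.2f}".format(k, knn_accuracy(k, X1['voiced'], Y1['voiced'])))

for ratio in [.2, .4, .6, .8]:
    X_r, Y_r = train_test(train_percentage=ratio)
    print("train ratio = {:.0%}: accuracy = {:.2f}".format(
        ratio, knn_accuracy(3, X_r['voiced'], Y_r['voiced'])))
```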
Here, we use the same method as in question 1.1, but with the pandas DataFrame `df` (which contains all three classes) instead of `df1`:

```python
df.groupby('file')['f0','en'].agg(list_features).head()
```
We then develop a classifier which takes into account the third class (attention): this one corresponds to data points which are neither approvals nor prohibitions.
```python
X, Y = train_test(df=df)
sklearn_knn(3, X, Y)

plt.figure()
cm = kNN(3, X['voiced'], Y['voiced'], labels=["pw", "ap", "at"])
plot_confusion_matrix(cm, classes=["pw", "ap", "at"],
                      title='Confusion matrix, with normalization')
plt.show()

cm2 = kNN(3, X['unvoiced'], Y['unvoiced'], labels=["pw", "ap", "at"])
plot_confusion_matrix(cm2, classes=["pw", "ap", "at"],
                      title='Confusion matrix, with normalization')
plt.show()
```
As before, the classifier is more accurate for voiced segments than for unvoiced ones:

- for voiced segments: more than $80\%$ of the prohibition sounds, $56\%$ of the approval ones and $66\%$ of the attention ones were correctly classified.
- for unvoiced segments, the results are, as expected, worse: only $51\%$ of the prohibition, $45\%$ of the approval and $37\%$ of the attention sounds were correctly recognized.
On the whole:

- prohibition sounds are definitely the most recognizable ones, as indicated by the high corresponding accuracy rate
- however, the attention sounds seem to interfere with the recognition of the approval class: the classifier has a harder time recognizing the approval sounds now that there are additional attention sounds. This has a rather intuitive explanation: one can easily mistake an approval sentence for an attention-drawing one, as the tone of voice generally rises in a similar fashion.