# Supervised learning {#sec-sl}

The goal of this last machine learning chapter is supervised
learning, and in particular **classification**. In this chapter, you
will

- learn what classification is;
- learn about labelled, unlabelled, training and testing data;
- learn about and apply k-nearest neighbours, a simple non-parametric
  classifier;
- see why and how to use cross-validation;
- and learn about model complexity and generalisation.
## Introduction

Often, when faced with omics data, we have annotations for some of the
samples (or features) we have acquired:

- Among the hundreds of patient tumours that were assayed using RNA
  sequencing or microarrays, we know the grade of the tumour for about
  half. We want to predict the grade of the other half using the gene
  expression profiles.
- We have performed a spatial proteomics experiment such as in
  Christoforou *et al.* [@Christoforou:2016] (see section
  \@ref(sec-dimred02)) and know the sub-cellular localisation of some
  proteins. We want to predict the sub-cellular localisation of the
  other proteins using the protein profiles.

In both of these examples, the quantitative data are the data that we
want to use to predict properties of samples or features; these are
called the **predictors**. The grade of the samples in the first
example and the protein sub-cellular localisation in the second are
the **labels** that we have in some cases, and want to predict
otherwise. We can thus split our data in two parts, depending on
whether we have labels or want to predict them. The former data are
called **labelled**, and the latter **unlabelled**.
```{r, echo = FALSE, fig.cap = "In supervised learning, the data are split in labelled or unlabelled data. The same applies when some of the columns are labelled.", fig.width = 12, fig.height = 5}
par(mfrow = c(1, 2), mar = c(0, 1, 2, 0), oma = c(0, 0, 0, 0))
rWSBIM1322:::sl_input_rows(main = "All data")
text(rep(0.8, 8), (1:8) + 0.5, 1:8)
rWSBIM1322:::blank_plot()
title(main = "Labelled and unlabelled data")
rWSBIM1322:::quant_data()
rect(9.2, 8, 10.2, 9, col = "steelblue")
rect(9.2, 7, 10.2, 8, col = "steelblue")
rect(9.2, 6, 10.2, 7, col = "red")
rect(9.2, 5, 10.2, 6, col = "red")
rect(9.2, 4, 10.2, 5, col = "green")
text(rep(0.8, 8), (1:8) + 0.5,
     c(2, 4, 7, 5, 1, 6, 3, 8))
```
In the figure above, the labels represent categories that need to be
inferred from the predictors. This class of supervised learning is
called **classification**. Classification tasks are further split into
**binary classification**, when there are only two classes, and
**multi-class classification** when, like above, there are more than
two; the latter is a generalisation of the binary task. When the
annotations are continuous values, the situation is referred to as a
**regression** problem.
`r msmbstyle::question_begin()`
Load the `giris2` data from the `rWSBIM1322` package (requires version
>= 0.1.13). Identify the labelled and unlabelled data; how many are
there, respectively?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
library("rWSBIM1322")
data(giris2)
summary(giris2)
table(is.na(giris2$GRADE))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the `giris2` data on a PCA plot, highlighting the labels (or
absence thereof). From the visualisation, do you think classifying the
unlabelled data will be difficult or easy? Defend your answer.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
pca <- prcomp(giris2[, -5], scale = TRUE)
cls <- as.character(giris2$GRADE)
cls[is.na(cls)] <- "unknown"
factoextra::fviz_pca_ind(pca, col.ind = cls)
```
`r msmbstyle::solution_end()`
## K-nearest neighbours classifier
In this chapter, we'll use a simple but useful classification
algorithm, k-nearest neighbours (kNN), to classify the *giris2*
patients. We will use the `knn` function from the
`r CRANpkg("class")` package.

K-nearest neighbours works by directly measuring the (Euclidean)
distances between observations and inferring the class of unlabelled
data from the classes of their nearest neighbours. In the figure
below, the unlabelled instances *1* and *2* will be assigned classes
*A* (blue) and *B* (red) as their closest neighbours are blue and red,
respectively.
```{r knnex, echo=FALSE, fig.cap="Schematic illustrating the k nearest neighbours algorithm."}
p1 <- c(0, 0)
p2 <- c(0.7, 0.5)
x1 <- rbind(c(0.2, 0.2),
            c(-0.3, -0.8),
            c(-0.2, 1.3))
x2 <- rbind(c(1, 1),
            c(0.5, 0.7))
x3 <- c(1.5, -.9)
x <- rbind(p1, p2, x1, x2, x3)
col <- c("black", "black",
         rep("steelblue", 3),
         rep("red", 2),
         "darkgreen")
plot(x, pch = 19, col = col,
     cex = 5, xlab = "", ylab = "",
     xaxt = "n", yaxt = "n")
grid()
text(p1[1], p1[2], "1", col = "white", cex = 2)
text(p2[1], p2[2], "2", col = "white", cex = 2)
for (i in 1:3)
    segments(p1[1], p1[2],
             x1[i, 1], x1[i, 2],
             lty = "dotted",
             col = "steelblue")
segments(p2[1], p2[2],
         x1[1, 1], x1[1, 2],
         lty = "dotted",
         col = "steelblue")
for (i in 1:2)
    segments(p2[1], p2[2],
             x2[i, 1], x2[i, 2],
             lty = "dotted",
             col = "red")
legend("topright",
       legend = LETTERS[1:3],
       pch = 19,
       col = c("steelblue", "red", "darkgreen"),
       cex = 2,
       bty = "n")
```
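To make these mechanics concrete, here is a minimal sketch of a kNN
vote for a single unlabelled point, using base R only (the toy data
below mimic the figure and are made up for illustration):

```r
## Toy labelled data: two predictors, two classes
train_x <- rbind(c(0.2, 0.2), c(-0.3, -0.8), c(-0.2, 1.3),
                 c(1.0, 1.0), c(0.5, 0.7))
train_y <- c("A", "A", "A", "B", "B")
new_x <- c(0.7, 0.5) ## unlabelled instance 2 in the figure

## Euclidean distances from the new point to all labelled points
d <- sqrt(colSums((t(train_x) - new_x)^2))
## Majority vote among the k = 3 nearest neighbours
k <- 3
nn <- order(d)[1:k]
names(which.max(table(train_y[nn]))) ## "B", as in the figure
```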
Typically in machine learning, there are two clear steps, where one
first **trains** a model and then uses the model to **predict** new
outputs (class labels in this case). In kNN, these two steps are
combined into a single function call to `knn`.
`r msmbstyle::question_begin()`
Separate the `giris2` data into two new datasets, one containing the
labelled data that will be used to train the model and named
`giris2_labelled`, and a second one containing the unlabelled data called
`giris2_unlabelled`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
giris2_labelled <- giris2[!is.na(giris2$GRADE), ]
giris2_unlabelled <- giris2[is.na(giris2$GRADE), ]
```
`r msmbstyle::solution_end()`
The `knn` function takes, among others[^knnargs], the following arguments:
[^knnargs]: we will see additional ones later.
1. the labelled predictors, that will be used to *train* the model,
2. the unlabelled predictors, on which the model will be applied (see
below why this is called *test*),
3. the labels (the length of this vector must match the number of rows
of the labelled predictors).
`r msmbstyle::question_begin()`
- Apply the kNN classifier to the `giris2` data prepared in the
  previous exercise.
- What is the class of the output?
- How many of the unlabelled data have been assigned to the different
  grades?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
library("class")
knn_res <- knn(giris2_labelled[, -5],
               giris2_unlabelled[, -5],
               giris2_labelled[, 5])
class(knn_res)
table(knn_res)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the results of the kNN classification on a PCA plot and
interpret them by comparing with the first PCA plot above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.cap = "Visualisation of the labelled and unlabelled data before (top) and after (bottom) classification.", fig.height = 10}
giris2_res <- giris2
giris2_res[is.na(giris2$GRADE), 5] <- knn_res
p1 <- factoextra::fviz_pca_ind(pca, col.ind = cls)
p2 <- factoextra::fviz_pca_ind(pca, col.ind = giris2_res$GRADE)
library("patchwork")
p1 / p2
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
- Can you identify some questionable results? Explore the results for
  patient 167, who was assigned grade B.
- To do so, calculate the distances between sample 167 and all other
  labelled data.
- Compare the labels of its 15 closest labelled data points.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, message = FALSE}
## all pairwise distances
dd <- as.matrix(dist(giris2[, -5]))
## extract only those for sample 167
giris2$dist_167 <- dd[167, ]
## look at the 15 nearest labelled neighbours
library("tidyverse")
vote <- giris2 %>%
    head(150) %>% ## keep the labelled data (first 150 rows)
    arrange(dist_167) %>%
    head(15) %>%  ## the 15 nearest neighbours
    mutate(vote = cumsum(ifelse(GRADE == "C", 1, -1))) %>%
    mutate(grade = ifelse(vote <= 0, "B", "C"))
vote
vote %>%
    ggplot(aes(x = 1:15, y = vote)) +
    xlab("Number of nearest neighbours") +
    geom_point(aes(colour = grade), size = 3) + geom_line() +
    geom_hline(yintercept = 0)
```
`r msmbstyle::solution_end()`
As we have seen, the number of nearest neighbours *k* has an important
influence on the classification results. We can now refine our
understanding of the `knn` function; it has the following arguments:
1. the labelled predictors, that will be used to *train* the model,
2. the unlabelled predictors, on which the model will be applied (see
below why this is called *test*),
3. the labels (the length of this vector must match the number of rows
of the labelled predictors),
4. the number of neighbours to use,
5. the `prob` argument: when set to `TRUE`, the proportion of votes in
   favour of the assigned class is also returned.
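As an illustration, a call using all five arguments could look like
this (a sketch; the choice of `k = 5` here is arbitrary):

```r
knn_k5 <- knn(giris2_labelled[, -5],   ## labelled predictors (training)
              giris2_unlabelled[, -5], ## unlabelled predictors (test)
              giris2_labelled[, 5],    ## the training labels
              k = 5,                   ## number of neighbours to use
              prob = TRUE)             ## return the vote proportions
## proportion of votes for the winning class, one value per prediction
head(attr(knn_k5, "prob"))
```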
`r msmbstyle::question_begin()`
- Repeat the kNN search comparing k=1 (as in our first attempt, since
  `k = 1` is the default) and k=11, and compare the results.
- Which one is correct?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
knn_1 <- knn(giris2_labelled[, -5], giris2_unlabelled[, -5],
             giris2_labelled[, 5], k = 1)
knn_11 <- knn(giris2_labelled[, -5], giris2_unlabelled[, -5],
              giris2_labelled[, 5], k = 11)
table(knn_1, knn_11)
```
`r msmbstyle::solution_end()`
## Model selection
There is no way to decide which of the sets of results above (using
k=1, 2, ..., 11 or any other value of k) is better, at least not while
proceeding as above. To make an informed decision about the **model
parameter** k, we need to measure the **performance** of the kNN
classifier for different values of k. To do so, we are going to create
**training** and **testing** sets using the labelled data, each
composed of a certain proportion of the original labelled data.
Below, we denote the training predictors $X_{tr}$ and labels $Y_{tr}$ and
the testing predictors $X_{te}$ and labels $Y_{te}$.
```{r, echo = FALSE, fig.cap = "Training and testing sets."}
par(mar = c(1, 1, 1, 1))
rWSBIM1322:::blank_plot()
rect(1, 1, 8, 4, col = "orange", border = NA)
rect(8.2, 1, 9.2, 4, col = "orange", border = NA)
rect(1, 4.2, 8, 9, col = "steelblue", border = NA)
rect(8.2, 4.2, 9.2, 9, col = "steelblue", border = NA)
rect(1, 1, 8, 9, lwd = 2)
text(4.5, 9.5, "X", cex = 3)
rect(8.2, 1, 9.2, 9, lwd = 2)
text(8.7, 9.5, "Y", cex = 3)
text(4.5, 7, expression(X[tr]), cex = 2)
text(4.5, 2.5, expression(X[te]), cex = 2)
text(8.7, 7, expression(Y[tr]), cex = 2)
text(8.7, 2.5, expression(Y[te]), cex = 2)
text(9.6, 6.5, expression(training), srt = 90, cex = 1.5)
text(9.6, 2.5, expression(testing), srt = 90, cex = 1.5)
```
We are now going to apply the following procedure:
1. hide the testing labels $Y_{te}$,
2. train a classifier model using $X_{tr}$ and $Y_{tr}$,
3. apply it on $X_{te}$ to obtain a new $Y_{te}^{predicted}$, and
4. compare $Y_{te}$ to $Y_{te}^{predicted}$.
There are numerous ways to measure the performance of a classifier
using $Y_{te}$ and $Y_{te}^{predicted}$. We are going to focus on the
**classification accuracy**, i.e. the proportion of correct results,
as illustrated in the sketch below.
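A minimal sketch of the accuracy computation, using two made-up
vectors of predicted and true labels:

```r
predicted <- c("A", "B", "B", "A", "C")
truth     <- c("A", "B", "C", "A", "C")
mean(predicted == truth) ## 4 correct out of 5, i.e. an accuracy of 0.8
```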
`r msmbstyle::question_begin()`
Choose 100 random labelled data points to define the training
data. Use the 50 others as test data.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
set.seed(1)
i <- sample(150, 100)
giris2_train <- giris2_labelled[i, ]
giris2_test <- giris2_labelled[-i, ]
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Run two kNN classifications, one with k = 1, then one with k = 11 and
compare each to the true results. A good way to assess the results is
to generate a contingency table (using the `table` function) that
tallies matches and mis-matches between the predictions and expected
assignments. Interpret these results.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
test_1 <- knn(giris2_train[, -5], giris2_test[, -5],
              giris2_train[, 5], k = 1)
test_11 <- knn(giris2_train[, -5], giris2_test[, -5],
               giris2_train[, 5], k = 11)
table(giris2_test[, 5], test_1)
table(giris2_test[, 5], test_11)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Calculate the classification accuracy of the two classifiers above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
sum(test_1 == giris2_test[, 5])/length(test_1)
sum(test_11 == giris2_test[, 5])/length(test_11)
```
`r msmbstyle::solution_end()`
There are now two adjustments that we can make to improve on the
procedure above:

1. Test all odd values of k from 1 to 11 (or higher, if deemed
   useful), to explore the model parameter at a finer granularity.
2. Test more than one training/testing split. By relying on a single
   random split, we depend on one random configuration, which could
   affect the results either overly optimistically or overly
   pessimistically. We prefer to repeat the split a certain number of
   times and calculate an average performance over all splits.
## k-fold cross-validation

There exist several principled ways to split data into training and
testing partitions. Here, we are going to focus on k-fold
cross-validation, where data of size $n$ are repeatedly split into a
training set of size around $\frac{n (k - 1)}{k}$ and a testing set of
size around $\frac{n}{k}$. In practice, the data are split into $k$
partitions of size $\frac{n}{k}$, and the training/testing procedure
is repeated $k$ times, using $k - 1$ partitions as training data and 1
partition as testing data.
The figure below illustrates the cross-validation procedure, creating
3 folds. One would typically do a 10-fold cross-validation (if the
size of the data permits it). We split the data into 3 random and
complementary folds, so that each data point appears in exactly one
fold. This leads to a total test set size that is identical to the
size of the full annotated dataset, but composed of out-of-sample
predictions (i.e. a sample is never used for training and testing in
the same iteration).
```{r, echo = FALSE, fig.cap = "The data is split into 3 folds. At each training/testing iteration, a different fold is used as test partition (white) and the two other ones (blue) are used to train the model.", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/xval.png")
```
After cross-validation, all models used within each fold are
discarded, and a new model is built using the whole labelled dataset,
with the best model parameter(s), i.e. those that generalised best
over all folds. This makes cross-validation quite time consuming, as
it takes x + 1 times as long as fitting a single model (where x is the
number of cross-validation folds), but it is essential.
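The next exercise relies on `caret::createFolds` to create such
folds. As a sketch of the underlying idea, a naive base R equivalent
(which, unlike `createFolds`, does not stratify the folds by class)
could look like this:

```r
## Naive 10-fold assignment: permute the fold ids 1 to 10
n <- nrow(giris2_labelled)
fold_id <- sample(rep(1:10, length.out = n))
naive_folds <- split(seq_len(n), fold_id)
lengths(naive_folds) ## roughly n/10 indices per fold
```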
`r msmbstyle::question_begin()`
Use the `createFolds` function from the `caret` package, passing it
the labels, to create 10 folds. Verify that each sample is present in
exactly one fold.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
folds <- caret::createFolds(giris2_labelled[[5]], k = 10)
folds
all(sort(unlist(folds)) == 1:150)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Focusing now on k=11 and using the folds above, calculate the
classification accuracy for each fold and compute the average accuracy
for k=11.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
knn_accuracy <- function(fld, k) {
    ## Uses global variables!
    train <- giris2_labelled[-fld, ]
    test <- giris2_labelled[fld, ]
    test_k <- knn(train[, -5],
                  test[, -5],
                  train[, 5],
                  k)
    sum(test_k == test[, 5])/length(test_k)
}
accs <- sapply(folds, knn_accuracy, k = 11)
accs
mean(accs)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Repeat the above for k=1, 3, 5, ... to 30 and plot the average
accuracy as a function of k.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.cap = "Average classification accuracy as a function of the parameter k."}
accs <- tibble(k = seq(1, 30, 2)) %>%
    rowwise() %>%
    mutate(acc = mean(sapply(folds, knn_accuracy, k))) %>%
    ungroup()
accs
accs %>%
    ggplot(aes(x = k, y = acc)) +
    geom_point() +
    geom_line()
```
`r msmbstyle::solution_end()`
As we can see from the last exercise, most values of k give a good
classification accuracy, ranging from 96% to 98%, and it would be
difficult to identify a single best value for k. Indeed, the three
groups we are trying to assign new data to are relatively well
separated, and the little uncertainty that we observe is most likely
due to some instances that lie at the boundary between classes B and C.
There exist a variety of packages (for example `caret`, `tidymodels`
or `mlr`) that automate model and parameter selection using
cross-validation procedures as illustrated above. This becomes
particularly useful when more than one model parameter (so-called
hyper-parameters) needs to be tuned, different classifiers need to be
assessed, and/or several model performance metrics are to be computed.
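As an illustration, a cross-validated scan over k with `caret`'s
`train` function could look like the sketch below (the tuning grid and
the number of folds are arbitrary choices):

```r
library("caret")
knn_fit <- train(x = giris2_labelled[, -5],
                 y = giris2_labelled[, 5],
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = data.frame(k = seq(1, 30, 2)))
knn_fit$bestTune ## the value of k with the best cross-validated accuracy
```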
## Variance and bias trade-off

A recurrent goal in machine learning is to find a trade-off between
variance and bias when fitting models. It is important to build a good
model using the data at hand, i.e. one that learns from the data, but
not too much. A model that is too specific for the data that was used
to build it is said to be **over-fitting** the data; it will not
generalise well to new data (it has high variance). On the other hand,
a model that is too general, that doesn't adapt closely enough to the
data, or type of data, at hand[^varex], is said to be
**under-fitting**; it will perform poorly on average (it has high
bias).
[^varex]: In our example, this would equate to the expression of the
    four genes of interest in a specific population of cancer
    patients.
This variance and bias trade-off can be illustrated in different
ways. The figure below (reproduced with permission from [@ISLR]) shows
the prediction error as a function of the model complexity. In our
example above, we used accuracy, a metric that we want to maximise,
while the prediction error is a measure to be minimised.
```{r, echo = FALSE, fig.cap = "Effect of model complexity on the prediction error (lower is better) on the training and testing data. The former decreases for lower values of $k$ (i.e. more complex models) while the test error reaches a minimum around $k = 10$ before increasing again. Reproduced with permission from James *et al.* (2014).", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/ISL-2_17.png")
```
The model complexity represents the ability of the model (or model
parameter) to learn/adapt to the training data. In our case, the
complexity is given by $\frac{1}{k}$: when k=1, the model becomes very
complex as it adapts to every single neighbour. At the other extreme,
when k becomes large, the classification tends towards an average over
so many neighbours that it no longer reflects any specificity of our
data.
```{r, echo = FALSE, fig.cap = "The kNN classifier using k=1 (left, solid classification boundaries) and k=100 (right, solid classification boundaries) compared to the Bayes decision boundaries (see original material for details). Reproduced with permission from James *et al.* (2014).", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/ISL-2_16.png")
```
This last figure, reproduced from [@MSMB], frames the variance-bias
trade-off as one between accuracy and precision. An over-fit model,
with low bias but high variance, is accurate but not precise: on
average its predictions are on target, but they fluctuate strongly
with the training data, so it can work extremely well on the training
data yet miss on new data, thus lacking in generalisation power.
Conversely, an under-fit model, with high bias and low variance, is
precise but not accurate: it consistently returns similar answers, but
these are systematically off target. Ideally, we want models that
achieve good precision while remaining accurate when applied to new
data.
```{r, echo = FALSE, fig.cap = "Precision and accuracy when shooting.", fig.width = 10, fig.height = 10}
par(mar = c(0, 0, 0, 0))
plot(1:15, 1:15,
     type = "n", xaxt = "n", yaxt = "n",
     xlab = "", ylab = "")
##
plotrix::draw.circle(4, 11.5, 3, col = "steelblue")
plotrix::draw.circle(4, 11.5, 2, col = "red")
plotrix::draw.circle(4, 11.5, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 4.5, 1.2)
y <- rnorm(20, 11.5, 1.2)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(4, 15, "Accurate but not precise", cex = 1.3)
##
plotrix::draw.circle(12, 11.5, 3, col = "steelblue")
plotrix::draw.circle(12, 11.5, 2, col = "red")
plotrix::draw.circle(12, 11.5, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 12.75, 0.3)
y <- rnorm(20, 12.75, 0.3)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(12, 15, "Precise but not accurate", cex = 1.3)
##
plotrix::draw.circle(4, 4, 3, col = "steelblue")
plotrix::draw.circle(4, 4, 2, col = "red")
plotrix::draw.circle(4, 4, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 4, 0.3)
y <- rnorm(20, 4, 0.3)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(4, 7.5, "Accurate and precise", cex = 1.3)
##
plotrix::draw.circle(12, 4, 3, col = "steelblue")
plotrix::draw.circle(12, 4, 2, col = "red")
plotrix::draw.circle(12, 4, 1, col = "yellow")
set.seed(123)
x <- rnorm(20, 13, 1.2)
y <- rnorm(20, 5, 1.2)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(12, 7.5, "Neither precise nor accurate", cex = 1.3)
##
abline(v = 8)
abline(h = 8)
```
## Additional exercises

Load the `hyperLOPIT2015_se` data from the `pRolocdata` package. See
section \@ref(sec-dimred02) for details. The feature variable column
`markers` defines the proteins for which the sub-cellular localisation
is known - these are the labelled data. Those that are marked
`unknown` are of unknown localisation - these are the unlabelled data.
`r msmbstyle::question_begin()`
Perform a kNN classification using 5 nearest neighbours. How many
unlabelled data were assigned to each of the 14 classes?
`r msmbstyle::question_end()`
```{r echo = FALSE, include = FALSE}
suppressPackageStartupMessages(library("pRolocdata"))
data(hyperLOPIT2015_se)
library("class")
markers <- rowData(hyperLOPIT2015_se)$markers != "unknown"
unknown <- rowData(hyperLOPIT2015_se)$markers == "unknown"
knn5 <- knn(assay(hyperLOPIT2015_se)[markers, ],
            assay(hyperLOPIT2015_se)[unknown, ],
            rowData(hyperLOPIT2015_se)$markers[markers],
            k = 5)
table(knn5)
```
`r msmbstyle::question_begin()`
To assess the results of this classification, visualise the data on
two PCA plots, one before and one after classification.
You can use the following colour palette that uses grey for the
unlabelled data.
```r
cls <- c("#E41A1C", "#377EB8", "#238B45", "#FF7F00", "#FFD700",
         "#333333", "#00CED1", "#A65628", "#F781BF", "#984EA3",
         "#9ACD32", "#B0C4DE", "#00008A", "#8B795E", "#E0E0E060")
```
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
pca <- prcomp(assay(hyperLOPIT2015_se), scale. = TRUE)
library("factoextra")
cls <- c("#E41A1C", "#377EB8", "#238B45", "#FF7F00", "#FFD700",
         "#333333", "#00CED1", "#A65628", "#F781BF", "#984EA3",
         "#9ACD32", "#B0C4DE", "#00008A", "#8B795E", "#E0E0E060")
all_labels <- rowData(hyperLOPIT2015_se)$markers
all_labels[unknown] <- as.character(knn5)
fviz_pca_ind(pca, geom = "point",
             habillage = rowData(hyperLOPIT2015_se)$markers,
             palette = cls)
fviz_pca_ind(pca, geom = "point",
             habillage = all_labels,
             palette = cls)
```
`r msmbstyle::question_begin()`
Assess the choice of k=5 used above.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
hyper_preds <- assay(hyperLOPIT2015_se)[markers, ]
hyper_labels <- rowData(hyperLOPIT2015_se)[markers, "markers"]
folds <- caret::createFolds(hyper_labels, k = 10)
knn_accuracy <- function(fld, k = 5) {
    ## Uses global variables!
    train <- hyper_preds[-fld, ]
    test <- hyper_preds[fld, ]
    test_k <- knn(train, test,
                  hyper_labels[-fld],
                  k)
    sum(test_k == hyper_labels[fld])/length(test_k)
}
accs <- tibble(k = seq(1, 30, 2)) %>%
    rowwise() %>%
    mutate(acc = mean(sapply(folds, knn_accuracy, k))) %>%
    ungroup()
accs
accs %>%
    ggplot(aes(x = k, y = acc)) +
    geom_point() +
    geom_line()
```
`r msmbstyle::question_begin()`
Using a contingency table, compare your results with those that were
obtained using a *support vector machine*, a popular classifier that
was used in Christoforou *et al.* [@Christoforou:2016], and available
in the `svm.classification` feature variable.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
table(rowData(hyperLOPIT2015_se)$svm.classification[unknown],
      all_labels[unknown])
```
`r msmbstyle::question_begin()`
A classification algorithm will assign a class to any unlabelled
instance, even if it doesn't match any of the provided classes. This
is often the case in omics datasets such as the spatial proteomics
example above, where not all the sub-cellular niches are annotated or
known. We thus know that some of these assignments will be wrong from
a biological point of view.

Re-run the kNN classifier above, setting `prob = TRUE` to get the
proportion of votes in favour of the assigned class. These are stored
as an attribute named *prob*; read the `knn` manual page to learn how
to extract them from the `knn` output variable.

- Based on these vote proportions, which results do you consider the
  most reliable?
- Re-generate a PCA plot keeping the classification results for the
  most reliable assignments only, and setting the unreliable ones to
  `unknown`. Interpret this figure.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
knn5 <- knn(assay(hyperLOPIT2015_se)[markers, ],
            assay(hyperLOPIT2015_se)[unknown, ],
            rowData(hyperLOPIT2015_se)$markers[markers],
            k = 5,
            prob = TRUE)
knn_probs <- attr(knn5, "prob")
table(knn_probs)
## those that have at least 4/5
table(knn_probs >= 4/5)
## those that have 5/5
table(knn_probs == 1)
knn_res <- as.character(knn5)
knn_res[knn_probs != 1] <- "unknown"
table(knn_res)
```
```{r echo = FALSE, include = FALSE}
all_labels <- rowData(hyperLOPIT2015_se)$markers
all_labels[all_labels == "unknown"] <- knn_res
fviz_pca_ind(pca, geom = "point",
             habillage = all_labels,
             palette = cls)
```