# Supervised learning {#sec-sl}

The goal of this last machine learning chapter is supervised
learning, and in particular **classification**. In this chapter, you
will

- learn what classification is;
- learn about labelled, unlabelled, training and testing data;
- learn about and apply k-nearest neighbours, a simple non-parametric
  classifier;
- see why and how to use cross-validation;
- and learn about model complexity and generalisation.
## Introduction

Often, when faced with omics data, we have annotations for some of the
samples (or features) we have acquired:

- Among the hundreds of patient tumours that were assayed using RNA
  sequencing or microarrays, we know the grade of the tumour for about
  half. We want to predict the grade of the other half using the gene
  expression profiles.
- We have performed a spatial proteomics experiment such as in
  Christoforou *et al.* [@Christoforou:2016] (see section
  \@ref(sec-dimred02)) and know the sub-cellular localisation of some
  proteins. We want to predict the sub-cellular localisation of the
  other proteins using the protein profiles.

In both of these examples, the quantitative data are the data that we
want to use to predict properties of samples or features; these are
called the **predictors**. The grade of the samples in the first
example and the protein sub-cellular localisation in the second are
the **labels** that we have in some cases, and want to predict
otherwise. We can thus split our data in two parts, depending on
whether we have labels or want to predict them. The former data are
called **labelled**, and the latter **unlabelled**.
```{r, echo = FALSE, fig.cap = "In supervised learning, the data are split in labelled or unlabelled data. The same applies when some of the columns are labelled.", fig.width = 12, fig.height = 5}
par(mfrow = c(1, 2), mar = c(0, 1, 2, 0), oma = c(0, 0, 0, 0))
rWSBIM1322:::sl_input_rows(main = "All data")
text(rep(0.8, 8), (1:8) + 0.5, 1:8)
rWSBIM1322:::blank_plot()
title(main = "Labelled and unlabelled data")
rWSBIM1322:::quant_data()
rect(9.2, 8, 10.2, 9, col = "steelblue")
rect(9.2, 7, 10.2, 8, col = "steelblue")
rect(9.2, 6, 10.2, 7, col = "red")
rect(9.2, 5, 10.2, 6, col = "red")
rect(9.2, 4, 10.2, 5, col = "green")
text(rep(0.8, 8), (1:8) + 0.5,
     c(2, 4, 7, 5, 1, 6, 3, 8))
```
In the figure above, the labels represent categories that need to be
inferred from the predictors. This class of supervised learning is
called **classification**. Classification tasks are further split into
**binary classification**, when there are only two classes, and
**multi-class classification** when, like above, there are more than
two; the latter is a generalisation of the binary task. When the
annotations are continuous values, the situation is referred to as a
**regression** problem.
`r msmbstyle::question_begin()`
Load the `giris2` data from the `rWSBIM1322` package (requires version
>= 0.1.13). Identify the labelled and unlabelled data; how many are
there, respectively?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
library("rWSBIM1322")
data(giris2)
summary(giris2)
table(is.na(giris2$GRADE))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the `giris2` data on a PCA plot, highlighting the labels (or
absence thereof). From the visualisation, do you think classifying the
unlabelled data will be difficult or easy? Defend your answer.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
pca <- prcomp(giris2[, -5], scale = TRUE)
cls <- as.character(giris2$GRADE)
cls[is.na(cls)] <- "unknown"
factoextra::fviz_pca_ind(pca, col.ind = cls)
```
`r msmbstyle::solution_end()`
## K-nearest neighbours classifier
In this chapter, we'll use a simple but useful classification
algorithm, k-nearest neighbours (kNN), to classify the *giris2*
patients. We will use the `knn` function from the
`r CRANpkg("class")` package.

K-nearest neighbours works by directly measuring the (Euclidean)
distances between observations and inferring the class of unlabelled
data from the classes of their nearest neighbours. In the figure
below, the unlabelled instances *1* and *2* will be assigned classes
*A* (blue) and *B* (red) as their closest neighbours are blue and red,
respectively.
```{r knnex, echo=FALSE, fig.cap="Schematic illustrating the k nearest neighbours algorithm."}
p1 <- c(0, 0)
p2 <- c(0.7, 0.5)
x1 <- rbind(c(0.2, 0.2),
            c(-0.3, -0.8),
            c(-0.2, 1.3))
x2 <- rbind(c(1, 1),
            c(0.5, 0.7))
x3 <- c(1.5, -.9)
x <- rbind(p1, p2, x1, x2, x3)
col <- c("black", "black",
         rep("steelblue", 3),
         rep("red", 2),
         "darkgreen")
plot(x, pch = 19, col = col,
     cex = 5, xlab = "", ylab = "",
     xaxt = "n", yaxt = "n")
grid()
text(p1[1], p1[2], "1", col = "white", cex = 2)
text(p2[1], p2[2], "2", col = "white", cex = 2)
for (i in 1:3)
    segments(p1[1], p1[2],
             x1[i, 1], x1[i, 2],
             lty = "dotted",
             col = "steelblue")
segments(p2[1], p2[2],
         x1[1, 1], x1[1, 2],
         lty = "dotted",
         col = "steelblue")
for (i in 1:2)
    segments(p2[1], p2[2],
             x2[i, 1], x2[i, 2],
             lty = "dotted",
             col = "red")
legend("topright",
       legend = LETTERS[1:3],
       pch = 19,
       col = c("steelblue", "red", "darkgreen"),
       cex = 2,
       bty = "n")
```
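To make these mechanics concrete, here is a minimal sketch of a kNN
vote for a single unlabelled point, using base R only (the toy data
below mimic the figure and are made up for illustration):

```r
## Toy labelled data: two predictors, two classes
train_x <- rbind(c(0.2, 0.2), c(-0.3, -0.8), c(-0.2, 1.3),
                 c(1.0, 1.0), c(0.5, 0.7))
train_y <- c("A", "A", "A", "B", "B")
new_x <- c(0.7, 0.5) ## unlabelled instance 2 in the figure

## Euclidean distances from the new point to all labelled points
d <- sqrt(colSums((t(train_x) - new_x)^2))
## Majority vote among the k = 3 nearest neighbours
k <- 3
nn <- order(d)[1:k]
names(which.max(table(train_y[nn]))) ## "B", as in the figure
```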
Typically in machine learning, there are two clear steps, where one
first **trains** a model and then uses the model to **predict** new
outputs (class labels in this case). In kNN, these two steps are
combined into a single function call to `knn`.
`r msmbstyle::question_begin()`
Separate the `giris2` data into two new datasets, one containing the
labelled data that will be used to train the model and named
`giris2_labelled`, and a second one containing the unlabelled data called
`giris2_unlabelled`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
giris2_labelled <- giris2[!is.na(giris2$GRADE), ]
giris2_unlabelled <- giris2[is.na(giris2$GRADE), ]
```
`r msmbstyle::solution_end()`
The `knn` function takes, among others[^knnargs], the following arguments:
[^knnargs]: we will see additional ones later.
1. the labelled predictors, that will be used to *train* the model,
2. the unlabelled predictors, on which the model will be applied (see
below why this is called *test*),
3. the labels (the length of this vector must match the number of rows
of the labelled predictors).
`r msmbstyle::question_begin()`
- Apply the kNN classifier to the `giris2` data prepared in the
  previous exercise.
- What is the class of the output?
- How many of the unlabelled data have been assigned to the different
  grades?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
library("class")
knn_res <- knn(giris2_labelled[, -5],
               giris2_unlabelled[, -5],
               giris2_labelled[, 5])
class(knn_res)
table(knn_res)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the results of the kNN classification on a PCA plot and
interpret them by comparing with the first PCA plot above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.cap = "Visualisation of the labelled and unlabelled data before (top) and after (bottom) classification.", fig.height = 10}
giris2_res <- giris2
giris2_res[is.na(giris2$GRADE), 5] <- knn_res
p1 <- factoextra::fviz_pca_ind(pca, col.ind = cls)
p2 <- factoextra::fviz_pca_ind(pca, col.ind = giris2_res$GRADE)
library("patchwork")
p1 / p2
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
- Can you identify some questionable results? Explore the results for
  patient 167, who was assigned grade B.
- To do so, calculate the distances between sample 167 and all other
  labelled data.
- Compare the labels of its 15 closest labelled data points.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, message = FALSE}
## all pairwise distances
dd <- as.matrix(dist(giris2[, -5]))
## extract only those for sample 167
giris2$dist_167 <- dd[167, ]
## look at the 15 nearest labelled neighbours
library("tidyverse")
vote <- giris2 %>%
    head(150) %>% ## keep the labelled data (first 150 rows)
    arrange(dist_167) %>%
    head(15) %>%  ## the 15 nearest neighbours
    mutate(vote = cumsum(ifelse(GRADE == "C", 1, -1))) %>%
    mutate(grade = ifelse(vote <= 0, "B", "C"))
vote
vote %>%
    ggplot(aes(x = 1:15, y = vote)) +
    xlab("Number of nearest neighbours") +
    geom_point(aes(colour = grade), size = 3) + geom_line() +
    geom_hline(yintercept = 0)
```
`r msmbstyle::solution_end()`
As we have seen, the number of nearest neighbours *k* has an important
influence on the classification results. We can now refine our
understanding of the `knn` function; it has the following arguments:
1. the labelled predictors, that will be used to *train* the model,
2. the unlabelled predictors, on which the model will be applied (see
below why this is called *test*),
3. the labels (the length of this vector must match the number of rows
of the labelled predictors),
4. the number of neighbours to use,
5. the `prob` argument: when set to `TRUE`, the proportion of votes in
   favour of the assigned class is also returned.
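As an illustration, a call using all five arguments could look like
this (a sketch; the choice of `k = 5` here is arbitrary):

```r
knn_k5 <- knn(giris2_labelled[, -5],   ## labelled predictors (training)
              giris2_unlabelled[, -5], ## unlabelled predictors (test)
              giris2_labelled[, 5],    ## the training labels
              k = 5,                   ## number of neighbours to use
              prob = TRUE)             ## return the vote proportions
## proportion of votes for the winning class, one value per prediction
head(attr(knn_k5, "prob"))
```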
`r msmbstyle::question_begin()`
- Repeat the kNN search comparing k=1 (as in our first attempt, since
  `k = 1` is the default) and k=11, and compare the results.
- Which one is correct?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
knn_1 <- knn(giris2_labelled[, -5], giris2_unlabelled[, -5],
             giris2_labelled[, 5], k = 1)
knn_11 <- knn(giris2_labelled[, -5], giris2_unlabelled[, -5],
              giris2_labelled[, 5], k = 11)
table(knn_1, knn_11)
```
`r msmbstyle::solution_end()`
## Model selection
There is no way to decide which of the sets of results above (using
k=1, 2, ..., 11 or any other value of k) is better, at least not while
proceeding as above. To make an informed decision about the **model
parameter** k, we need to measure the **performance** of the kNN
classifier for different values of k. To do so, we are going to create
**training** and **testing** sets using the labelled data, each
composed of a certain proportion of the original labelled data.
Below, we denote the training predictors $X_{tr}$ and labels $Y_{tr}$ and
the testing predictors $X_{te}$ and labels $Y_{te}$.
```{r, echo = FALSE, fig.cap = "Training and testing sets."}
par(mar = c(1, 1, 1, 1))
rWSBIM1322:::blank_plot()
rect(1, 1, 8, 4, col = "orange", border = NA)
rect(8.2, 1, 9.2, 4, col = "orange", border = NA)
rect(1, 4.2, 8, 9, col = "steelblue", border = NA)
rect(8.2, 4.2, 9.2, 9, col = "steelblue", border = NA)
rect(1, 1, 8, 9, lwd = 2)
text(4.5, 9.5, "X", cex = 3)
rect(8.2, 1, 9.2, 9, lwd = 2)
text(8.7, 9.5, "Y", cex = 3)
text(4.5, 7, expression(X[tr]), cex = 2)
text(4.5, 2.5, expression(X[te]), cex = 2)
text(8.7, 7, expression(Y[tr]), cex = 2)
text(8.7, 2.5, expression(Y[te]), cex = 2)
text(9.6, 6.5, expression(training), srt = 90, cex = 1.5)
text(9.6, 2.5, expression(testing), srt = 90, cex = 1.5)
```
We are now going to apply the following procedure:
1. hide the testing labels $Y_{te}$,
2. train a classifier model using $X_{tr}$ and $Y_{tr}$,
3. apply it on $X_{te}$ to obtain a new $Y_{te}^{predicted}$, and
4. compare $Y_{te}$ to $Y_{te}^{predicted}$.
There are numerous ways to measure the performance of a classifier
using $Y_{te}$ and $Y_{te}^{predicted}$. We are going to focus on the
**classification accuracy**, i.e. the proportion of correct results,
as illustrated in the sketch below.
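A minimal sketch of the accuracy computation, using two made-up
vectors of predicted and true labels:

```r
predicted <- c("A", "B", "B", "A", "C")
truth     <- c("A", "B", "C", "A", "C")
mean(predicted == truth) ## 4 correct out of 5, i.e. an accuracy of 0.8
```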
`r msmbstyle::question_begin()`
Choose 100 random labelled data points to define the training
data. Use the 50 others as test data.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
set.seed(1)
i <- sample(150, 100)
giris2_train <- giris2_labelled[i, ]
giris2_test <- giris2_labelled[-i, ]
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Run two kNN classifications, one with k = 1, then one with k = 11 and
compare each to the true results. A good way to assess the results is
to generate a contingency table (using the `table` function) that
tallies matches and mis-matches between the predictions and expected
assignments. Interpret these results.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
test_1 <- knn(giris2_train[, -5], giris2_test[, -5],
              giris2_train[, 5], k = 1)
test_11 <- knn(giris2_train[, -5], giris2_test[, -5],
               giris2_train[, 5], k = 11)
table(giris2_test[, 5], test_1)
table(giris2_test[, 5], test_11)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Calculate the classification accuracy of the two classifiers above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
sum(test_1 == giris2_test[, 5])/length(test_1)
sum(test_11 == giris2_test[, 5])/length(test_11)
```
`r msmbstyle::solution_end()`
There are now two adjustments that we can make to improve on the
procedure above:

1. Test all odd values of k from 1 to 11 (or higher, if deemed
   useful), to explore the model parameter at a finer granularity.
2. Test more than one training/testing split. By relying on a single
   random split, we depend on one random configuration, which could
   affect the results either overly optimistically or overly
   pessimistically. We prefer to repeat the split a certain number of
   times and calculate an average performance over all splits.
## k-fold cross-validation

There exist several principled ways to split data into training and
testing partitions. Here, we are going to focus on k-fold
cross-validation, where data of size $n$ are repeatedly split into a
training set of size around $\frac{n (k - 1)}{k}$ and a testing set of
size around $\frac{n}{k}$. In practice, the data are split into $k$
partitions of size $\frac{n}{k}$, and the training/testing procedure
is repeated $k$ times, using $k - 1$ partitions as training data and 1
partition as testing data.
The figure below illustrates the cross-validation procedure, creating
3 folds. One would typically do a 10-fold cross-validation (if the
size of the data permits it). We split the data into 3 random and
complementary folds, so that each data point appears in exactly one
fold. This leads to a total test set size that is identical to the
size of the full annotated dataset, but composed of out-of-sample
predictions (i.e. a sample is never used for training and testing in
the same iteration).
```{r, echo = FALSE, fig.cap = "The data is split into 3 folds. At each training/testing iteration, a different fold is used as test partition (white) and the two other ones (blue) are used to train the model.", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/xval.png")
```
After cross-validation, all models used within each fold are
discarded, and a new model is built using the whole labelled dataset,
with the best model parameter(s), i.e. those that generalised best
over all folds. This makes cross-validation quite time consuming, as
it takes x + 1 times as long as fitting a single model (where x is the
number of cross-validation folds), but it is essential.
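The next exercise relies on `caret::createFolds` to create such
folds. As a sketch of the underlying idea, a naive base R equivalent
(which, unlike `createFolds`, does not stratify the folds by class)
could look like this:

```r
## Naive 10-fold assignment: permute the fold ids 1 to 10
n <- nrow(giris2_labelled)
fold_id <- sample(rep(1:10, length.out = n))
naive_folds <- split(seq_len(n), fold_id)
lengths(naive_folds) ## roughly n/10 indices per fold
```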
`r msmbstyle::question_begin()`
Use the `createFolds` function from the `caret` package, passing it
the labels, to create 10 folds. Verify that each sample is present in
exactly one fold.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
folds <- caret::createFolds(giris2_labelled[[5]], k = 10)
folds
all(sort(unlist(folds)) == 1:150)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Focusing now on k=11 and using the folds above, calculate the
classification accuracy for each fold and compute the average accuracy
for k=11.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
knn_accuracy <- function(fld, k) {
    ## Uses global variables!
    train <- giris2_labelled[-fld, ]
    test <- giris2_labelled[fld, ]
    test_k <- knn(train[, -5],
                  test[, -5],
                  train[, 5],
                  k)
    sum(test_k == test[, 5])/length(test_k)
}
accs <- sapply(folds, knn_accuracy, k = 11)
accs
mean(accs)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Repeat the above for k=1, 3, 5, ... to 30 and plot the average
accuracy as a function of k.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.cap = "Average classification accuracy as a function of the parameter k."}
accs <- tibble(k = seq(1, 30, 2)) %>%
    rowwise() %>%
    mutate(acc = mean(sapply(folds, knn_accuracy, k))) %>%
    ungroup()
accs
accs %>%
    ggplot(aes(x = k, y = acc)) +
    geom_point() +
    geom_line()
```
`r msmbstyle::solution_end()`
As we can see from the last exercise, most values of k give a good
classification accuracy, ranging from 96% to 98%, and it would be
difficult to identify a single best value for k. Indeed, the three
groups we are trying to assign new data to are relatively well
separated, and the little uncertainty that we observe is most likely
due to some instances that lie at the boundary between classes B and C.
There exist a variety of packages (for example `caret`, `tidymodels`
or `mlr`) that automate model and parameter selection using
cross-validation procedures as illustrated above. This becomes
particularly useful when more than one model parameter (so-called
hyper-parameters) needs to be tuned, different classifiers need to be
assessed, and/or several model performance metrics are to be computed.
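As an illustration, a cross-validated scan over k with `caret`'s
`train` function could look like the sketch below (the tuning grid and
the number of folds are arbitrary choices):

```r
library("caret")
knn_fit <- train(x = giris2_labelled[, -5],
                 y = giris2_labelled[, 5],
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = data.frame(k = seq(1, 30, 2)))
knn_fit$bestTune ## the value of k with the best cross-validated accuracy
```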
## Variance and bias trade-off

A recurrent goal in machine learning is to find a trade-off between
variance and bias when fitting models. It is important to build a good
model using the data at hand, i.e. one that learns from the data, but
not too much. A model that is too specific for the data that was used
to build it is said to be **over-fitting** the data; it will not
generalise well to new data (it has high variance). On the other hand,
a model that is too general, that doesn't adapt closely enough to the
data, or type of data, at hand[^varex], is said to be
**under-fitting**; it will perform poorly on average (it has high
bias).
[^varex]: In our example, this would equate to the expression of the
    four genes of interest in a specific population of cancer
    patients.
This variance and bias trade-off can be illustrated in different
ways. The figure below (reproduced with permission from [@ISLR]) shows
the prediction error as a function of the model complexity. In our
example above, we used accuracy, a metric that we want to maximise,
while the prediction error is a measure to be minimised.
```{r, echo = FALSE, fig.cap = "Effect of model complexity on the prediction error (lower is better) on the training and testing data. The former decreases for lower values of $k$ (i.e. more complex models) while the test error reaches a minimum around $k = 10$ before increasing again. Reproduced with permission from James *et al.* (2014).", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/ISL-2_17.png")
```
The model complexity represents the ability of the model (or model
parameter) to learn/adapt to the training data. In our case, the
complexity is given by $\frac{1}{k}$: when k=1, the model becomes very
complex as it adapts to every single neighbour. At the other extreme,
when k becomes large, the classification tends towards an average over
so many neighbours that it no longer reflects any specificity of our
data.
```{r, echo = FALSE, fig.cap = "The kNN classifier using k=1 (left, solid classification boundaries) and k=100 (right, solid classification boundaries) compared to the Bayes decision boundaries (see original material for details). Reproduced with permission from James *et al.* (2014).", out.width='100%', fig.align='center'}
knitr::include_graphics("./figs/ISL-2_16.png")
```
This last figure, reproduced from [@MSMB], frames the variance-bias
trade-off as one between accuracy and precision. An over-fit model,
with low bias but high variance, is accurate but not precise: on
average its predictions are on target, but they fluctuate strongly
with the training data, so it can work extremely well on the training
data yet miss on new data, thus lacking in generalisation power.
Conversely, an under-fit model, with high bias and low variance, is
precise but not accurate: it consistently returns similar answers, but
these are systematically off target. Ideally, we want models that
achieve good precision while remaining accurate when applied to new
data.
```{r, echo = FALSE, fig.cap = "Precision and accuracy when shooting.", fig.width = 10, fig.height = 10}
par(mar = c(0, 0, 0, 0))
plot(1:15, 1:15,
     type = "n", xaxt = "n", yaxt = "n",
     xlab = "", ylab = "")
##
plotrix::draw.circle(4, 11.5, 3, col = "steelblue")
plotrix::draw.circle(4, 11.5, 2, col = "red")
plotrix::draw.circle(4, 11.5, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 4.5, 1.2)
y <- rnorm(20, 11.5, 1.2)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(4, 15, "Accurate but not precise", cex = 1.3)
##
plotrix::draw.circle(12, 11.5, 3, col = "steelblue")
plotrix::draw.circle(12, 11.5, 2, col = "red")
plotrix::draw.circle(12, 11.5, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 12.75, 0.3)
y <- rnorm(20, 12.75, 0.3)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(12, 15, "Precise but not accurate", cex = 1.3)
##
plotrix::draw.circle(4, 4, 3, col = "steelblue")
plotrix::draw.circle(4, 4, 2, col = "red")
plotrix::draw.circle(4, 4, 1, col = "yellow")
set.seed(12)
x <- rnorm(20, 4, 0.3)
y <- rnorm(20, 4, 0.3)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(4, 7.5, "Accurate and precise", cex = 1.3)
##
plotrix::draw.circle(12, 4, 3, col = "steelblue")
plotrix::draw.circle(12, 4, 2, col = "red")
plotrix::draw.circle(12, 4, 1, col = "yellow")
set.seed(123)
x <- rnorm(20, 13, 1.2)
y <- rnorm(20, 5, 1.2)
points(x, y)
points(mean(x), mean(y), pch = 19)
text(12, 7.5, "Neither precise nor accurate", cex = 1.3)
##
abline(v = 8)
abline(h = 8)
```
## Additional exercises

Load the `hyperLOPIT2015_se` data from the `pRolocdata` package. See
section \@ref(sec-dimred02) for details. The feature variable column
`markers` defines the proteins for which the sub-cellular localisation
is known - these are the labelled data. Those that are marked
`unknown` are of unknown localisation - these are the unlabelled data.
`r msmbstyle::question_begin()`
Perform a kNN classification using 5 nearest neighbours. How many
unlabelled data were assigned to each of the 14 classes?
`r msmbstyle::question_end()`
```{r echo = FALSE, include = FALSE}
suppressPackageStartupMessages(library("pRolocdata"))
data(hyperLOPIT2015_se)
library("class")
markers <- rowData(hyperLOPIT2015_se)$markers != "unknown"
unknown <- rowData(hyperLOPIT2015_se)$markers == "unknown"
knn5 <- knn(assay(hyperLOPIT2015_se)[markers, ],
            assay(hyperLOPIT2015_se)[unknown, ],
            rowData(hyperLOPIT2015_se)$markers[markers],
            k = 5)
table(knn5)
```
`r msmbstyle::question_begin()`
To assess the results of this classification, visualise the data on
two PCA plots, one before and one after classification.
You can use the following colour palette that uses grey for the
unlabelled data.
```r
cls <- c("#E41A1C", "#377EB8", "#238B45", "#FF7F00", "#FFD700",
         "#333333", "#00CED1", "#A65628", "#F781BF", "#984EA3",
         "#9ACD32", "#B0C4DE", "#00008A", "#8B795E", "#E0E0E060")
```
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
pca <- prcomp(assay(hyperLOPIT2015_se), scale. = TRUE)
library("factoextra")
cls <- c("#E41A1C", "#377EB8", "#238B45", "#FF7F00", "#FFD700",
         "#333333", "#00CED1", "#A65628", "#F781BF", "#984EA3",
         "#9ACD32", "#B0C4DE", "#00008A", "#8B795E", "#E0E0E060")
all_labels <- rowData(hyperLOPIT2015_se)$markers
all_labels[unknown] <- as.character(knn5)
fviz_pca_ind(pca, geom = "point",
             habillage = rowData(hyperLOPIT2015_se)$markers,
             palette = cls)
fviz_pca_ind(pca, geom = "point",
             habillage = all_labels,
             palette = cls)
```
`r msmbstyle::question_begin()`
Assess the choice of k=5 used above.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
hyper_preds <- assay(hyperLOPIT2015_se)[markers, ]
hyper_labels <- rowData(hyperLOPIT2015_se)[markers, "markers"]
folds <- caret::createFolds(hyper_labels, k = 10)
knn_accuracy <- function(fld, k = 5) {
    ## Uses global variables!
    train <- hyper_preds[-fld, ]
    test <- hyper_preds[fld, ]
    test_k <- knn(train, test,
                  hyper_labels[-fld],
                  k)
    sum(test_k == hyper_labels[fld])/length(test_k)
}
accs <- tibble(k = seq(1, 30, 2)) %>%
    rowwise() %>%
    mutate(acc = mean(sapply(folds, knn_accuracy, k))) %>%
    ungroup()
accs
accs %>%
    ggplot(aes(x = k, y = acc)) +
    geom_point() +
    geom_line()
```
`r msmbstyle::question_begin()`
Using a contingency table, compare your results with those that were
obtained using a *support vector machine*, a popular classifier that
was used in Christoforou *et al.* [@Christoforou:2016], and available
in the `svm.classification` feature variable.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
table(rowData(hyperLOPIT2015_se)$svm.classification[unknown],
      all_labels[unknown])
```
`r msmbstyle::question_begin()`
A classification algorithm will assign a class to any unlabelled
instance, even if it doesn't match any of the provided classes. This
is often the case in omics datasets such as the spatial proteomics
example above, where not all the sub-cellular niches are annotated or
known. We thus know that some of these assignments will be wrong from
a biological point of view.

Re-run the kNN classifier above, setting `prob = TRUE` to get the
proportion of votes in favour of the assigned class. These are stored
as an attribute named *prob*; read the `knn` manual page to learn how
to extract them from the `knn` output variable.

- Based on these vote proportions, which results do you consider the
  most reliable?
- Re-generate a PCA plot keeping the classification results for the
  most reliable assignments only, and setting the unreliable ones to
  `unknown`. Interpret this figure.
`r msmbstyle::question_end()`
```{r, echo = FALSE, include = FALSE}
knn5 <- knn(assay(hyperLOPIT2015_se)[markers, ],
            assay(hyperLOPIT2015_se)[unknown, ],
            rowData(hyperLOPIT2015_se)$markers[markers],
            k = 5,
            prob = TRUE)
knn_probs <- attr(knn5, "prob")
table(knn_probs)
## those that have at least 4/5
table(knn_probs >= 4/5)
## those that have 5/5
table(knn_probs == 1)
knn_res <- as.character(knn5)
knn_res[knn_probs != 1] <- "unknown"
table(knn_res)
```
```{r echo = FALSE, include = FALSE}
all_labels <- rowData(hyperLOPIT2015_se)$markers
all_labels[all_labels == "unknown"] <- knn_res
fviz_pca_ind(pca, geom = "point",
             habillage = all_labels,
             palette = cls)
```