# Modern classical statistics {#modchapter}
"Modern classical" may sound like a contradiction, but it is in fact anything but. Classical statistics covers topics like estimation, quantification of uncertainty, and hypothesis testing - all of which are at the heart of data analysis. Since the advent of modern computers, much has happened in this field that has yet to make it to the standard textbooks of introductory courses in statistics. This chapter attempts to bridge part of that gap by dealing with those classical topics, but with a modern approach that uses more recent advances in statistical theory and computational methods. Particular focus is put on how simulation can be used for analyses and for evaluating the properties of statistical procedures.
Whenever it is feasible, our aim in this chapter and the next is to:
* Use hypothesis tests that are based on permutations or the bootstrap rather than tests based on strict assumptions about the distribution of the data or asymptotic distributions,
* Complement estimates and hypothesis tests with confidence intervals based on sound methods (including the bootstrap),
* Offer easy-to-use Bayesian methods as an alternative to frequentist tools.
After reading this chapter, you will be able to use R to:
* Generate random numbers,
* Perform simulations to assess the performance of statistical methods,
* Perform hypothesis tests,
* Compute confidence intervals,
* Make sample size computations,
* Report statistical results.
## Simulation and distributions {#simulation}
A _random variable_\index{random variable} is a variable whose value describes the outcome of a random phenomenon. A (probability) _distribution_\index{distribution} is a mathematical function that describes the probability of different outcomes for a random variable. Random variables and distributions are at the heart of probability theory and most, if not all, statistical models.
As we shall soon see, they are also invaluable tools when evaluating statistical methods. A key component of modern statistical work is _simulation_\index{simulation}, in which we generate artificial data that can be used both in the analysis of real data (e.g. in permutation tests and bootstrap confidence intervals, topics that we'll explore in this chapter) and for assessing different methods. Simulation is possible only because we can generate random numbers, so let's begin by having a look at how we can generate random numbers in R.
### Generating random numbers
The function `sample`\index{\texttt{sample}} can be used to randomly draw a number of elements from a vector. For instance, we can use it to draw 2 random numbers from the first ten integers: $1, 2, \ldots, 9, 10$:
```{r eval=FALSE}
sample(1:10, 2)
```
Try running the above code multiple times. You'll get different results each time, because each time it runs the random number generator is in a different _state_. In most cases, this is desirable (if the results were the same each time we used `sample`, it wouldn't be random), but not if we want to replicate a result at some later stage.
When we are concerned about reproducibility, we can use `set.seed`\index{\texttt{set.seed}} to fix the state of the random number generator:
```{r eval=FALSE}
# Each run generates different results:
sample(1:10, 2); sample(1:10, 2)
# To get the same result each time, set the seed to a
# number of your choice:
set.seed(314); sample(1:10, 2)
set.seed(314); sample(1:10, 2)
```
We often want to use simulated data from a probability distribution, such as the normal distribution. The normal distribution is defined by its mean $\mu$ and its variance $\sigma^2$ (or, equivalently, its standard deviation $\sigma$). There are special functions for generating data from different distributions - for the normal distribution it is called `rnorm`. We specify the number of observations that we want to generate (`n`) and the parameters of the distribution (the mean `mean` and the standard deviation `sd`):
```{r eval=FALSE}
rnorm(n = 10, mean = 2, sd = 1)
# A shorter version:
rnorm(10, 2, 1)
```
Similarly, there are functions that can be used to compute the quantile function, density function, and cumulative distribution function (CDF) of the normal distribution. Here are some examples for a normal distribution with mean 2 and standard deviation 1:
```{r eval=FALSE}
qnorm(0.9, 2, 1) # Upper 90 % quantile of distribution
dnorm(2.5, 2, 1) # Density function f(2.5)
pnorm(2.5, 2, 1) # Cumulative distribution function F(2.5)
```
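The quantile function and the CDF are inverses of each other, which we can verify numerically:
```{r eval=FALSE}
pnorm(qnorm(0.9, 2, 1), 2, 1)  # Returns 0.9
qnorm(pnorm(2.5, 2, 1), 2, 1)  # Returns 2.5
```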
$$\sim$$
```{exercise, label="ch7exc1"}
Sampling can be done with or without _replacement_. If replacement is used, an observation can be drawn more than once. Check the documentation for `sample`. How can you change the settings to sample with replacement? Draw 5 random numbers from the first ten integers, with replacement.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions1)
### Some common distributions {#distfunctions}
Next, we provide the syntax for random number generation, quantile functions, density/probability functions and cumulative distribution functions for some of the most commonly used distributions. This section is mainly intended as a reference, for you to look up when you need to use one of these distributions - so there is no need to run all the code chunks below right now.
Normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$\index{distribution!normal}\index{\texttt{rnorm}}:
```{r eval=FALSE}
rnorm(n, mu, sigma) # Generate n random numbers
qnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dnorm(x, mu, sigma) # Density function f(x)
pnorm(x, mu, sigma) # Cumulative distribution function F(X)
```
Continuous uniform distribution $U(a,b)$ on the interval $(a,b)$, with mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$\index{distribution!uniform}\index{\texttt{runif}}:
```{r eval=FALSE}
runif(n, a, b) # Generate n random numbers
qunif(0.95, a, b) # Upper 95 % quantile of distribution
dunif(x, a, b) # Density function f(x)
punif(x, a, b) # Cumulative distribution function F(X)
```
Exponential distribution $Exp(m)$ with mean $m$ and variance $m^2$\index{distribution!exponential}\index{\texttt{rexp}}:
```{r eval=FALSE}
rexp(n, 1/m) # Generate n random numbers
qexp(0.95, 1/m) # Upper 95 % quantile of distribution
dexp(x, 1/m) # Density function f(x)
pexp(x, 1/m) # Cumulative distribution function F(X)
```
Gamma distribution $\Gamma(\alpha, \beta)$ with mean $\frac{\alpha}{\beta}$ and variance $\frac{\alpha}{\beta^2}$\index{distribution!gamma}\index{\texttt{rgamma}}:
```{r eval=FALSE}
rgamma(n, alpha, beta) # Generate n random numbers
qgamma(0.95, alpha, beta) # Upper 95 % quantile of distribution
dgamma(x, alpha, beta) # Density function f(x)
pgamma(x, alpha, beta) # Cumulative distribution function F(X)
```
Lognormal distribution $LN(\mu, \sigma^2)$ with mean $\exp(\mu+\sigma^2/2)$ and variance $(\exp(\sigma^2)-1)\exp(2\mu+\sigma^2)$\index{distribution!lognormal}\index{\texttt{rlnorm}}:
```{r eval=FALSE}
rlnorm(n, mu, sigma) # Generate n random numbers
qlnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dlnorm(x, mu, sigma) # Density function f(x)
plnorm(x, mu, sigma) # Cumulative distribution function F(X)
```
t-distribution $t(\nu)$ with mean 0 (for $\nu>1$) and variance $\frac{\nu}{\nu-2}$ (for $\nu>2$)\index{distribution!t}\index{\texttt{rt}}:
```{r eval=FALSE}
rt(n, nu) # Generate n random numbers
qt(0.95, nu) # Upper 95 % quantile of distribution
dt(x, nu) # Density function f(x)
pt(x, nu) # Cumulative distribution function F(X)
```
Chi-squared distribution $\chi^2(k)$ with mean $k$ and variance $2k$\index{distribution!$\chi^2$}\index{\texttt{rchisq}}:
```{r eval=FALSE}
rchisq(n, k) # Generate n random numbers
qchisq(0.95, k) # Upper 95 % quantile of distribution
dchisq(x, k) # Density function f(x)
pchisq(x, k) # Cumulative distribution function F(X)
```
F-distribution $F(d_1, d_2)$ with mean $\frac{d_2}{d_2-2}$ (for $d_2>2$) and variance $\frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$ (for $d_2>4$)\index{distribution!F}\index{\texttt{rf}}:
```{r eval=FALSE}
rf(n, d1, d2) # Generate n random numbers
qf(0.95, d1, d2) # Upper 95 % quantile of distribution
df(x, d1, d2) # Density function f(x)
pf(x, d1, d2) # Cumulative distribution function F(X)
```
Beta distribution\index{distribution!beta}\index{\texttt{rbeta}} $Beta(\alpha,\beta)$ with mean $\frac{\alpha}{\alpha+\beta}$ and variance $\frac{\alpha \beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$:
```{r eval=FALSE}
rbeta(n, alpha, beta) # Generate n random numbers
qbeta(0.95, alpha, beta) # Upper 95 % quantile of distribution
dbeta(x, alpha, beta) # Density function f(x)
pbeta(x, alpha, beta) # Cumulative distribution function F(X)
```
Binomial distribution $Bin(n,p)$ with mean $np$ and variance $np(1-p)$\index{distribution!binomial}\index{\texttt{rbinom}}:
```{r eval=FALSE}
rbinom(n, n, p) # Generate n random numbers (2nd argument: number of trials)
qbinom(0.95, n, p) # Upper 95 % quantile of distribution
dbinom(x, n, p) # Probability function f(x)
pbinom(x, n, p) # Cumulative distribution function F(X)
```
Poisson distribution $Po(\lambda)$ with mean $\lambda$ and variance $\lambda$\index{distribution!Poisson}\index{\texttt{rpois}}:
```{r eval=FALSE}
rpois(n, lambda) # Generate n random numbers
qpois(0.95, lambda) # Upper 95 % quantile of distribution
dpois(x, lambda) # Probability function f(x)
ppois(x, lambda) # Cumulative distribution function F(X)
```
Negative binomial distribution $NegBin(r, p)$ with mean $\frac{r(1-p)}{p}$ and variance $\frac{r(1-p)}{p^2}$ (R's parametrisation, where $p$ is the success probability and the variable counts the number of failures before the $r$th success)\index{distribution!negative binomial}\index{\texttt{rnbinom}}:
```{r eval=FALSE}
rnbinom(n, r, p) # Generate n random numbers
qnbinom(0.95, r, p) # Upper 95 % quantile of distribution
dnbinom(x, r, p) # Probability function f(x)
pnbinom(x, r, p) # Cumulative distribution function F(X)
```
Multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$:
```{r eval=FALSE}
library(MASS)
mvrnorm(n, mu, Sigma) # Generate n random numbers
```
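As a small illustration, the covariance matrix is supplied as an ordinary R matrix; the mean vector and covariance matrix below are arbitrary choices, not taken from any dataset:
```{r eval=FALSE}
library(MASS)
# Illustrative parameters for a bivariate normal distribution:
mu <- c(0, 2)
Sigma <- matrix(c(1,   0.5,
                  0.5, 2),
                nrow = 2, byrow = TRUE)
# Generate 5 random vectors (one per row):
mvrnorm(5, mu, Sigma)
```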
$$\sim$$
```{exercise, label="ch7exc2"}
Use `runif` and (at least) one of `round`, `ceiling` and `floor` to generate observations from a discrete random variable on the integers $1, 2, 3, 4, 5, 6, 7, 8, 9, 10$.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions2)
### Assessing distributional assumptions
So how can we know that the functions for generating random observations from distributions work? And when working with real data, how can we know what distribution fits the data? One answer is that we can visually compare the distribution of the generated (or real) data to the target distribution. This can for instance be done by comparing a histogram of the data to the target distribution's density function.
To do so, we must add `aes(y = ..density..)` to the call to `geom_histogram`, which rescales the histogram to have area 1 (just like a density function). We can then add the density function using `geom_function`\index{\texttt{geom\_function}}:
```{r eval=FALSE}
# Generate data from a normal distribution with mean 10 and
# standard deviation 1
generated_data <- data.frame(normal_data = rnorm(1000, 10, 1))
library(ggplot2)
# Compare to histogram:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_function(fun = dnorm, colour = "red", size = 2,
                args = list(mean = mean(generated_data$normal_data),
                            sd = sd(generated_data$normal_data)))
```
Try increasing the number of observations generated. As the number of observations increases, the histogram should start to look more and more like the density function.
We could also add a density estimate for the generated data, to further aid the eye here - we'd expect this to be close to the theoretical density function:
```{r eval=FALSE}
# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_density(colour = "blue", size = 2) +
  geom_function(fun = dnorm, colour = "red", size = 2,
                args = list(mean = mean(generated_data$normal_data),
                            sd = sd(generated_data$normal_data)))
```
If instead we wished to compare the distribution of the data to a $\chi^2$ distribution, we would change the value of `fun` and `args` in `geom_function` accordingly:
```{r eval=FALSE}
# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_density(colour = "blue", size = 2) +
  geom_function(fun = dchisq, colour = "red", size = 2,
                args = list(df = mean(generated_data$normal_data)))
```
Note that the values of `args` have changed. `args` should always be a list containing values for the parameters of the distribution: `mean` and `sd` for the normal distribution and `df` for the $\chi^2$ distribution (these are the argument names of the corresponding density functions from Section \@ref(distfunctions)).
Another option is to draw a quantile-quantile plot, or Q-Q plot for short, which compares the theoretical quantiles of a distribution to the empirical quantiles of the data, showing each observation as a point. If the data follows the theorised distribution, then the points should lie more or less along a straight line.
To draw a Q-Q plot for a normal distribution, we use the geoms `geom_qq` and `geom_qq_line`\index{\texttt{geom\_qq}}\index{\texttt{geom\_qq\_line}}:
```{r eval=FALSE}
# Q-Q plot for normality:
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq() + geom_qq_line()
```
For all other distributions, we must provide the quantile function of the distribution (many of which can be found in Section \@ref(distfunctions)):
```{r eval=FALSE}
# Q-Q plot for the lognormal distribution:
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq(distribution = qlnorm) +
  geom_qq_line(distribution = qlnorm)
```
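If you want to compare against specific parameter values rather than the quantile function's defaults, these can be passed as a list through the `dparams` argument of `geom_qq` and `geom_qq_line`. A sketch with arbitrary, purely illustrative parameter values:
```{r eval=FALSE}
# Q-Q plot for a lognormal distribution with meanlog = 2 and
# sdlog = 0.5 (illustrative values):
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq(distribution = qlnorm,
          dparams = list(meanlog = 2, sdlog = 0.5)) +
  geom_qq_line(distribution = qlnorm,
               dparams = list(meanlog = 2, sdlog = 0.5))
```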
Q-Q-plots can be a little difficult to read. There will always be points deviating from the line - in fact, that's expected. So how much must they deviate before we rule out a distributional assumption? Particularly when working with real data, I like to compare the Q-Q-plot of my data to Q-Q-plots of simulated samples from the assumed distribution, to get a feel for what kind of deviations can appear if the distributional assumption holds. Here's an example of how to do this, for the normal distribution:
```{r eval=FALSE}
# Look at solar radiation data for May from the airquality
# dataset:
May <- airquality[airquality$Month == 5,]
# Create a Q-Q-plot for the solar radiation data, and store
# it in a list:
qqplots <- list(ggplot(May, aes(sample = Solar.R)) +
                  geom_qq() + geom_qq_line() + ggtitle("Actual data"))
# Compute the sample size n, i.e. the number of non-missing
# solar radiation values:
n <- sum(!is.na(May$Solar.R))
# Generate 8 new datasets of size n from a normal distribution.
# Then draw Q-Q-plots for these and store them in the list:
for(i in 2:9)
{
  generated_data <- data.frame(normal_data = rnorm(n, 10, 1))
  qqplots[[i]] <- ggplot(generated_data, aes(sample = normal_data)) +
    geom_qq() + geom_qq_line() + ggtitle("Simulated data")
}
# Plot the resulting Q-Q-plots side-by-side:
library(patchwork)
(qqplots[[1]] + qqplots[[2]] + qqplots[[3]]) /
  (qqplots[[4]] + qqplots[[5]] + qqplots[[6]]) /
  (qqplots[[7]] + qqplots[[8]] + qqplots[[9]])
```
You can run the code several times, to get more examples of what Q-Q-plots can look like when the distributional assumption holds. In this case, the tail points in the Q-Q-plot for the solar radiation data deviate from the line more than the tail points in most simulated examples do, and personally, I'd be reluctant to assume that the data comes from a normal distribution.
$$\sim$$
```{exercise, label="ch7exc3"}
Investigate the sleeping times in the `msleep` data from the `ggplot2` package. Do they appear to follow a normal distribution? A lognormal distribution?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions3)
<br>
```{exercise, label="ch7exc4"}
Another approach to assessing distributional assumptions for real data is to use formal hypothesis tests. One example is the Shapiro-Wilk test for normality, available in `shapiro.test`\index{\texttt{shapiro.test}}. The null hypothesis is that the data comes from a normal distribution, and the alternative is that it doesn't (meaning that a low p-value is supposed to imply non-normality).
1. Apply `shapiro.test` to the sleeping times in the `msleep` dataset. According to the Shapiro-Wilk test, is the data normally distributed?
2. Generate 2,000 observations from a $\chi^2(100)$ distribution. Compare the histogram of the generated data to the density function of a normal distribution. Are they similar? What are the results when you apply the Shapiro-Wilk test to the data?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions4)
### Monte Carlo integration
In this chapter, we will use simulation to compute p-values and confidence intervals, to compare different statistical methods, and to perform sample size computations. Another important use of simulation is _Monte Carlo integration_, in which random numbers are used for numerical integration. It plays an important role in, for instance, statistical physics, computational biology, computational linguistics, and Bayesian statistics - fields that require the computation of complicated integrals.
To create an example of Monte Carlo integration, let's start by writing a function, `circle`, that defines a quarter-circle on the unit square. We will then plot it using the geom `geom_function`\index{\texttt{geom\_function}}:
```{r eval=FALSE}
circle <- function(x)
{
  return(sqrt(1-x^2))
}
ggplot(data.frame(x = c(0, 1)), aes(x)) +
  geom_function(fun = circle)
```
Let's say that we are interested in computing the area under the quarter-circle. We can highlight the area in our plot using `geom_area`\index{\texttt{geom\_area}}:
```{r eval=FALSE}
ggplot(data.frame(x = seq(0, 1, 1e-4)), aes(x)) +
  geom_area(aes(x = x,
                y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)),
            fill = "pink") +
  geom_function(fun = circle)
```
To find the area, we will generate a large number of random points uniformly in the unit square. By the law of large numbers, the proportion of points that end up under the quarter-circle should be close to the area under the quarter-circle^[In general, the proportion of points that fall below the curve will be proportional to the area under the curve _relative_ to the area of the sample space. In this case the sample space is the unit square, which has area 1, meaning that the relative area is the same as the absolute area.]. To do this, we generate 10,000 random values for the $x$ and $y$ coordinates of each point using the $U(0,1)$ distribution, that is, using `runif`:
```{r eval=FALSE}
B <- 1e4
unif_points <- data.frame(x = runif(B), y = runif(B))
```
Next, we add the points to our plot:
```{r eval=FALSE}
ggplot(unif_points, aes(x, y)) +
  geom_area(aes(x = x,
                y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)),
            fill = "pink") +
  geom_point(size = 0.5, alpha = 0.25,
             colour = ifelse(unif_points$x^2 + unif_points$y^2 <= 1,
                             "red", "black")) +
  geom_function(fun = circle)
```
Note the order in which we placed the geoms - we plot the points after the area so that the pink colour won't cover the points, and the function after the points so that the points won't cover the curve.
To estimate the area, we compute the proportion of points that are below the curve:
```{r eval=FALSE}
mean(unif_points$x^2 + unif_points$y^2 <= 1)
```
In this case, we can also compute the area exactly: $\int_0^1\sqrt{1-x^2}dx=\pi/4=0.7853\ldots$. For more complicated integrals, however, numerical integration methods like Monte Carlo integration may be required. That being said, there are better numerical integration methods for low-dimensional integrals like this one. Monte Carlo integration is primarily used for higher-dimensional integrals, where other techniques fail.
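If we want a quick sanity check of the Monte Carlo estimate, we can compare it to the exact value and to R's built-in adaptive quadrature routine `integrate`:
```{r eval=FALSE}
# Exact value of the integral:
pi / 4
# Numerical integration using adaptive quadrature:
integrate(circle, lower = 0, upper = 1)
```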
## Student's t-test revisited {#ttest}
For decades teachers all over the world have been telling the story of William Sealy Gosset: the head brewer at Guinness who derived the formulas used for the t-test and, following company policy, published the results under the pseudonym "Student".
Gosset's work was hugely important, but the passing of time has rendered at least parts of it largely obsolete. His distributional formulas were derived out of necessity: lacking the computer power that we have available to us today, he was forced to impose the assumption of normality on the data, in order to derive the formulas he needed to be able to carry out his analyses. Today we can use simulation to carry out analyses with fewer assumptions. As an added bonus, these simulation techniques often happen to result in statistical methods with better performance than Student's t-test and other similar methods.
### The old-school t-test {#oldschoolt}
The _really_ old-school way of performing a t-test - the way statistical pioneers like Gosset and Fisher would have done it - is to look up p-values using tables covering several pages. There haven't really been any excuses for doing that since the advent of the personal computer though, so let's not go further into that. The "modern" version of the old-school t-test uses numerical evaluation of the formulas for Student's t-distribution to compute p-values and confidence intervals. Before we delve into more modern approaches, let's look at how we can run an old-school t-test in R.\index{hypothesis test!t-test}
In Section \@ref(firstttest) we used `t.test`\index{\texttt{t.test}} to run a t-test to see if there is a difference in how long carnivores and herbivores sleep, using the `msleep` data from `ggplot2`^[Note that this is not a random sample of mammals, and so one of the fundamental assumptions behind the t-test isn't valid in this case. For the purpose of showing how to use the t-test, the data is good enough though.]. First, we extracted a subset of the data corresponding to carnivores and herbivores, and then we ran the test. There are in fact several different ways of doing this, and it is probably a good idea to have a look at them.
In the approach used in Section \@ref(firstttest) we created two vectors, using bracket notation, and then used those as arguments for `t.test`:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
t.test(carnivores$sleep_total, herbivores$sleep_total)
```
Alternatively, we could have used formula notation, as we did, for example, for the linear model in Section \@ref(firstlm). We'd then have to use the `data` argument in `t.test` to supply the data. By using `subset`, we can do the subsetting simultaneously:
```{r eval=FALSE}
t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Unless we are interested in keeping the vectors `carnivores` and `herbivores` for other purposes, this latter approach is arguably more elegant.
Speaking of elegance, the `data` argument also makes it easy to run a t-test using pipes. Here is an example, where we use `filter` from `dplyr` to do the subsetting:
```{r eval=FALSE}
library(dplyr)
msleep %>% filter(vore == "carni" | vore == "herbi") %>%
t.test(sleep_total ~ vore, data = .)
```
We could also use the `magrittr` pipe `%$%` from Section \@ref(morepipes) to pass the variables from the filtered subset of `msleep`, avoiding the `data` argument:
```{r eval=FALSE}
library(magrittr)
msleep %>% filter(vore == "carni" | vore == "herbi") %$%
t.test(sleep_total ~ vore)
```
There are even more options than this - the point I'm trying to make is that, like most functions in R, the functions for classical statistics can be used in many different ways. In what follows, I will show you one or two of these, but don't hesitate to try out other approaches if they seem better to you.
What we just did above was a two-sided t-test, where the null hypothesis was that there was no difference in means between the groups, and the alternative hypothesis that there was a difference. We can also perform one-sided tests using the `alternative` argument. `alternative = "greater"` means that the alternative is that the first group has a greater mean, and `alternative = "less"` means that the first group has a smaller mean. Here is an example with the former:
```{r eval=FALSE}
t.test(sleep_total ~ vore,
       data = subset(msleep, vore == "carni" | vore == "herbi"),
       alternative = "greater")
```
By default, R uses the Welch two-sample t-test, which does _not_ assume that the groups have equal variances. If you want to assume equal variances, you can add `var.equal = TRUE`:
```{r eval=FALSE}
t.test(sleep_total ~ vore,
       data = subset(msleep, vore == "carni" | vore == "herbi"),
       var.equal = TRUE)
```
In addition to two-sample t-tests, `t.test` can also be used for one-sample tests and paired t-tests. To perform a one-sample t-test, all we need to do is to supply a single vector with observations, along with the value of the mean $\mu$ under the null hypothesis. I usually sleep for about 7 hours each night, and so if I want to test whether that is true for an average mammal, I'd use the following:
```{r eval=FALSE}
t.test(msleep$sleep_total, mu = 7)
```
As we can see from the output, your average mammal sleeps for 10.4 hours per day. Moreover, the p-value is quite low - apparently, I sleep unusually little for a mammal!
As for paired t-tests, we can perform them by supplying two vectors (where element 1 of the first vector corresponds to element 1 of the second vector, and so on) and the argument `paired = TRUE`. For instance, using the `diamonds` data from `ggplot2`, we could run a test to see if the length `x` of diamonds with a fair quality of the cut on average equals the width `y`:
```{r eval=FALSE}
fair_diamonds <- subset(diamonds, cut == "Fair")
t.test(fair_diamonds$x, fair_diamonds$y, paired = TRUE)
```
$$\sim$$
```{exercise, label="ch7exc5"}
Load the VAS pain data `vas.csv` from Exercise \@ref(exr:ch3exc4). Perform a one-sided t-test of the null hypothesis that the average VAS among the patients during the time period is less than or equal to 6.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions5)
### Permutation tests
Maybe it was a little harsh to say that Gosset's formulas have become obsolete. The formulas are mathematical approximations to the distribution of the test statistics under the null hypothesis. The truth is that they work very well as long as your data is (nearly) normally distributed. The two-sample test also works well for non-normal data as long as you have balanced sample sizes, that is, equally many observations in both groups. However, for one-sample tests, and two-sample tests with imbalanced sample sizes, there are better ways to compute p-values and confidence intervals than to use Gosset's traditional formulas.\index{hypothesis test!permutation test}
The first option that we'll look at is permutation tests. Let's return to our mammal sleeping times example, where we wanted to investigate whether there are differences in how long carnivores and herbivores sleep on average:
```{r eval=FALSE}
t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
There are 19 carnivores and 32 herbivores - 51 animals in total. If there are no differences between the two groups, the `vore` labels offer no information about how long the animals sleep each day. Under the null hypothesis, the assignment of `vore` labels to different animals is therefore for all intents and purposes random. To find the distribution of the test statistic under the null hypothesis, we could look at all possible ways to assign 19 animals the label `carnivore` and 32 animals the label `herbivore`. That is, look at all permutations of the labels. The probability of a result at least as extreme as that obtained in our sample (in the direction of the alternative), i.e. the p-value, would then be the proportion of permutations that yield a result at least as extreme as that in our sample. This is known as a permutation test.
Permutation tests were known to the likes of Gosset and Fisher (Fisher's exact test is a common example), but because the number of permutations of labels tends to become very large (76,000 billion, in our carnivore-herbivore example), they lacked the means to actually use them. 76,000 billion permutations may be too many even today, but we can obtain very good approximations of the p-values of permutation tests using simulation.
The idea is that we look at a large number of randomly selected permutations and check for how many of them we obtain a test statistic that is more extreme than the sample test statistic. The law of large numbers guarantees that this proportion will converge to the permutation test p-value as the number of randomly selected permutations increases.
Let's have a go!
```{r eval=FALSE}
# Filter the data, to get carnivores and herbivores:
data <- subset(msleep, vore == "carni" | vore == "herbi")
# Compute the sample test statistic:
sample_t <- t.test(sleep_total ~ vore, data = data)$statistic
# Set the number of random permutations and create a vector to
# store the result in:
B <- 9999
permutation_t <- vector("numeric", B)
# Start progress bar:
pbar <- txtProgressBar(min = 0, max = B, style = 3)
# Compute the test statistic for B randomly selected permutations
for(i in 1:B)
{
  # Draw a permutation of the labels:
  data$vore <- sample(data$vore, length(data$vore),
                      replace = FALSE)
  # Compute statistic for permuted sample:
  permutation_t[i] <- t.test(sleep_total ~ vore,
                             data = data)$statistic
  # Update progress bar
  setTxtProgressBar(pbar, i)
}
close(pbar)
# In this case, with a two-sided alternative hypothesis, a
# "more extreme" test statistic is one that has a larger
# absolute value than the sample test statistic.
# Compute approximate permutation test p-value:
mean(abs(permutation_t) > abs(sample_t))
```
In this particular example, the resulting p-value is pretty close to that from the old-school t-test. However, we will soon see examples where the two versions of the t-test differ more.
You may ask why we used 9,999 permutations and not 10,000. The reason is that this way we avoid p-values that are exactly equal to traditional significance levels like 0.05 and 0.01. If we'd used 10,000 permutations, 500 of which yielded a statistic with a larger absolute value than the sample statistic, then the p-value would have been exactly 0.05, which would cause some difficulties in trying to determine whether or not the result was significant at the 5 % level. This cannot happen when we use 9,999 permutations instead (500 statistics with a larger absolute value yield the p-value $500/9999\approx 0.050005>0.05$, and 499 yield the p-value $499/9999\approx 0.0499<0.05$).
Having to write a `for` loop every time we want to run a t-test seems unnecessarily complicated. Fortunately, others have trodden this path before us. The `MKinfer`\index{\texttt{MKinfer}} package contains a function for (approximate) permutation t-tests, which also happens to be faster than our implementation above. Let's install it:
```{r eval=FALSE}
install.packages("MKinfer")
```
The function for the permutation t-test, `perm.t.test`\index{\texttt{perm.t.test}}\index{hypothesis test!t-test, permutation}, works exactly like `t.test`. In all the examples from Section \@ref(oldschoolt) we can replace `t.test` with `perm.t.test` to run a permutation t-test instead. Like so:
```{r eval=FALSE}
library(MKinfer)
perm.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Note that two p-values and confidence intervals are presented: one set from the permutations and one from the old-school approach - so make sure that you look at the right ones!
You may ask how many randomly selected permutations we need to get an accurate approximation of the permutation test p-value. By default, `perm.t.test` uses 9,999 permutations (you can change that number using the argument `R`), which is widely considered to be a reasonable number. If you are running a permutation test with a much more complex (and computationally intensive) statistic, you may have to use a lower number, but avoid that if you can.
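For instance, a version with 999 permutations could look as follows (fewer permutations gives a noisier approximation of the p-value):
```{r eval=FALSE}
perm.t.test(sleep_total ~ vore, data =
              subset(msleep, vore == "carni" | vore == "herbi"),
            R = 999)
```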
### The bootstrap
A popular method for computing p-values and confidence intervals that resembles the permutation approach is the bootstrap. Instead of drawing permuted samples, new observations are drawn with replacement from the original sample, and then labels are randomly allocated to them. That means that each randomly drawn sample will differ not only in the permutation of labels, but also in what observations are included - some may appear more than once and some not at all.
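To make the resampling step concrete, here is a minimal sketch of drawing a single bootstrap sample from a vector using `sample` with `replace = TRUE`; some observations may appear several times and others not at all:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
# Draw one bootstrap sample of the same size as the original:
boot_sample <- sample(carnivores$sleep_total,
                      size = nrow(carnivores),
                      replace = TRUE)
# Compare the original and resampled means:
mean(carnivores$sleep_total)
mean(boot_sample)
```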
We will have a closer look at the bootstrap in Section \@ref(bootstrap), where we will learn how to use it for creating confidence intervals and computing p-values for any test statistic. For now, we'll just note that `MKinfer` offers a bootstrap version of the t-test, `boot.t.test` \index{\texttt{boot.t.test}}\index{hypothesis test!t-test, bootstrap}:
```{r eval=FALSE}
library(MKinfer)
boot.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Both `perm.t.test` and `boot.t.test` have a useful argument called `symmetric`, the details of which are discussed in depth in Section \@ref(twotypesofpvalues).
### Saving the output
When we run a t-test, the results are printed in the Console. But we can also store the results in a variable, which allows us to access e.g. the p-value of the test:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
test_result <- t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
test_result
```
What does the resulting object look like?
```{r eval=FALSE}
str(test_result)
```
As you can see, `test_result` is a `list` containing different parameters and vectors for the test. To get the p-value, we can run the following:
```{r eval=FALSE}
test_result$p.value
```
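Other components can be extracted in the same way. For instance, the confidence interval and the estimated group means are stored in the `conf.int` and `estimate` elements:
```{r eval=FALSE}
# Confidence interval for the difference in means:
test_result$conf.int
# Estimated group means:
test_result$estimate
```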
### Multiple testing {#multipletesting}
Some programming tools from Section \@ref(loopsection) can be of use if we wish to perform multiple t-tests. For example, maybe we want to make pairwise comparisons of the sleeping times of all the different feeding behaviours in `msleep`: carnivores, herbivores, insectivores and omnivores\index{hypothesis test!multiple testing}. To find all possible pairs, we can use a nested `for` loop (Section \@ref(nestedloops)). Note how the indices `i` and `j` that we loop over are set so that we only run the test for each combination once:
```{r eval=FALSE}
library(MKinfer)
# List the different feeding behaviours (ignoring NA's):
vores <- na.omit(unique(msleep$vore))
B <- length(vores)
# Compute the number of pairs, and create an appropriately
# sized data frame to store the p-values in:
n_comb <- choose(B, 2)
p_values <- data.frame(group1 = vector("character", n_comb),
                       group2 = vector("character", n_comb),
                       p = vector("numeric", n_comb))
# Loop over all pairs:
k <- 1 # Counter variable
for(i in 1:(B-1))
{
  for(j in (i+1):B)
  {
    # Run a t-test for the current pair:
    test_res <- perm.t.test(sleep_total ~ vore,
                            data = subset(msleep,
                              vore == vores[i] | vore == vores[j]))
    # Store the p-value:
    p_values[k, ] <- c(vores[i], vores[j], test_res$p.value)
    # Increase the counter variable:
    k <- k + 1
  }
}
```
To view the p-values for each pairwise test, we can now run:
```{r eval=FALSE}
p_values
```
When we run multiple tests, the risk of a type I error increases, to the point where we're virtually guaranteed to get a significant result. We can reduce the risk of false positive results by adjusting the p-values for multiplicity using `p.adjust`\index{\texttt{p.adjust}}, with for instance Bonferroni correction, Holm's method (an improved version of the standard Bonferroni approach), or the Benjamini-Hochberg approach (which controls the _false discovery rate_ and is useful if, for instance, you are screening a lot of variables for differences):
```{r eval=FALSE}
p.adjust(p_values$p, method = "bonferroni")
p.adjust(p_values$p, method = "holm")
p.adjust(p_values$p, method = "BH")
```
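If we want to keep the adjusted p-values next to the group labels, one option is to store them as additional columns of `p_values` (the column names below are arbitrary):
```{r eval=FALSE}
p_values$p_holm <- p.adjust(p_values$p, method = "holm")
p_values$p_BH <- p.adjust(p_values$p, method = "BH")
p_values
```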
### Multivariate testing with Hotelling's $T^2$ {#hotellingst2}
If you are interested in comparing the means of several variables for two groups, using a multivariate test is sometimes a better option than running multiple univariate t-tests. The multivariate generalisation of the t-test, Hotelling's $T^2$, is available through the `Hotelling`\index{\texttt{Hotelling}} package:
```{r eval=FALSE}
install.packages("Hotelling")
```
As an example, consider the `airquality` data. Let's say that we want to test whether the mean ozone, solar radiation, wind speed, and temperature differ between June and July. We could use four separate t-tests to test this, but we could also use Hotelling's $T^2$ to test the null hypothesis that the mean vector, i.e. the vector containing the four means, is the same for both months. The function used for this is `hotelling.test`\index{\texttt{hotelling.test}}:
```{r eval=FALSE}
# Subset the data:
airquality_t2 <- subset(airquality, Month == 6 | Month == 7)
# Run the test under the assumption of normality:
library(Hotelling)
t2 <- hotelling.test(Ozone + Solar.R + Wind + Temp ~ Month,
                     data = airquality_t2)
t2
# Run a permutation test instead:
t2 <- hotelling.test(Ozone + Solar.R + Wind + Temp ~ Month,
                     data = airquality_t2, perm = TRUE)
t2
```
### Sample size computations for the t-test
In any study, it is important to collect enough data for the inference that we wish to make. If we want to use a t-test for a test about a mean or the difference of two means, what constitutes "enough data" is usually measured by the power of the test. The sample is large enough when the test achieves high enough power. If we are comfortable assuming normality (and we may well be, especially as the main goal with sample size computations is to get a ballpark figure), we can use `power.t.test`\index{\texttt{power.t.test}} to compute what power our test would achieve under different settings. For a two-sample test with unequal variances, we can use `power.welch.t.test`\index{\texttt{power.welch.t.test}} from `MKpower`\index{\texttt{MKpower}} instead. Both functions can be used to either find the sample size required for a certain power, or to find out what power will be obtained from a given sample size.
First of all, let's install `MKpower`:
```{r eval=FALSE}
install.packages("MKpower")
```
`power.t.test` and `power.welch.t.test` both use `delta` to denote the mean difference under the alternative hypothesis. In addition, we must supply the standard deviation `sd` of the distribution (or `sd1` and `sd2` for the Welch test, which allows the two groups to have different standard deviations). Here are some examples:
```{r eval=FALSE}
library(MKpower)
# A one-sided one-sample test with 80 % power:
power.t.test(power = 0.8, delta = 1, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
# A two-sided two-sample test with sample size n = 25 and equal
# variances:
power.t.test(n = 25, delta = 1, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# A one-sided two-sample test with 90 % power and equal variances:
power.t.test(power = 0.9, delta = 1, sd = 0.5, sig.level = 0.01,
             type = "two.sample", alternative = "one.sided")
# A one-sided two-sample test with 90 % power and unequal variances:
power.welch.t.test(power = 0.9, delta = 1, sd1 = 0.5, sd2 = 1,
                   sig.level = 0.01,
                   type = "two.sample", alternative = "one.sided")
```
You may wonder how to choose `delta` and `sd`. If possible, it is good to base these numbers on a pilot study or related previous work. If no such data is available, your guess is as good as mine. For `delta`, some useful terminology comes from medical statistics, where the concept of _clinical significance_ is used increasingly often. Make sure that `delta` is large enough to be clinically significant, that is, large enough to actually matter in practice.
If we have reason to believe that the data follows a non-normal distribution, another option is to use simulation to compute the sample size that will be required. We'll do just that in Section \@ref(sscus).
```{exercise, label="ch7exc7"}
Return to the one-sided t-test that you performed in Exercise \@ref(exr:ch7exc5). Assume that `delta` is 0.5 (i.e. that the true mean is 6.5) and that the standard deviation is 2. How large does the sample size $n$ have to be for the power of the test to be 95 % at a 5 % significance level? What is the power of the test when the sample size is $n=2,351$?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions7)
### A Bayesian approach
The Bayesian paradigm differs in many ways from the frequentist approach that we use in the rest of this chapter. In Bayesian statistics, we first define a _prior distribution_ for the parameters that we are interested in, representing our beliefs about them (for instance based on previous studies). Bayes' theorem is then used to derive the _posterior distribution_, i.e. the distribution of the parameters given the prior distribution and the data. Philosophically, this is very different from frequentist estimation, in which we don't incorporate prior beliefs into our models (except through which variables we include).
In many situations, we don't have access to data that can be used to create an _informative_ prior distribution. In such cases, we can use a so-called weakly informative prior instead. These act as "default priors", representing large uncertainty about the values of the parameters.
The `rstanarm` package contains methods for using Bayesian estimation to fit some common statistical models. It takes a while to install, but it is well worth it:
```{r eval=FALSE}
install.packages("rstanarm")
```
To use a Bayesian model with a weakly informative prior to analyse the difference in sleeping time between herbivores and carnivores, we load `rstanarm` and use `stan_glm`\index{\texttt{stan\_glm}} in much the same way as we used `t.test` with formula notation:
```{r eval=FALSE}
library(rstanarm)
library(ggplot2)
m <- stan_glm(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
# Print the estimates:
m
```
There are two estimates here: an "intercept" (the average sleeping time for carnivores) and `voreherbi` (the difference between carnivores and herbivores). To plot the posterior distribution of the difference, we can use `plot`:
```{r eval=FALSE}
plot(m, "dens", pars = c("voreherbi"))
```
To get a 95 % credible interval (the Bayesian equivalent of a confidence interval) for the difference, we can use `posterior_interval`\index{\texttt{posterior\_interval}} as follows:
```{r eval=FALSE}
posterior_interval(m,
pars = c("voreherbi"),
prob = 0.95)
```
p-values are not a part of Bayesian statistics, so don't expect any. It is however possible to perform a kind of Bayesian test of whether there is a difference by checking whether the credible interval for the difference contains 0. If not, there is evidence that there is a difference (Thulin, 2014c). In this case, 0 is contained in the interval, and there is no evidence of a difference.
In most cases, Bayesian estimation is done using Monte Carlo integration (specifically, a class of methods known as Markov Chain Monte Carlo, MCMC). To check that the model fitting has converged, we can use a measure called $\hat{R}$. It should be less than 1.1 if the fitting has converged:
```{r eval=FALSE}
plot(m, "rhat")
```
If the model fitting hasn't converged, you may need to increase the number of iterations of the MCMC algorithm. You can increase the number of iterations by adding the argument `iter` to `stan_glm` (the default is 2,000).
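For instance, to double the number of iterations for the model above, something like the following can be used:
```{r eval=FALSE}
m <- stan_glm(sleep_total ~ vore, data =
                subset(msleep, vore == "carni" | vore == "herbi"),
              iter = 4000)
```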
If you want to use a custom prior for your analysis, that is of course possible too. See `?priors` and `?stan_glm` for details about this, and about the default weakly informative prior.
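As a sketch of what that can look like, here is the same model with an arbitrarily chosen, purely illustrative normal prior (mean 0, scale 2) for the coefficient:
```{r eval=FALSE}
m_custom <- stan_glm(sleep_total ~ vore, data =
                       subset(msleep, vore == "carni" | vore == "herbi"),
                     prior = normal(location = 0, scale = 2))
```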
## Other common hypothesis tests and confidence intervals
There are thousands of statistical tests in addition to the t-test, and equally many methods for computing confidence intervals for different parameters. In this section we will have a look at some useful tools: the nonparametric Wilcoxon-Mann-Whitney test for location, tests for correlation, $\chi^2$-tests for contingency tables, and confidence intervals for proportions.
### Nonparametric tests of location
The Wilcoxon-Mann-Whitney test, `wilcox.test` in R\index{\texttt{wilcox.test}}, is a nonparametric alternative to the t-test that is based on ranks. `wilcox.test` is used in the same way as `t.test`.
We can use two vectors as input:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
wilcox.test(carnivores$sleep_total, herbivores$sleep_total)
```
Or use a formula:
```{r eval=FALSE}
wilcox.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
### Tests for correlation
To test the null hypothesis that two numerical variables are uncorrelated (against the alternative that they are correlated), we can use `cor.test`\index{\texttt{cor.test}}\index{hypothesis test!correlation}. Let's try it with sleeping times and brain weight, using the `msleep` data again:
```{r eval=FALSE}
library(ggplot2)
cor.test(msleep$sleep_total, msleep$brainwt,
use = "pairwise.complete")
```
Pairs where either value is `NA` are ignored in the test; `cor.test` drops incomplete pairs automatically (the `use = "pairwise.complete"` setting mirrors the syntax of `cor` and mainly serves to make the intent explicit).
`cor.test` doesn't have a `data` argument, so if you want to use it in a pipeline I recommend using the `%$%` pipe (Section \@ref(morepipes)) to pass on the vectors from your data frame:
```{r eval=FALSE}
library(magrittr)
msleep %$% cor.test(sleep_total, brainwt, use = "pairwise.complete")
```
The test we just performed uses the Pearson correlation coefficient as its test statistic. If you prefer, you can use the nonparametric Spearman and Kendall correlation coefficients in the test instead, by changing the value of `method`:
```{r eval=FALSE}
# Spearman test of correlation:
cor.test(msleep$sleep_total, msleep$brainwt,
use = "pairwise.complete",
method = "spearman")
```
These tests are all based on asymptotic approximations, which among other things causes the Pearson correlation test to perform poorly for non-normal data. In Section \@ref(bootstrap) we will create a bootstrap version of the correlation test, which has better performance.
### $\chi^2$-tests
$\chi^2$ (chi-squared) tests are most commonly used to test whether two categorical variables are independent. To run such a test, we must first construct a contingency table, i.e. a table showing the counts for different combinations of categories, typically using `table`. Here is an example with the `diamonds` data from `ggplot2`:
```{r eval=FALSE}
library(ggplot2)
table(diamonds$cut, diamonds$color)
```
The null hypothesis of our test is that the quality of the cut (`cut`) and the colour of the diamond (`color`) are independent, with the alternative being that they are dependent. We use `chisq.test`\index{\texttt{chisq.test}}\index{hypothesis test!independence}\index{hypothesis test!chi-squared} with the contingency table as input to run the $\chi^2$ test of independence:
```{r eval=FALSE}
chisq.test(table(diamonds$cut, diamonds$color))
```
By default, `chisq.test` uses an asymptotic approximation of the p-value. For small sample sizes, it is often better to use simulation-based (permutation) p-values by setting `simulate.p.value = TRUE` (here the sample is not small, though, so the computation will take a while):
```{r eval=FALSE}
chisq.test(table(diamonds$cut, diamonds$color),
simulate.p.value = TRUE)
```
As with `t.test`, we can use pipes to perform the test if we like:
```{r eval=FALSE}
library(magrittr)
diamonds %$% table(cut, color) %>%
chisq.test()
```
If both of the variables are binary, i.e. only take two values, the power of the test can be approximated using `power.prop.test`\index{\texttt{power.prop.test}}. Let's say that we have two variables, $X$ and $Y$, taking the values 0 and 1. Assume that we collect $n$ observations with $X=0$ and $n$ with $X=1$. Furthermore, let `p1` be the probability that $Y=1$ if $X=0$ and `p2` be the probability that $Y=1$ if $X=1$. We can then use `power.prop.test` as follows:
```{r eval=FALSE}
# Assume that n = 50, p1 = 0.4 and p2 = 0.5 and compute the power:
power.prop.test(n = 50, p1 = 0.4, p2 = 0.5, sig.level = 0.05)
# Assume that p1 = 0.4 and p2 = 0.5 and that we want 85 % power.
# To compute the sample size required:
power.prop.test(power = 0.85, p1 = 0.4, p2 = 0.5, sig.level = 0.05)
```
### Confidence intervals for proportions {#confprop}
The different t-test functions provide confidence intervals for means and differences of means. But what about proportions? The `binomCI`\index{\texttt{binomCI}}\index{confidence interval!proportion} function in the `MKinfer` package allows us to compute confidence intervals for proportions from binomial experiments using a number of methods. The input is the number of "successes" `x`, the sample size `n`, and the `method` to be used.
Let's say that we want to compute a confidence interval for the proportion of herbivore mammals that sleep for more than 7 hours a day.
```{r eval=FALSE}
library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",]
# Compute the number of animals for which we know the sleeping time:
n <- sum(!is.na(herbivores$sleep_total))
# Compute the number of "successes", i.e. the number of animals
# that sleep for more than 7 hours:
x <- sum(herbivores$sleep_total > 7, na.rm = TRUE)
```
The estimated proportion is `x/n`, which in this case is 0.625. We'd like to quantify the uncertainty in this estimate by computing a confidence interval. The standard Wald interval, taught in most introductory courses, can be computed as follows:
```{r eval=FALSE}
library(MKinfer)
binomCI(x, n, conf.level = 0.95, method = "wald")
```
Don't do that though! The Wald interval is known to be severely flawed (Brown et al., 2001), and much better options are available. If the proportion can be expected to be close to 0 or 1, the Clopper-Pearson interval is recommended, and otherwise the Wilson interval is the best choice (Thulin, 2014a):
```{r eval=FALSE}
binomCI(x, n, conf.level = 0.95, method = "clopper-pearson")
binomCI(x, n, conf.level = 0.95, method = "wilson")
```
An excellent Bayesian credible interval is the Jeffreys interval, which uses the weakly informative Jeffreys prior:
```{r eval=FALSE}
binomCI(x, n, conf.level = 0.95, method = "jeffreys")
```
The `ssize.propCI`\index{\texttt{ssize.propCI}} function in `MKpower` can be used to compute the sample size needed to obtain a confidence interval with a given width^[Or rather, a given _expected_, or average, width. The width of the interval is a function of a random variable, and is therefore also random.]. It relies on asymptotic formulas that are highly accurate, as you later on will verify in Exercise \@ref(exr:ch7excCP).
```{r eval=FALSE}
library(MKpower)
# Compute the sample size required to obtain an interval with
# width 0.1 if the true proportion is 0.4:
ssize.propCI(prop = 0.4, width = 0.1, method = "wilson")
ssize.propCI(prop = 0.4, width = 0.1, method = "clopper-pearson")
```
$$\sim$$
```{exercise, label="ch7exc8"}
The function `binomDiffCI`\index{\texttt{binomDiffCI}}\index{confidence interval!difference of proportions} from `MKinfer` can be used to compute a confidence interval for the _difference_ of two proportions. Using the `msleep` data, use it to compute a confidence interval for the difference between the proportion of herbivores that sleep for more than 7 hours a day and the proportion of carnivores that sleep for more than 7 hours a day.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions8)
## Ethical issues in statistical inference {#ethicsinference}
The use and misuse of statistical inference offer many ethical dilemmas. Some common issues related to ethics and good statistical practice are discussed below. As you read them and work with the associated exercises, consider consulting the ASA's ethical guidelines, presented in Section \@ref(ethicalguidelines).
### p-hacking and the file-drawer problem
Hypothesis tests are easy to misuse. If you run enough tests on your data, you are almost guaranteed to end up with significant results - either due to chance or because some of the null hypotheses you test are false. The process of trying lots of different tests (different methods, different hypotheses, different sub-groups) in search of significant results is known as _p-hacking_ or _data dredging_\index{p-hacking}. This greatly increases the risk of false findings, and can often produce misleading results.
Many practitioners inadvertently resort to p-hacking, by mixing exploratory data analysis and hypothesis testing, or by coming up with new hypotheses to test as they work with their data. This can be avoided by planning your analyses in advance, a practice that in fact is required in medical trials.
On the other end of the spectrum, there is the _file-drawer problem_, in which studies with negative (i.e. not statistically significant) results aren't published or reported, but instead are stored in the researcher's file-drawers. There are many reasons for this, one being that negative results usually are seen as less important and less worthy of spending time on. Simply put, negative results just aren't news. If your study shows that eating kale every day significantly reduces the risk of cancer, then that is news, something that people are interested in learning, and something that can be published in a prestigious journal. However, if your study shows that a daily serving of kale has no impact on the risk of cancer, that's not news, people aren't really interested in hearing it, and it may prove difficult to publish your findings.
But what if 100 different researchers carried out the same study? If eating kale doesn't affect the risk of cancer, then we can still expect 5 out of these researchers to get significant results (using a 5 % significance level). If only those researchers publish their results, that may give the impression that there is strong evidence of the cancer-preventing effect of kale, backed up by several papers, even though the majority of studies actually indicated that there was no such effect.
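As a hypothetical back-of-the-envelope check, the number of significant studies among the 100 can be modelled as binomial, assuming that the studies are independent and all test a true null hypothesis at the 5 % level:
```{r eval=FALSE}
# Number of significant studies ~ Bin(100, 0.05):
100 * 0.05                          # Expected number of false positives
pbinom(0, size = 100, prob = 0.05,
       lower.tail = FALSE)          # Probability of at least one (about 0.99)
```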
$$\sim$$
```{exercise, label="ch7ethics1"}
_Discuss the following._ You are helping a research team with statistical analysis of data that they have collected. You agree on five hypotheses to test. None of the tests turns out significant. Fearing that all their hard work won't lead anywhere, your collaborators then ask you to carry out five new tests. None of these turns out significant. Your collaborators closely inspect the data and then ask you to carry out ten more tests, two of which are significant. The team wants to publish these significant results in a scientific journal. Should you agree to publish them? If so, what results should be published? Should you have put your foot down and told them not to run more tests? Does your answer depend on how long it took the research team to collect the data? What if the team won't get funding for new projects unless they publish a paper soon? What if other research teams competing for the same grants do their analyses like this?
```
<br>
```{exercise, label="ch7ethics2"}
_Discuss the following._ You are working for a company that is launching a new product, a hair-loss treatment. In a small study, the product worked for 19 out of 22 participants (86 %). You compute a 95 % Clopper-Pearson confidence interval (Section \@ref(confprop)) for the proportion of successes and find that it is (0.65, 0.97). Based on this, the company wants to market the product as being 97 % effective. Is that acceptable to you? If not, how should it be marketed? Would your answer change if the product was something else (new running shoes that make you faster, a plastic film that protects smartphone screens from scratches, or contraceptives)? What if the company wanted to market it as being 86 % effective instead?
```
<br>
```{exercise, label="ch7ethics3"}
_Discuss the following._ You have worked long and hard on a project. In the end, to see if the project was a success, you run a hypothesis test to check if two variables are correlated. You find that they are not (p = 0.15). However, if you remove three outliers, the two variables are significantly correlated (p = 0.03). What should you do? Does your answer change if you only have to remove one outlier to get a significant result? If you have to remove ten outliers? 100 outliers? What if the p-value is 0.051 before removing the outliers and 0.049 after removing the outliers?
```
<br>
```{exercise, label="ch7ethics4"}
_Discuss the following._ You are analysing data from an experiment to see if there is a difference between two treatments. You estimate^[We'll discuss methods for producing such estimates in Section \@ref(simpower).] that given the sample size and the expected difference in treatment effects, the power of the test that you'll be using, i.e. the probability of rejecting the null hypothesis if it is false, is about 15 %. Should you carry out such an analysis? If not, how high does the power need to be for the analysis to be meaningful?
```
### Reproducibility
An analysis is _reproducible_\index{reproducibility} if it can be reproduced by someone else. By producing reproducible analyses, we make it easier for others to scrutinise our work. We also make all the steps in the data analysis transparent. This can act as a safeguard against data fabrication and data dredging.
In order to make an analysis reproducible, we need to provide at least two things. First, _the data_ - all unedited data files in their original format. This also includes _metadata_ with information required to understand the data (e.g. codebooks explaining variable names and codes used for categorical variables). Second, _the computer code_ used to prepare and analyse the data. This includes any wrangling and preliminary testing performed on the data.
As long as we save our data files and code, data wrangling and analyses in R are inherently reproducible, in contrast to the same tasks carried out in menu-based software such as Excel. However, if reports are created using a word processor, there is always a risk that something will be lost along the way. Perhaps numbers are copied by hand (which may introduce errors), or maybe the wrong version of a figure is pasted into the document. R Markdown (Section \@ref(rmarkdown)) is a great tool for creating completely reproducible reports, as it allows you to integrate R code for data wrangling, analyses, and graphics in your report-writing. This reduces the risk of manually inserting errors, and allows you to share your work with others easily.
$$\sim$$
```{exercise, label="ch7ethics5"}
_Discuss the following._ You are working on a study at a small-town hospital. The data involves biomarker measurements for a number of patients, and you show that patients with a sexually transmitted disease have elevated levels of some of the biomarkers. The data also includes information about the patients: their names, ages, ZIP codes, heights, and weights. The research team wants to publish your results and make the analysis reproducible. Is it ethically acceptable to share all your data? Can you make the analysis reproducible without violating patient confidentiality?
```
## Evaluating statistical methods using simulation {#simeval}
An important use of simulation is in the evaluation of statistical methods. In this section, we will see how simulation can be used to compare the performance of different estimators, and to study the type I error rate and power of hypothesis tests.
### Comparing estimators
Let's say that we want to estimate the mean $\mu$ of a normal distribution. We could come up with several different estimators for $\mu$:
* The sample mean $\bar{x}$,
* The sample median $\tilde{x}$,
* The average of the largest and smallest value in the sample: $\frac{x_{max}+x_{min}}{2}$.
In this particular case (under normality), statistical theory tells us that the sample mean is the best estimator^[At least in terms of mean squared error.]. But how much better is it, really? And what if we didn't know statistical theory - could we use simulation to find out which estimator to use?
To begin with, let's write a function that computes the estimate $\frac{x_{max}+x_{min}}{2}$:
```{r eval=FALSE}
max_min_avg <- function(x)
{
      return((max(x) + min(x))/2)
}
```
Next, we'll generate some data from a $N(0,1)$ distribution and compute the three estimates:
```{r eval=FALSE}
x <- rnorm(25)
x_mean <- mean(x)
x_median <- median(x)
x_mma <- max_min_avg(x)
x_mean; x_median; x_mma
```
As you can see, the estimates given by the different approaches differ, so clearly the choice of estimator matters. We can't determine which to use based on a single sample though. Instead, we typically compare the long-run properties of estimators, such as their _bias_ and _variance_. The bias is the difference between the mean of the estimator and the parameter it seeks to estimate. An estimator is _unbiased_ if its bias is 0, which is considered desirable at least in this setting. Among unbiased estimators, we prefer the one that has the smallest variance. So how can we use simulation to compute the bias and variance of estimators?\index{simulation!bias and variance}
The key to using simulation here is to realise that `x_mean` is an observation of the random variable $\bar{X}= \frac{1}{25}(X_1+X_2+\cdots+X_{25})$ where each $X_i$ is $N(0, 1)$-distributed. We can generate observations of $X_i$ (using `rnorm`), and can therefore also generate observations of $\bar{X}$. That means that we can obtain an arbitrarily large sample of observations of $\bar{X}$, which we can use to estimate its mean and variance. Here is an example:
```{r eval=FALSE}
# Set the parameters for the normal distribution:
mu <- 0
sigma <- 1
# We will generate 10,000 observations of the estimators: