# Modern classical statistics {#modchapter}
"Modern classical" may sound like a contradiction, but it is in fact anything but. Classical statistics covers topics like estimation, quantification of uncertainty, and hypothesis testing - all of which are at the heart of data analysis. Since the advent of modern computers, much has happened in this field that has yet to make it to the standard textbooks of introductory courses in statistics. This chapter attempts to bridge part of that gap by dealing with those classical topics, but with a modern approach that uses more recent advances in statistical theory and computational methods. Particular focus is put on how simulation can be used for analyses and for evaluating the properties of statistical procedures.
Whenever it is feasible, our aim in this chapter and the next is to:
* Use hypothesis tests that are based on permutations or the bootstrap rather than tests based on strict assumptions about the distribution of the data or asymptotic distributions,
* Complement estimates and hypothesis tests with confidence intervals based on sound methods (including the bootstrap),
* Offer easy-to-use Bayesian methods as an alternative to frequentist tools.
After reading this chapter, you will be able to use R to:
* Generate random numbers,
* Perform simulations to assess the performance of statistical methods,
* Perform hypothesis tests,
* Compute confidence intervals,
* Make sample size computations,
* Report statistical results.
## Simulation and distributions {#simulation}
A _random variable_\index{random variable} is a variable whose value describes the outcome of a random phenomenon. A (probability) _distribution_\index{distribution} is a mathematical function that describes the probability of different outcomes for a random variable. Random variables and distributions are at the heart of probability theory and most, if not all, statistical models.
As we shall soon see, they are also invaluable tools when evaluating statistical methods. A key component of modern statistical work is _simulation_\index{simulation}, in which we generate artificial data that can be used both in the analysis of real data (e.g. in permutation tests and bootstrap confidence intervals, topics that we'll explore in this chapter) and for assessing different methods. Simulation is possible only because we can generate random numbers, so let's begin by having a look at how we can generate random numbers in R.
### Generating random numbers
The function `sample`\index{\texttt{sample}} can be used to randomly draw a number of elements from a vector. For instance, we can use it to draw 2 random numbers from the first ten integers: $1, 2, \ldots, 9, 10$:
```{r eval=FALSE}
sample(1:10, 2)
```
Try running the above code multiple times. You'll get different results each time, because each time it runs the random number generator is in a different _state_. In most cases, this is desirable (if the results were the same each time we used `sample`, it wouldn't be random), but not if we want to replicate a result at some later stage.
When we are concerned about reproducibility, we can use `set.seed`\index{\texttt{set.seed}} to fix the state of the random number generator:
```{r eval=FALSE}
# Each run generates different results:
sample(1:10, 2); sample(1:10, 2)
# To get the same result each time, set the seed to a
# number of your choice:
set.seed(314); sample(1:10, 2)
set.seed(314); sample(1:10, 2)
```
We often want to use simulated data from a probability distribution, such as the normal distribution. The normal distribution is defined by its mean $\mu$ and its variance $\sigma^2$ (or, equivalently, its standard deviation $\sigma$). There are special functions for generating data from different distributions - for the normal distribution it is called `rnorm`. We specify the number of observations that we want to generate (`n`) and the parameters of the distribution (the mean `mean` and the standard deviation `sd`):
```{r eval=FALSE}
rnorm(n = 10, mean = 2, sd = 1)
# A shorter version:
rnorm(10, 2, 1)
```
Similarly, there are functions that can be used to compute the quantile function, density function, and cumulative distribution function (CDF) of the normal distribution. Here are some examples for a normal distribution with mean 2 and standard deviation 1:
```{r eval=FALSE}
qnorm(0.9, 2, 1) # Upper 90 % quantile of distribution
dnorm(2.5, 2, 1) # Density function f(2.5)
pnorm(2.5, 2, 1) # Cumulative distribution function F(2.5)
```
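The quantile function and the CDF are inverses of each other, which we can verify numerically:
```{r eval=FALSE}
pnorm(qnorm(0.9, 2, 1), 2, 1)  # Returns 0.9
qnorm(pnorm(2.5, 2, 1), 2, 1)  # Returns 2.5
```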
$$\sim$$
```{exercise, label="ch7exc1"}
Sampling can be done with or without _replacement_. If replacement is used, an observation can be drawn more than once. Check the documentation for `sample`. How can you change the settings to sample with replacement? Draw 5 random numbers from the first ten integers, with replacement.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions1)
### Some common distributions {#distfunctions}
Next, we provide the syntax for random number generation, quantile functions, density/probability functions and cumulative distribution functions for some of the most commonly used distributions. This section is mainly intended as a reference, for you to look up when you need to use one of these distributions - so there is no need to run all the code chunks below right now.
Normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$\index{distribution!normal}\index{\texttt{rnorm}}:
```{r eval=FALSE}
rnorm(n, mu, sigma) # Generate n random numbers
qnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dnorm(x, mu, sigma) # Density function f(x)
pnorm(x, mu, sigma) # Cumulative distribution function F(X)
```
Continuous uniform distribution $U(a,b)$ on the interval $(a,b)$, with mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$\index{distribution!uniform}\index{\texttt{runif}}:
```{r eval=FALSE}
runif(n, a, b) # Generate n random numbers
qunif(0.95, a, b) # Upper 95 % quantile of distribution
dunif(x, a, b) # Density function f(x)
punif(x, a, b) # Cumulative distribution function F(X)
```
Exponential distribution $Exp(m)$ with mean $m$ and variance $m^2$\index{distribution!exponential}\index{\texttt{rexp}}:
```{r eval=FALSE}
rexp(n, 1/m) # Generate n random numbers
qexp(0.95, 1/m) # Upper 95 % quantile of distribution
dexp(x, 1/m) # Density function f(x)
pexp(x, 1/m) # Cumulative distribution function F(X)
```
Gamma distribution $\Gamma(\alpha, \beta)$ with mean $\frac{\alpha}{\beta}$ and variance $\frac{\alpha}{\beta^2}$\index{distribution!gamma}\index{\texttt{rgamma}}:
```{r eval=FALSE}
rgamma(n, alpha, beta) # Generate n random numbers
qgamma(0.95, alpha, beta) # Upper 95 % quantile of distribution
dgamma(x, alpha, beta) # Density function f(x)
pgamma(x, alpha, beta) # Cumulative distribution function F(X)
```
Lognormal distribution $LN(\mu, \sigma^2)$ with mean $\exp(\mu+\sigma^2/2)$ and variance $(\exp(\sigma^2)-1)\exp(2\mu+\sigma^2)$\index{distribution!lognormal}\index{\texttt{rlnorm}}:
```{r eval=FALSE}
rlnorm(n, mu, sigma) # Generate n random numbers
qlnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dlnorm(x, mu, sigma) # Density function f(x)
plnorm(x, mu, sigma) # Cumulative distribution function F(X)
```
t-distribution $t(\nu)$ with mean 0 (for $\nu>1$) and variance $\frac{\nu}{\nu-2}$ (for $\nu>2$)\index{distribution!t}\index{\texttt{rt}}:
```{r eval=FALSE}
rt(n, nu) # Generate n random numbers
qt(0.95, nu) # Upper 95 % quantile of distribution
dt(x, nu) # Density function f(x)
pt(x, nu) # Cumulative distribution function F(X)
```
Chi-squared distribution $\chi^2(k)$ with mean $k$ and variance $2k$\index{distribution!$\chi^2$}\index{\texttt{rchisq}}:
```{r eval=FALSE}
rchisq(n, k) # Generate n random numbers
qchisq(0.95, k) # Upper 95 % quantile of distribution
dchisq(x, k) # Density function f(x)
pchisq(x, k) # Cumulative distribution function F(X)
```
F-distribution $F(d_1, d_2)$ with mean $\frac{d_2}{d_2-2}$ (for $d_2>2$) and variance $\frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$ (for $d_2>4$)\index{distribution!F}\index{\texttt{rf}}:
```{r eval=FALSE}
rf(n, d1, d2) # Generate n random numbers
qf(0.95, d1, d2) # Upper 95 % quantile of distribution
df(x, d1, d2) # Density function f(x)
pf(x, d1, d2) # Cumulative distribution function F(X)
```
Beta distribution\index{distribution!beta}\index{\texttt{rbeta}} $Beta(\alpha,\beta)$ with mean $\frac{\alpha}{\alpha+\beta}$ and variance $\frac{\alpha \beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$:
```{r eval=FALSE}
rbeta(n, alpha, beta) # Generate n random numbers
qbeta(0.95, alpha, beta) # Upper 95 % quantile of distribution
dbeta(x, alpha, beta) # Density function f(x)
pbeta(x, alpha, beta) # Cumulative distribution function F(X)
```
Binomial distribution $Bin(n,p)$ with mean $np$ and variance $np(1-p)$\index{distribution!binomial}\index{\texttt{rbinom}}:
```{r eval=FALSE}
rbinom(n, n, p) # Generate n random numbers (2nd argument: number of trials)
qbinom(0.95, n, p) # Upper 95 % quantile of distribution
dbinom(x, n, p) # Probability function f(x)
pbinom(x, n, p) # Cumulative distribution function F(X)
```
Poisson distribution $Po(\lambda)$ with mean $\lambda$ and variance $\lambda$\index{distribution!Poisson}\index{\texttt{rpois}}:
```{r eval=FALSE}
rpois(n, lambda) # Generate n random numbers
qpois(0.95, lambda) # Upper 95 % quantile of distribution
dpois(x, lambda) # Probability function f(x)
ppois(x, lambda) # Cumulative distribution function F(X)
```
Negative binomial distribution $NegBin(r, p)$ with mean $\frac{r(1-p)}{p}$ and variance $\frac{r(1-p)}{p^2}$ (R's parametrisation, where $p$ is the success probability and the variable counts the number of failures before the $r$th success)\index{distribution!negative binomial}\index{\texttt{rnbinom}}:
```{r eval=FALSE}
rnbinom(n, r, p) # Generate n random numbers
qnbinom(0.95, r, p) # Upper 95 % quantile of distribution
dnbinom(x, r, p) # Probability function f(x)
pnbinom(x, r, p) # Cumulative distribution function F(X)
```
Multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$:
```{r eval=FALSE}
library(MASS)
mvrnorm(n, mu, Sigma) # Generate n random numbers
```
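As a small illustration, the covariance matrix is supplied as an ordinary R matrix; the mean vector and covariance matrix below are arbitrary choices, not taken from any dataset:
```{r eval=FALSE}
library(MASS)
# Illustrative parameters for a bivariate normal distribution:
mu <- c(0, 2)
Sigma <- matrix(c(1,   0.5,
                  0.5, 2),
                nrow = 2, byrow = TRUE)
# Generate 5 random vectors (one per row):
mvrnorm(5, mu, Sigma)
```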
$$\sim$$
```{exercise, label="ch7exc2"}
Use `runif` and (at least) one of `round`, `ceiling` and `floor` to generate observations from a discrete random variable on the integers $1, 2, 3, 4, 5, 6, 7, 8, 9, 10$.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions2)
### Assessing distributional assumptions
So how can we know that the functions for generating random observations from distributions work? And when working with real data, how can we know what distribution fits the data? One answer is that we can visually compare the distribution of the generated (or real) data to the target distribution. This can for instance be done by comparing a histogram of the data to the target distribution's density function.
To do so, we must add `aes(y = ..density..)` to the call to `geom_histogram`, which rescales the histogram to have area 1 (just like a density function). We can then add the density function using `geom_function`\index{\texttt{geom\_function}}:
```{r eval=FALSE}
# Generate data from a normal distribution with mean 10 and
# standard deviation 1
generated_data <- data.frame(normal_data = rnorm(1000, 10, 1))
library(ggplot2)
# Compare to histogram:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_function(fun = dnorm, colour = "red", size = 2,
                args = list(mean = mean(generated_data$normal_data),
                            sd = sd(generated_data$normal_data)))
```
Try increasing the number of observations generated. As the number of observations increases, the histogram should start to look more and more like the density function.
We could also add a density estimate for the generated data, to further aid the eye here - we'd expect this to be close to the theoretical density function:
```{r eval=FALSE}
# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_density(colour = "blue", size = 2) +
  geom_function(fun = dnorm, colour = "red", size = 2,
                args = list(mean = mean(generated_data$normal_data),
                            sd = sd(generated_data$normal_data)))
```
If instead we wished to compare the distribution of the data to a $\chi^2$ distribution, we would change the value of `fun` and `args` in `geom_function` accordingly:
```{r eval=FALSE}
# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
  geom_histogram(colour = "black", aes(y = ..density..)) +
  geom_density(colour = "blue", size = 2) +
  geom_function(fun = dchisq, colour = "red", size = 2,
                args = list(df = mean(generated_data$normal_data)))
```
Note that the values of `args` have changed. `args` should always be a list containing values for the parameters of the distribution: `mean` and `sd` for the normal distribution and `df` for the $\chi^2$ distribution (these are the argument names of the corresponding density functions from Section \@ref(distfunctions)).
Another option is to draw a quantile-quantile plot, or Q-Q plot for short, which compares the theoretical quantiles of a distribution to the empirical quantiles of the data, showing each observation as a point. If the data follows the theorised distribution, then the points should lie more or less along a straight line.
To draw a Q-Q plot for a normal distribution, we use the geoms `geom_qq` and `geom_qq_line`\index{\texttt{geom\_qq}}\index{\texttt{geom\_qq\_line}}:
```{r eval=FALSE}
# Q-Q plot for normality:
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq() + geom_qq_line()
```
For all other distributions, we must provide the quantile function of the distribution (many of which can be found in Section \@ref(distfunctions)):
```{r eval=FALSE}
# Q-Q plot for the lognormal distribution:
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq(distribution = qlnorm) +
  geom_qq_line(distribution = qlnorm)
```
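If you want to compare against specific parameter values rather than the quantile function's defaults, these can be passed as a list through the `dparams` argument of `geom_qq` and `geom_qq_line`. A sketch with arbitrary, purely illustrative parameter values:
```{r eval=FALSE}
# Q-Q plot for a lognormal distribution with meanlog = 2 and
# sdlog = 0.5 (illustrative values):
ggplot(generated_data, aes(sample = normal_data)) +
  geom_qq(distribution = qlnorm,
          dparams = list(meanlog = 2, sdlog = 0.5)) +
  geom_qq_line(distribution = qlnorm,
               dparams = list(meanlog = 2, sdlog = 0.5))
```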
Q-Q-plots can be a little difficult to read. There will always be points deviating from the line - in fact, that's expected. So how much must they deviate before we rule out a distributional assumption? Particularly when working with real data, I like to compare the Q-Q-plot of my data to Q-Q-plots of simulated samples from the assumed distribution, to get a feel for what kind of deviations can appear if the distributional assumption holds. Here's an example of how to do this, for the normal distribution:
```{r eval=FALSE}
# Look at solar radiation data for May from the airquality
# dataset:
May <- airquality[airquality$Month == 5,]
# Create a Q-Q-plot for the solar radiation data, and store
# it in a list:
qqplots <- list(ggplot(May, aes(sample = Solar.R)) +
                  geom_qq() + geom_qq_line() + ggtitle("Actual data"))
# Compute the sample size n, i.e. the number of non-missing
# solar radiation values:
n <- sum(!is.na(May$Solar.R))
# Generate 8 new datasets of size n from a normal distribution.
# Then draw Q-Q-plots for these and store them in the list:
for(i in 2:9)
{
  generated_data <- data.frame(normal_data = rnorm(n, 10, 1))
  qqplots[[i]] <- ggplot(generated_data, aes(sample = normal_data)) +
    geom_qq() + geom_qq_line() + ggtitle("Simulated data")
}
# Plot the resulting Q-Q-plots side-by-side:
library(patchwork)
(qqplots[[1]] + qqplots[[2]] + qqplots[[3]]) /
  (qqplots[[4]] + qqplots[[5]] + qqplots[[6]]) /
  (qqplots[[7]] + qqplots[[8]] + qqplots[[9]])
```
You can run the code several times, to get more examples of what Q-Q-plots can look like when the distributional assumption holds. In this case, the tail points in the Q-Q-plot for the solar radiation data deviate from the line more than the tail points in most simulated examples do, and personally, I'd be reluctant to assume that the data comes from a normal distribution.
$$\sim$$
```{exercise, label="ch7exc3"}
Investigate the sleeping times in the `msleep` data from the `ggplot2` package. Do they appear to follow a normal distribution? A lognormal distribution?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions3)
<br>
```{exercise, label="ch7exc4"}
Another approach to assessing distributional assumptions for real data is to use formal hypothesis tests. One example is the Shapiro-Wilk test for normality, available in `shapiro.test`\index{\texttt{shapiro.test}}. The null hypothesis is that the data comes from a normal distribution, and the alternative is that it doesn't (meaning that a low p-value is supposed to imply non-normality).
1. Apply `shapiro.test` to the sleeping times in the `msleep` dataset. According to the Shapiro-Wilk test, is the data normally distributed?
2. Generate 2,000 observations from a $\chi^2(100)$ distribution. Compare the histogram of the generated data to the density function of a normal distribution. Are they similar? What are the results when you apply the Shapiro-Wilk test to the data?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions4)
### Monte Carlo integration
In this chapter, we will use simulation to compute p-values and confidence intervals, to compare different statistical methods, and to perform sample size computations. Another important use of simulation is _Monte Carlo integration_, in which random numbers are used for numerical integration. It plays an important role in, for instance, statistical physics, computational biology, computational linguistics, and Bayesian statistics - fields that require the computation of complicated integrals.
To create an example of Monte Carlo integration, let's start by writing a function, `circle`, that defines a quarter-circle on the unit square. We will then plot it using the geom `geom_function`\index{\texttt{geom\_function}}:
```{r eval=FALSE}
circle <- function(x)
{
  return(sqrt(1-x^2))
}
ggplot(data.frame(x = c(0, 1)), aes(x)) +
  geom_function(fun = circle)
```
Let's say that we are interested in computing the area under the quarter-circle. We can highlight the area in our plot using `geom_area`\index{\texttt{geom\_area}}:
```{r eval=FALSE}
ggplot(data.frame(x = seq(0, 1, 1e-4)), aes(x)) +
  geom_area(aes(x = x,
                y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)),
            fill = "pink") +
  geom_function(fun = circle)
```
To find the area, we will generate a large number of random points uniformly in the unit square. By the law of large numbers, the proportion of points that end up under the quarter-circle should be close to the area under the quarter-circle^[In general, the proportion of points that fall below the curve will be proportional to the area under the curve _relative_ to the area of the sample space. In this case the sample space is the unit square, which has area 1, meaning that the relative area is the same as the absolute area.]. To do this, we generate 10,000 random values for the $x$ and $y$ coordinates of each point using the $U(0,1)$ distribution, that is, using `runif`:
```{r eval=FALSE}
B <- 1e4
unif_points <- data.frame(x = runif(B), y = runif(B))
```
Next, we add the points to our plot:
```{r eval=FALSE}
ggplot(unif_points, aes(x, y)) +
  geom_area(aes(x = x,
                y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)),
            fill = "pink") +
  geom_point(size = 0.5, alpha = 0.25,
             colour = ifelse(unif_points$x^2 + unif_points$y^2 <= 1,
                             "red", "black")) +
  geom_function(fun = circle)
```
Note the order in which we placed the geoms - we plot the points after the area so that the pink colour won't cover the points, and the function after the points so that the points won't cover the curve.
To estimate the area, we compute the proportion of points that are below the curve:
```{r eval=FALSE}
mean(unif_points$x^2 + unif_points$y^2 <= 1)
```
In this case, we can also compute the area exactly: $\int_0^1\sqrt{1-x^2}dx=\pi/4=0.7853\ldots$. For more complicated integrals, however, numerical integration methods like Monte Carlo integration may be required. That being said, there are better numerical integration methods for low-dimensional integrals like this one. Monte Carlo integration is primarily used for higher-dimensional integrals, where other techniques fail.
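If we want a quick sanity check of the Monte Carlo estimate, we can compare it to the exact value and to R's built-in adaptive quadrature routine `integrate`:
```{r eval=FALSE}
# Exact value of the integral:
pi / 4
# Numerical integration using adaptive quadrature:
integrate(circle, lower = 0, upper = 1)
```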
## Student's t-test revisited {#ttest}
For decades teachers all over the world have been telling the story of William Sealy Gosset: the head brewer at Guinness who derived the formulas used for the t-test and, following company policy, published the results under the pseudonym "Student".
Gosset's work was hugely important, but the passing of time has rendered at least parts of it largely obsolete. His distributional formulas were derived out of necessity: lacking the computer power that we have available to us today, he was forced to impose the assumption of normality on the data, in order to derive the formulas he needed to be able to carry out his analyses. Today we can use simulation to carry out analyses with fewer assumptions. As an added bonus, these simulation techniques often happen to result in statistical methods with better performance than Student's t-test and other similar methods.
### The old-school t-test {#oldschoolt}
The _really_ old-school way of performing a t-test - the way statistical pioneers like Gosset and Fisher would have done it - is to look up p-values using tables covering several pages. There haven't really been any excuses for doing that since the advent of the personal computer though, so let's not go further into that. The "modern" version of the old-school t-test uses numerical evaluation of the formulas for Student's t-distribution to compute p-values and confidence intervals. Before we delve into more modern approaches, let's look at how we can run an old-school t-test in R.\index{hypothesis test!t-test}
In Section \@ref(firstttest) we used `t.test`\index{\texttt{t.test}} to run a t-test to see if there is a difference in how long carnivores and herbivores sleep, using the `msleep` data from `ggplot2`^[Note that this is not a random sample of mammals, and so one of the fundamental assumptions behind the t-test isn't valid in this case. For the purpose of showing how to use the t-test, the data is good enough though.]. First, we extracted a subset of the data corresponding to carnivores and herbivores, and then we ran the test. There are in fact several different ways of doing this, and it is probably a good idea to have a look at them.
In the approach used in Section \@ref(firstttest) we created two vectors, using bracket notation, and then used those as arguments for `t.test`:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
t.test(carnivores$sleep_total, herbivores$sleep_total)
```
Alternatively, we could have used formula notation, as we did, for example, for the linear model in Section \@ref(firstlm). We'd then have to use the `data` argument in `t.test` to supply the data. By using `subset`, we can do the subsetting simultaneously:
```{r eval=FALSE}
t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Unless we are interested in keeping the vectors `carnivores` and `herbivores` for other purposes, this latter approach is arguably more elegant.
Speaking of elegance, the `data` argument also makes it easy to run a t-test using pipes. Here is an example, where we use `filter` from `dplyr` to do the subsetting:
```{r eval=FALSE}
library(dplyr)
msleep %>% filter(vore == "carni" | vore == "herbi") %>%
t.test(sleep_total ~ vore, data = .)
```
We could also use the `magrittr` pipe `%$%` from Section \@ref(morepipes) to pass the variables from the filtered subset of `msleep`, avoiding the `data` argument:
```{r eval=FALSE}
library(magrittr)
msleep %>% filter(vore == "carni" | vore == "herbi") %$%
t.test(sleep_total ~ vore)
```
There are even more options than this - the point I'm trying to make is that, like most functions in R, the functions for classical statistics can be used in many different ways. In what follows, I will show you one or two of these, but don't hesitate to try out other approaches if they seem better to you.
What we just did above was a two-sided t-test, where the null hypothesis was that there was no difference in means between the groups, and the alternative hypothesis that there was a difference. We can also perform one-sided tests using the `alternative` argument. `alternative = "greater"` means that the alternative is that the first group has a greater mean, and `alternative = "less"` means that the first group has a smaller mean. Here is an example with the former:
```{r eval=FALSE}
t.test(sleep_total ~ vore,
       data = subset(msleep, vore == "carni" | vore == "herbi"),
       alternative = "greater")
```
By default, R uses the Welch two-sample t-test, which does _not_ assume that the groups have equal variances. If you want to assume equal variances, you can add `var.equal = TRUE`:
```{r eval=FALSE}
t.test(sleep_total ~ vore,
       data = subset(msleep, vore == "carni" | vore == "herbi"),
       var.equal = TRUE)
```
In addition to two-sample t-tests, `t.test` can also be used for one-sample tests and paired t-tests. To perform a one-sample t-test, all we need to do is to supply a single vector with observations, along with the value of the mean $\mu$ under the null hypothesis. I usually sleep for about 7 hours each night, and so if I want to test whether that is true for an average mammal, I'd use the following:
```{r eval=FALSE}
t.test(msleep$sleep_total, mu = 7)
```
As we can see from the output, your average mammal sleeps for 10.4 hours per day. Moreover, the p-value is quite low - apparently, I sleep unusually little for a mammal!
As for paired t-tests, we can perform them by supplying two vectors (where element 1 of the first vector corresponds to element 1 of the second vector, and so on) and the argument `paired = TRUE`. For instance, using the `diamonds` data from `ggplot2`, we could run a test to see if the length `x` of diamonds with a fair quality of the cut on average equals the width `y`:
```{r eval=FALSE}
fair_diamonds <- subset(diamonds, cut == "Fair")
t.test(fair_diamonds$x, fair_diamonds$y, paired = TRUE)
```
$$\sim$$
```{exercise, label="ch7exc5"}
Load the VAS pain data `vas.csv` from Exercise \@ref(exr:ch3exc4). Perform a one-sided t-test of the null hypothesis that the average VAS among the patients during the time period is less than or equal to 6.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions5)
### Permutation tests
Maybe it was a little harsh to say that Gosset's formulas have become obsolete. The formulas are mathematical approximations to the distribution of the test statistics under the null hypothesis. The truth is that they work very well as long as your data is (nearly) normally distributed. The two-sample test also works well for non-normal data as long as you have balanced sample sizes, that is, equally many observations in both groups. However, for one-sample tests, and two-sample tests with imbalanced sample sizes, there are better ways to compute p-values and confidence intervals than to use Gosset's traditional formulas.\index{hypothesis test!permutation test}
The first option that we'll look at is permutation tests. Let's return to our mammal sleeping times example, where we wanted to investigate whether there are differences in how long carnivores and herbivores sleep on average:
```{r eval=FALSE}
t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
There are 19 carnivores and 32 herbivores - 51 animals in total. If there are no differences between the two groups, the `vore` labels offer no information about how long the animals sleep each day. Under the null hypothesis, the assignment of `vore` labels to different animals is therefore for all intents and purposes random. To find the distribution of the test statistic under the null hypothesis, we could look at all possible ways to assign 19 animals the label `carnivore` and 32 animals the label `herbivore`. That is, look at all permutations of the labels. The probability of a result at least as extreme as that obtained in our sample (in the direction of the alternative), i.e. the p-value, would then be the proportion of permutations that yield a result at least as extreme as that in our sample. This is known as a permutation test.
Permutation tests were known to the likes of Gosset and Fisher (Fisher's exact test is a common example), but because the number of permutations of labels tends to become very large (76,000 billion, in our carnivore-herbivore example), they lacked the means to actually use them. 76,000 billion permutations may be too many even today, but we can obtain very good approximations of the p-values of permutation tests using simulation.
The idea is that we look at a large number of randomly selected permutations and check for how many of them we obtain a test statistic that is more extreme than the sample test statistic. The law of large numbers guarantees that this proportion will converge to the permutation test p-value as the number of randomly selected permutations increases.
Let's have a go!
```{r eval=FALSE}
# Filter the data, to get carnivores and herbivores:
data <- subset(msleep, vore == "carni" | vore == "herbi")
# Compute the sample test statistic:
sample_t <- t.test(sleep_total ~ vore, data = data)$statistic
# Set the number of random permutations and create a vector to
# store the result in:
B <- 9999
permutation_t <- vector("numeric", B)
# Start progress bar:
pbar <- txtProgressBar(min = 0, max = B, style = 3)
# Compute the test statistic for B randomly selected permutations
for(i in 1:B)
{
  # Draw a permutation of the labels:
  data$vore <- sample(data$vore, length(data$vore),
                      replace = FALSE)
  # Compute statistic for permuted sample:
  permutation_t[i] <- t.test(sleep_total ~ vore,
                             data = data)$statistic
  # Update progress bar
  setTxtProgressBar(pbar, i)
}
close(pbar)
# In this case, with a two-sided alternative hypothesis, a
# "more extreme" test statistic is one that has a larger
# absolute value than the sample test statistic.
# Compute approximate permutation test p-value:
mean(abs(permutation_t) > abs(sample_t))
```
In this particular example, the resulting p-value is pretty close to that from the old-school t-test. However, we will soon see examples where the two versions of the t-test differ more.
You may ask why we used 9,999 permutations and not 10,000. The reason is that this way we avoid p-values that are exactly equal to traditional significance levels like 0.05 and 0.01. If we'd used 10,000 permutations, 500 of which yielded a statistic with a larger absolute value than the sample statistic, then the p-value would have been exactly 0.05, which would cause some difficulties in trying to determine whether or not the result was significant at the 5 % level. This cannot happen when we use 9,999 permutations instead (500 statistics with a larger absolute value yield the p-value $500/9999\approx 0.050005>0.05$, and 499 yield the p-value $499/9999\approx 0.0499<0.05$).
Having to write a `for` loop every time we want to run a t-test seems unnecessarily complicated. Fortunately, others have trodden this path before us. The `MKinfer`\index{\texttt{MKinfer}} package contains a function for (approximate) permutation t-tests, which also happens to be faster than our implementation above. Let's install it:
```{r eval=FALSE}
install.packages("MKinfer")
```
The function for the permutation t-test, `perm.t.test`\index{\texttt{perm.t.test}}\index{hypothesis test!t-test, permutation}, works exactly like `t.test`. In all the examples from Section \@ref(oldschoolt) we can replace `t.test` with `perm.t.test` to run a permutation t-test instead. Like so:
```{r eval=FALSE}
library(MKinfer)
perm.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Note that two p-values and confidence intervals are presented: one set from the permutations and one from the old-school approach - so make sure that you look at the right ones!
You may ask how many randomly selected permutations we need to get an accurate approximation of the permutation test p-value. By default, `perm.t.test` uses 9,999 permutations (you can change that number using the argument `R`), which is widely considered to be a reasonable number. If you are running a permutation test with a much more complex (and computationally intensive) statistic, you may have to use a lower number, but avoid that if you can.
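For instance, a version with 999 permutations could look as follows (fewer permutations gives a noisier approximation of the p-value):
```{r eval=FALSE}
perm.t.test(sleep_total ~ vore, data =
              subset(msleep, vore == "carni" | vore == "herbi"),
            R = 999)
```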
### The bootstrap
A popular method for computing p-values and confidence intervals that resembles the permutation approach is the bootstrap. Instead of drawing permuted samples, new observations are drawn with replacement from the original sample, and then labels are randomly allocated to them. That means that each randomly drawn sample will differ not only in the permutation of labels, but also in what observations are included - some may appear more than once and some not at all.
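To make the resampling step concrete, here is a minimal sketch of drawing a single bootstrap sample from a vector using `sample` with `replace = TRUE`; some observations may appear several times and others not at all:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
# Draw one bootstrap sample of the same size as the original:
boot_sample <- sample(carnivores$sleep_total,
                      size = nrow(carnivores),
                      replace = TRUE)
# Compare the original and resampled means:
mean(carnivores$sleep_total)
mean(boot_sample)
```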
We will have a closer look at the bootstrap in Section \@ref(bootstrap), where we will learn how to use it for creating confidence intervals and computing p-values for any test statistic. For now, we'll just note that `MKinfer` offers a bootstrap version of the t-test, `boot.t.test` \index{\texttt{boot.t.test}}\index{hypothesis test!t-test, bootstrap}:
```{r eval=FALSE}
library(MKinfer)
boot.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
Both `perm.t.test` and `boot.t.test` have a useful argument called `symmetric`, the details of which are discussed in depth in Section \@ref(twotypesofpvalues).
### Saving the output
When we run a t-test, the results are printed in the Console. But we can also store the results in a variable, which allows us to access e.g. the p-value of the test:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
test_result <- t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
test_result
```
What does the resulting object look like?
```{r eval=FALSE}
str(test_result)
```
As you can see, `test_result` is a `list` containing different parameters and vectors for the test. To get the p-value, we can run the following:
```{r eval=FALSE}
test_result$p.value
```
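Other components can be extracted in the same way. For instance, the confidence interval and the estimated group means are stored in the `conf.int` and `estimate` elements:
```{r eval=FALSE}
# Confidence interval for the difference in means:
test_result$conf.int
# Estimated group means:
test_result$estimate
```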
### Multiple testing {#multipletesting}
Some programming tools from Section \@ref(loopsection) can be of use if we wish to perform multiple t-tests. For example, maybe we want to make pairwise comparisons of the sleeping times of all the different feeding behaviours in `msleep`: carnivores, herbivores, insectivores and omnivores\index{hypothesis test!multiple testing}. To find all possible pairs, we can use a nested `for` loop (Section \@ref(nestedloops)). Note how the indices `i` and `j` that we loop over are set so that we only run the test for each combination once:
```{r eval=FALSE}
library(MKinfer)
# List the different feeding behaviours (ignoring NA's):
vores <- na.omit(unique(msleep$vore))
B <- length(vores)
# Compute the number of pairs, and create an appropriately
# sized data frame to store the p-values in:
n_comb <- choose(B, 2)
p_values <- data.frame(group1 = vector("character", n_comb),
                       group2 = vector("character", n_comb),
                       p = vector("numeric", n_comb))
# Loop over all pairs:
k <- 1 # Counter variable
for(i in 1:(B-1))
{
  for(j in (i+1):B)
  {
    # Run a t-test for the current pair:
    test_res <- perm.t.test(sleep_total ~ vore,
                            data = subset(msleep,
                              vore == vores[i] | vore == vores[j]))
    # Store the p-value:
    p_values[k, ] <- c(vores[i], vores[j], test_res$p.value)
    # Increase the counter variable:
    k <- k + 1
  }
}
```
To view the p-values for each pairwise test, we can now run:
```{r eval=FALSE}
p_values
```
When we run multiple tests, the risk of a type I error increases, to the point where we're virtually guaranteed to get a significant result. We can reduce the risk of false positive results by adjusting the p-values for multiplicity using `p.adjust`\index{\texttt{p.adjust}}, with for instance Bonferroni correction, Holm's method (an improved version of the standard Bonferroni approach), or the Benjamini-Hochberg approach (which controls the _false discovery rate_ and is useful if, for instance, you are screening a lot of variables for differences):
```{r eval=FALSE}
p.adjust(p_values$p, method = "bonferroni")
p.adjust(p_values$p, method = "holm")
p.adjust(p_values$p, method = "BH")
```
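If we want to keep the adjusted p-values next to the group labels, one option is to store them as additional columns of `p_values` (the column names below are arbitrary):
```{r eval=FALSE}
p_values$p_holm <- p.adjust(p_values$p, method = "holm")
p_values$p_BH <- p.adjust(p_values$p, method = "BH")
p_values
```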
### Multivariate testing with Hotelling's $T^2$ {#hotellingst2}
If you are interested in comparing the means of several variables for two groups, using a multivariate test is sometimes a better option than running multiple univariate t-tests. The multivariate generalisation of the t-test, Hotelling's $T^2$, is available through the `Hotelling`\index{\texttt{Hotelling}} package:
```{r eval=FALSE}
install.packages("Hotelling")
```
As an example, consider the `airquality` data. Let's say that we want to test whether the mean ozone, solar radiation, wind speed, and temperature differ between June and July. We could use four separate t-tests to test this, but we could also use Hotelling's $T^2$ to test the null hypothesis that the mean vector, i.e. the vector containing the four means, is the same for both months. The function used for this is `hotelling.test`\index{\texttt{hotelling.test}}:
```{r eval=FALSE}
# Subset the data:
airquality_t2 <- subset(airquality, Month == 6 | Month == 7)
# Run the test under the assumption of normality:
library(Hotelling)
t2 <- hotelling.test(Ozone + Solar.R + Wind + Temp ~ Month,
                     data = airquality_t2)
t2
# Run a permutation test instead:
t2 <- hotelling.test(Ozone + Solar.R + Wind + Temp ~ Month,
                     data = airquality_t2, perm = TRUE)
t2
```
### Sample size computations for the t-test
In any study, it is important to collect enough data for the inference that we wish to make. If we want to use a t-test for a test about a mean or the difference of two means, what constitutes "enough data" is usually measured by the power of the test. The sample is large enough when the test achieves high enough power. If we are comfortable assuming normality (and we may well be, especially as the main goal with sample size computations is to get a ballpark figure), we can use `power.t.test`\index{\texttt{power.t.test}} to compute what power our test would achieve under different settings. For a two-sample test with unequal variances, we can use `power.welch.t.test`\index{\texttt{power.welch.t.test}} from `MKpower`\index{\texttt{MKpower}} instead. Both functions can be used to either find the sample size required for a certain power, or to find out what power will be obtained from a given sample size.
First of all, let's install `MKpower`:
```{r eval=FALSE}
install.packages("MKpower")
```
`power.t.test` and `power.welch.t.test` both use `delta` to denote the mean difference under the alternative hypothesis. In addition, we must supply the standard deviation `sd` of the distribution (or `sd1` and `sd2` for the Welch test, which allows the two groups to have different standard deviations). Here are some examples:
```{r eval=FALSE}
library(MKpower)
# A one-sided one-sample test with 80 % power:
power.t.test(power = 0.8, delta = 1, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
# A two-sided two-sample test with sample size n = 25 and equal
# variances:
power.t.test(n = 25, delta = 1, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# A one-sided two-sample test with 90 % power and equal variances:
power.t.test(power = 0.9, delta = 1, sd = 0.5, sig.level = 0.01,
             type = "two.sample", alternative = "one.sided")
# A one-sided two-sample test with 90 % power and unequal variances:
power.welch.t.test(power = 0.9, delta = 1, sd1 = 0.5, sd2 = 1,
                   sig.level = 0.01,
                   type = "two.sample", alternative = "one.sided")
```
You may wonder how to choose `delta` and `sd`. If possible, it is good to base these numbers on a pilot study or related previous work. If no such data is available, your guess is as good as mine. For `delta`, some useful terminology comes from medical statistics, where the concept of _clinical significance_ is used increasingly often. Make sure that `delta` is large enough to be clinically significant, that is, large enough to actually matter in practice.
If we have reason to believe that the data follows a non-normal distribution, another option is to use simulation to compute the sample size that will be required. We'll do just that in Section \@ref(sscus).
```{exercise, label="ch7exc7"}
Return to the one-sided t-test that you performed in Exercise \@ref(exr:ch7exc5). Assume that `delta` is 0.5 (i.e. that the true mean is 6.5) and that the standard deviation is 2. How large does the sample size $n$ have to be for the power of the test to be 95 % at a 5 % significance level? What is the power of the test when the sample size is $n=2,351$?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions7)
### A Bayesian approach
The Bayesian paradigm differs in many ways from the frequentist approach that we use in the rest of this chapter. In Bayesian statistics, we first define a _prior distribution_ for the parameters that we are interested in, representing our beliefs about them (for instance based on previous studies). Bayes' theorem is then used to derive the _posterior distribution_, i.e. the distribution of the parameters given the prior distribution and the data. Philosophically, this is very different from frequentist estimation, in which we don't incorporate prior beliefs into our models (except through which variables we include).
In many situations, we don't have access to data that can be used to create an _informative_ prior distribution. In such cases, we can use a so-called weakly informative prior instead. These act as "default priors", representing large uncertainty about the values of the parameters.
The `rstanarm` package contains methods for using Bayesian estimation to fit some common statistical models. It takes a while to install, but it is well worth it:
```{r eval=FALSE}
install.packages("rstanarm")
```
To use a Bayesian model with a weakly informative prior to analyse the difference in sleeping time between herbivores and carnivores, we load `rstanarm` and use `stan_glm`\index{\texttt{stan\_glm}} in much the same way as we used `t.test` with formula notation:
```{r eval=FALSE}
library(rstanarm)
library(ggplot2)
m <- stan_glm(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
# Print the estimates:
m
```
There are two estimates here: an "intercept" (the average sleeping time for carnivores) and `voreherbi` (the difference between carnivores and herbivores). To plot the posterior distribution of the difference, we can use `plot`:
```{r eval=FALSE}
plot(m, "dens", pars = c("voreherbi"))
```
To get a 95 % credible interval (the Bayesian equivalent of a confidence interval) for the difference, we can use `posterior_interval`\index{\texttt{posterior\_interval}} as follows:
```{r eval=FALSE}
posterior_interval(m,
pars = c("voreherbi"),
prob = 0.95)
```
p-values are not a part of Bayesian statistics, so don't expect any. It is however possible to perform a kind of Bayesian test of whether there is a difference by checking whether the credible interval for the difference contains 0. If not, there is evidence that there is a difference (Thulin, 2014c). In this case, 0 is contained in the interval, and there is no evidence of a difference.
In most cases, Bayesian estimation is done using Monte Carlo integration (specifically, a class of methods known as Markov Chain Monte Carlo, MCMC). To check that the model fitting has converged, we can use a measure called $\hat{R}$. It should be less than 1.1 if the fitting has converged:
```{r eval=FALSE}
plot(m, "rhat")
```
If the model fitting hasn't converged, you may need to increase the number of iterations of the MCMC algorithm. You can increase the number of iterations by adding the argument `iter` to `stan_glm` (the default is 2,000).
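For instance, to double the number of iterations for the model above, something like the following can be used:
```{r eval=FALSE}
m <- stan_glm(sleep_total ~ vore, data =
                subset(msleep, vore == "carni" | vore == "herbi"),
              iter = 4000)
```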
If you want to use a custom prior for your analysis, that is of course possible too. See `?priors` and `?stan_glm` for details about this, and about the default weakly informative prior.
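As a sketch of what that can look like, here is the same model with an arbitrarily chosen, purely illustrative normal prior (mean 0, scale 2) for the coefficient:
```{r eval=FALSE}
m_custom <- stan_glm(sleep_total ~ vore, data =
                       subset(msleep, vore == "carni" | vore == "herbi"),
                     prior = normal(location = 0, scale = 2))
```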
## Other common hypothesis tests and confidence intervals
There are thousands of statistical tests in addition to the t-test, and equally many methods for computing confidence intervals for different parameters. In this section we will have a look at some useful tools: the nonparametric Wilcoxon-Mann-Whitney test for location, tests for correlation, $\chi^2$-tests for contingency tables, and confidence intervals for proportions.
### Nonparametric tests of location
The Wilcoxon-Mann-Whitney test, `wilcox.test` in R\index{\texttt{wilcox.test}}, is a nonparametric alternative to the t-test that is based on ranks. `wilcox.test` is used in the same way as `t.test`.
We can use two vectors as input:
```{r eval=FALSE}
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
wilcox.test(carnivores$sleep_total, herbivores$sleep_total)
```
Or use a formula:
```{r eval=FALSE}
wilcox.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
```
### Tests for correlation
To test the null hypothesis that two numerical variables are uncorrelated (against the alternative that they are correlated), we can use `cor.test`\index{\texttt{cor.test}}\index{hypothesis test!correlation}. Let's try it with sleeping times and brain weight, using the `msleep` data again:
```{r eval=FALSE}
library(ggplot2)
cor.test(msleep$sleep_total, msleep$brainwt,
use = "pairwise.complete")
```
Pairs where either value is `NA` are ignored in the test; `cor.test` drops incomplete pairs automatically (the `use = "pairwise.complete"` setting mirrors the syntax of `cor` and mainly serves to make the intent explicit).
`cor.test` doesn't have a `data` argument, so if you want to use it in a pipeline I recommend using the `%$%` pipe (Section \@ref(morepipes)) to pass on the vectors from your data frame:
```{r eval=FALSE}
library(magrittr)
msleep %$% cor.test(sleep_total, brainwt, use = "pairwise.complete")
```
The test we just performed uses the Pearson correlation coefficient as its test statistic. If you prefer, you can use the nonparametric Spearman and Kendall correlation coefficients in the test instead, by changing the value of `method`:
```{r eval=FALSE}
# Spearman test of correlation:
cor.test(msleep$sleep_total, msleep$brainwt,
use = "pairwise.complete",
method = "spearman")
```
These tests are all based on asymptotic approximations, which among other things causes the Pearson correlation test to perform poorly for non-normal data. In Section \@ref(bootstrap) we will create a bootstrap version of the correlation test, which has better performance.
### $\chi^2$-tests
$\chi^2$ (chi-squared) tests are most commonly used to test whether two categorical variables are independent. To run such a test, we must first construct a contingency table, i.e. a table showing the counts for different combinations of categories, typically using `table`. Here is an example with the `diamonds` data from `ggplot2`:
```{r eval=FALSE}
library(ggplot2)
table(diamonds$cut, diamonds$color)
```
The null hypothesis of our test is that the quality of the cut (`cut`) and the colour of the diamond (`color`) are independent, with the alternative being that they are dependent. We use `chisq.test`\index{\texttt{chisq.test}}\index{hypothesis test!independence}\index{hypothesis test!chi-squared} with the contingency table as input to run the $\chi^2$ test of independence:
```{r eval=FALSE}
chisq.test(table(diamonds$cut, diamonds$color))
```
By default, `chisq.test` uses an asymptotic approximation of the p-value. For small sample sizes, it is often better to use simulation-based (permutation) p-values by setting `simulate.p.value = TRUE` (here the sample is not small, though, so the computation will take a while):
```{r eval=FALSE}
chisq.test(table(diamonds$cut, diamonds$color),
simulate.p.value = TRUE)
```
As with `t.test`, we can use pipes to perform the test if we like:
```{r eval=FALSE}
library(magrittr)
diamonds %$% table(cut, color) %>%
chisq.test()
```
If both of the variables are binary, i.e. only take two values, the power of the test can be approximated using `power.prop.test`\index{\texttt{power.prop.test}}. Let's say that we have two variables, $X$ and $Y$, taking the values 0 and 1. Assume that we collect $n$ observations with $X=0$ and $n$ with $X=1$. Furthermore, let `p1` be the probability that $Y=1$ if $X=0$ and `p2` be the probability that $Y=1$ if $X=1$. We can then use `power.prop.test` as follows:
```{r eval=FALSE}
# Assume that n = 50, p1 = 0.4 and p2 = 0.5 and compute the power:
power.prop.test(n = 50, p1 = 0.4, p2 = 0.5, sig.level = 0.05)
# Assume that p1 = 0.4 and p2 = 0.5 and that we want 85 % power.
# To compute the sample size required:
power.prop.test(power = 0.85, p1 = 0.4, p2 = 0.5, sig.level = 0.05)
```
### Confidence intervals for proportions {#confprop}
The different t-test functions provide confidence intervals for means and differences of means. But what about proportions? The `binomCI`\index{\texttt{binomCI}}\index{confidence interval!proportion} function in the `MKinfer` package allows us to compute confidence intervals for proportions from binomial experiments using a number of methods. The input is the number of "successes" `x`, the sample size `n`, and the `method` to be used.
Let's say that we want to compute a confidence interval for the proportion of herbivore mammals that sleep for more than 7 hours a day.
```{r eval=FALSE}
library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",]
# Compute the number of animals for which we know the sleeping time:
n <- sum(!is.na(herbivores$sleep_total))
# Compute the number of "successes", i.e. the number of animals
# that sleep for more than 7 hours:
x <- sum(herbivores$sleep_total > 7, na.rm = TRUE)
```
The estimated proportion is `x/n`, which in this case is 0.625. We'd like to quantify the uncertainty in this estimate by computing a confidence interval. The standard Wald interval, taught in most introductory courses, can be computed as follows:
```{r eval=FALSE}
library(MKinfer)
binomCI(x, n, conf.level = 0.95, method = "wald")
```
Don't do that though! The Wald interval is known to be severely flawed (Brown et al., 2001), and much better options are available. If the proportion can be expected to be close to 0 or 1, the Clopper-Pearson interval is recommended, and otherwise the Wilson interval is the best choice (Thulin, 2014a):
```{r eval=FALSE}
binomCI(x, n, conf.level = 0.95, method = "clopper-pearson")
binomCI(x, n, conf.level = 0.95, method = "wilson")
```
An excellent Bayesian credible interval is the Jeffreys interval, which uses the weakly informative Jeffreys prior:
```{r eval=FALSE}
binomCI(x, n, conf.level = 0.95, method = "jeffreys")
```
The `ssize.propCI`\index{\texttt{ssize.propCI}} function in `MKpower` can be used to compute the sample size needed to obtain a confidence interval with a given width^[Or rather, a given _expected_, or average, width. The width of the interval is a function of a random variable, and is therefore also random.]. It relies on asymptotic formulas that are highly accurate, as you later on will verify in Exercise \@ref(exr:ch7excCP).
```{r eval=FALSE}
library(MKpower)
# Compute the sample size required to obtain an interval with
# width 0.1 if the true proportion is 0.4:
ssize.propCI(prop = 0.4, width = 0.1, method = "wilson")
ssize.propCI(prop = 0.4, width = 0.1, method = "clopper-pearson")
```
$$\sim$$
```{exercise, label="ch7exc8"}
The function `binomDiffCI`\index{\texttt{binomDiffCI}}\index{confidence interval!difference of proportions} from `MKinfer` can be used to compute a confidence interval for the _difference_ of two proportions. Using the `msleep` data, use it to compute a confidence interval for the difference between the proportion of herbivores that sleep for more than 7 hours a day and the proportion of carnivores that sleep for more than 7 hours a day.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch7solutions8)
## Ethical issues in statistical inference {#ethicsinference}
The use and misuse of statistical inference offer many ethical dilemmas. Some common issues related to ethics and good statistical practice are discussed below. As you read them and work with the associated exercises, consider consulting the ASA's ethical guidelines, presented in Section \@ref(ethicalguidelines).
### p-hacking and the file-drawer problem
Hypothesis tests are easy to misuse. If you run enough tests on your data, you are almost guaranteed to end up with significant results - either due to chance or because some of the null hypotheses you test are false. The process of trying lots of different tests (different methods, different hypotheses, different sub-groups) in search of significant results is known as _p-hacking_ or _data dredging_\index{p-hacking}. This greatly increases the risk of false findings, and can often produce misleading results.
Many practitioners inadvertently resort to p-hacking, by mixing exploratory data analysis and hypothesis testing, or by coming up with new hypotheses to test as they work with their data. This can be avoided by planning your analyses in advance, a practice that in fact is required in medical trials.
On the other end of the spectrum, there is the _file-drawer problem_, in which studies with negative (i.e. not statistically significant) results aren't published or reported, but instead are stored in the researcher's file-drawers. There are many reasons for this, one being that negative results usually are seen as less important and less worthy of spending time on. Simply put, negative results just aren't news. If your study shows that eating kale every day significantly reduces the risk of cancer, then that is news, something that people are interested in learning, and something that can be published in a prestigious journal. However, if your study shows that a daily serving of kale has no impact on the risk of cancer, that's not news, people aren't really interested in hearing it, and it may prove difficult to publish your findings.
But what if 100 different researchers carried out the same study? If eating kale doesn't affect the risk of cancer, then we can still expect 5 out of these researchers to get significant results (using a 5 % significance level). If only those researchers publish their results, that may give the impression that there is strong evidence of the cancer-preventing effect of kale, backed up by several papers, even though the majority of studies actually indicated that there was no such effect.
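As a hypothetical back-of-the-envelope check, the number of significant studies among the 100 can be modelled as binomial, assuming that the studies are independent and all test a true null hypothesis at the 5 % level:
```{r eval=FALSE}
# Number of significant studies ~ Bin(100, 0.05):
100 * 0.05                          # Expected number of false positives
pbinom(0, size = 100, prob = 0.05,
       lower.tail = FALSE)          # Probability of at least one (about 0.99)
```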
$$\sim$$
```{exercise, label="ch7ethics1"}
_Discuss the following._ You are helping a research team with statistical analysis of data that they have collected. You agree on five hypotheses to test. None of the tests turns out significant. Fearing that all their hard work won't lead anywhere, your collaborators then ask you to carry out five new tests. None of these turns out significant. Your collaborators closely inspect the data and then ask you to carry out ten more tests, two of which are significant. The team wants to publish these significant results in a scientific journal. Should you agree to publish them? If so, what results should be published? Should you have put your foot down and told them not to run more tests? Does your answer depend on how long it took the research team to collect the data? What if the team won't get funding for new projects unless they publish a paper soon? What if other research teams competing for the same grants do their analyses like this?
```
<br>
```{exercise, label="ch7ethics2"}
_Discuss the following._ You are working for a company that is launching a new product, a hair-loss treatment. In a small study, the product worked for 19 out of 22 participants (86 %). You compute a 95 % Clopper-Pearson confidence interval (Section \@ref(confprop)) for the proportion of successes and find that it is (0.65, 0.97). Based on this, the company wants to market the product as being 97 % effective. Is that acceptable to you? If not, how should it be marketed? Would your answer change if the product was something else (new running shoes that make you faster, a plastic film that protects smartphone screens from scratches, or contraceptives)? What if the company wanted to market it as being 86 % effective instead?
```
<br>
```{exercise, label="ch7ethics3"}
_Discuss the following._ You have worked long and hard on a project. In the end, to see if the project was a success, you run a hypothesis test to check if two variables are correlated. You find that they are not (p = 0.15). However, if you remove three outliers, the two variables are significantly correlated (p = 0.03). What should you do? Does your answer change if you only have to remove one outlier to get a significant result? If you have to remove ten outliers? 100 outliers? What if the p-value is 0.051 before removing the outliers and 0.049 after removing the outliers?
```
<br>
```{exercise, label="ch7ethics4"}
_Discuss the following._ You are analysing data from an experiment to see if there is a difference between two treatments. You estimate^[We'll discuss methods for producing such estimates in Section \@ref(simpower).] that given the sample size and the expected difference in treatment effects, the power of the test that you'll be using, i.e. the probability of rejecting the null hypothesis if it is false, is about 15 %. Should you carry out such an analysis? If not, how high does the power need to be for the analysis to be meaningful?
```
### Reproducibility
An analysis is _reproducible_\index{reproducibility} if it can be reproduced by someone else. By producing reproducible analyses, we make it easier for others to scrutinise our work. We also make all the steps in the data analysis transparent. This can act as a safeguard against data fabrication and data dredging.
In order to make an analysis reproducible, we need to provide at least two things. First, _the data_ - all unedited data files in their original format. This also includes _metadata_ with information required to understand the data (e.g. codebooks explaining variable names and codes used for categorical variables). Second, _the computer code_ used to prepare and analyse the data. This includes any wrangling and preliminary testing performed on the data.
As long as we save our data files and code, data wrangling and analyses in R are inherently reproducible, in contrast to the same tasks carried out in menu-based software such as Excel. However, if reports are created using a word processor, there is always a risk that something will be lost along the way. Perhaps numbers are copied by hand (which may introduce errors), or maybe the wrong version of a figure is pasted into the document. R Markdown (Section \@ref(rmarkdown)) is a great tool for creating completely reproducible reports, as it allows you to integrate R code for data wrangling, analyses, and graphics in your report-writing. This reduces the risk of manually inserting errors, and allows you to share your work with others easily.
$$\sim$$
```{exercise, label="ch7ethics5"}
_Discuss the following._ You are working on a study at a small-town hospital. The data involves biomarker measurements for a number of patients, and you show that patients with a sexually transmitted disease have elevated levels of some of the biomarkers. The data also includes information about the patients: their names, ages, ZIP codes, heights, and weights. The research team wants to publish your results and make the analysis reproducible. Is it ethically acceptable to share all your data? Can you make the analysis reproducible without violating patient confidentiality?
```
## Evaluating statistical methods using simulation {#simeval}
An important use of simulation is in the evaluation of statistical methods. In this section, we will see how simulation can be used to compare the performance of different estimators, and to study the type I error rate and power of hypothesis tests.
### Comparing estimators
Let's say that we want to estimate the mean $\mu$ of a normal distribution. We could come up with several different estimators for $\mu$:
* The sample mean $\bar{x}$,
* The sample median $\tilde{x}$,
* The average of the largest and smallest value in the sample: $\frac{x_{max}+x_{min}}{2}$.
In this particular case (under normality), statistical theory tells us that the sample mean is the best estimator^[At least in terms of mean squared error.]. But how much better is it, really? And what if we didn't know statistical theory - could we use simulation to find out which estimator to use?
To begin with, let's write a function that computes the estimate $\frac{x_{max}+x_{min}}{2}$:
```{r eval=FALSE}
max_min_avg <- function(x)
{
      return((max(x) + min(x))/2)
}
```
Next, we'll generate some data from a $N(0,1)$ distribution and compute the three estimates:
```{r eval=FALSE}
x <- rnorm(25)
x_mean <- mean(x)
x_median <- median(x)
x_mma <- max_min_avg(x)
x_mean; x_median; x_mma
```
As you can see, the estimates given by the different approaches differ, so clearly the choice of estimator matters. We can't determine which to use based on a single sample though. Instead, we typically compare the long-run properties of estimators, such as their _bias_ and _variance_. The bias is the difference between the mean of the estimator and the parameter it seeks to estimate. An estimator is _unbiased_ if its bias is 0, which is considered desirable at least in this setting. Among unbiased estimators, we prefer the one that has the smallest variance. So how can we use simulation to compute the bias and variance of estimators?\index{simulation!bias and variance}
The key to using simulation here is to realise that `x_mean` is an observation of the random variable $\bar{X}= \frac{1}{25}(X_1+X_2+\cdots+X_{25})$ where each $X_i$ is $N(0, 1)$-distributed. We can generate observations of $X_i$ (using `rnorm`), and can therefore also generate observations of $\bar{X}$. That means that we can obtain an arbitrarily large sample of observations of $\bar{X}$, which we can use to estimate its mean and variance. Here is an example:
```{r eval=FALSE}
# Set the parameters for the normal distribution:
mu <- 0
sigma <- 1
# We will generate 10,000 observations of the estimators: