---
title: "Homework 4 - BSTA 522"
author: "Matthew Hoctor"
date: "1/25/2022"
output:
  html_document:
    number_sections: no
    theme: lumen
    toc: yes
    toc_float:
      collapsed: yes
      smooth_scroll: no
  pdf_document:
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# library(dplyr)
# library(readxl)
# library(tidyverse)
# library(ggplot2)
# library(CarletonStats)
# library(pwr)
# library(BSDA)
# library(exact2x2)
# library(car)
# library(dvmisc)
# library(emmeans)
# library(gridExtra)
# library(DescTools)
# library(DiagrammeR)
# library(nlme)
# library(doBy)
# library(geepack)
# library(rje)
library(ISLR2)
# library(psych)
# library(MASS)
# library(caret) #required for confusionMatrix function
# library(rje)
# library(class) #required for the knn function
# library(e1071) #required for the naiveBayes function
library(boot) #required for the boot function
```
# Part C
Chap 5: #2, #9 (use B=1000 and set.seed(1237) in all questions)
## 5.2
We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.
### a
What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.
By the law of total probability, the probability that the jth observation is the first bootstrap observation and the probability that it is not must sum to one. Each of the $n$ observations has an equal probability of being drawn first, so the probability that the first bootstrap observation is the jth observation from the original sample is $\frac{1}{n}$. Therefore the probability that the first bootstrap observation is not the jth observation from the original sample is:
$$1- \frac{1}{n} = \frac{n-1}{n}$$
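As a quick sanity check, we can simulate many first draws and compare against $\frac{n-1}{n}$ (a minimal sketch; the values of `n` and `j` here are arbitrary):
```{r}
set.seed(1237)
n <- 5
j <- 3                                               # hypothetical choice of j
first <- sample(1:n, size = 10000, replace = TRUE)   # 10,000 independent first draws
mean(first != j)                                     # should be close to (n - 1)/n = 0.8
```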
### b
What is the probability that the second bootstrap observation is not the jth observation from the original sample?
Bootstrap sampling is done with replacement, so every draw has the same distribution. Therefore the probability that the second bootstrap observation (indeed, any bootstrap observation) is not the jth observation from the original sample equals the probability found in part a:
$$\frac{n-1}{n}$$
### c
Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.
Because sampling is with replacement, the draws are independent, so the probability that the jth observation is not in the bootstrap sample is the product of the probabilities that it is not selected on each individual draw. A bootstrap sample contains the same number of observations, $n$, as the original sample, so this probability is:
$$\mbox{P}[(x_j, y_j)\notin Z^*] = \prod_{i=1}^n \mbox{P}[(x_i^*, y_i^*) \neq (x_j, y_j)] = \prod_{i=1}^n (1- \frac{1}{n}) = (1- \frac{1}{n})^n$$
where $Z^*$ denotes the set of observations in the bootstrap sample, $(x_j, y_j)$ denotes the jth observation in the original sample, and $(x_i^*, y_i^*)$ denotes the ith observation in the bootstrap sample.
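As a numerical check of this product formula, we can simulate whole bootstrap samples for a small $n$ (a minimal sketch; `n` and `j` are again arbitrary):
```{r}
set.seed(1237)
n <- 5
j <- 3                                                         # hypothetical j
not_in <- replicate(10000, !(j %in% sample(1:n, replace = TRUE)))
mean(not_in)      # simulated P(jth observation not in the bootstrap sample)
(1 - 1/n)^n       # analytical value
```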
### d
When n = 5, what is the probability that the jth observation is in the bootstrap sample?
We can compute this probability with the following code:
```{r}
n <- 5
1 - (1 - 1/n)^n   # complement of the "not in sample" probability from part c
```
### e
When n = 100, what is the probability that the jth observation is in the bootstrap sample?
We can compute this probability with the following code:
```{r}
n <- 100
1 - (1 - 1/n)^n
```
### f
When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?
We can compute this probability with the following code:
```{r}
n <- 10000
1 - (1 - 1/n)^n
```
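Equivalently, parts d through f can be collapsed into a single vectorized call (an optional compact check):
```{r}
n <- c(5, 100, 10000)
setNames(1 - (1 - 1/n)^n, n)
```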
### g
Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.
```{r}
n <- 1:100000
probability <- rep(NA, 100000)
for (i in n) {
  # P(jth observation is in a bootstrap sample of size i)
  probability[i] <- 1 - (1 - 1/i)^i
}
plot(n, probability)
abline(h = 1 - exp(-1),
       col = "red")
```
Observation: the probability rises quickly and converges to $1 - e^{-1} \approx 0.632$ (the red line), remaining essentially flat for large $n$.
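The convergence is even easier to see if the same values are replotted with a log-scaled x-axis (an optional extra plot):
```{r}
plot(n, probability, log = "x", type = "l",
     xlab = "n (log scale)",
     ylab = "P(jth observation in sample)")
abline(h = 1 - exp(-1), col = "red")
```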
### h
We will now investigate numerically the probability that a bootstrap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample.
```{r}
set.seed(1237)
store <- rep(NA, 10000)
for (i in 1:10000) {
  # does observation 4 appear in a bootstrap sample of size 100?
  store[i] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
```
This answer is very close to the analytical solution:
```{r}
probability[100]
```
Note that it can be shown analytically that:
$$\lim_{n \rightarrow \infty} \left( \frac{n-1}{n} \right)^n = e^{-1} \approx 0.3679$$
so the probability that the jth observation is in the bootstrap sample converges to $1 - e^{-1} \approx 0.6321$, which agrees with both the plateau in the plot from part g and the simulated estimate above.
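For reference, this limit follows from the series expansion of $\ln(1 - \frac{1}{n})$:
$$\left( 1 - \frac{1}{n} \right)^n = \exp \left[ n \ln \left( 1 - \frac{1}{n} \right) \right] = \exp \left[ n \left( -\frac{1}{n} - \frac{1}{2n^2} - \cdots \right) \right] \rightarrow e^{-1} \quad \mbox{as } n \rightarrow \infty$$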
## 5.9
### Setup
Set seed:
```{r}
set.seed(1237)
```
Boston dataset:
```{r}
boston <- Boston
```
Set number of bootstrapping iterations:
```{r}
B <- 1000
```
Helper functions:
```{r}
# Standard error of the mean:
std <- function(x) {sd(x)/sqrt(length(x))}
# Standard error of the mean, indexed for use with boot():
std2 <- function(x, i) {sd(x[i])/sqrt(length(x[i]))}
# Mean, indexed for use with boot():
mean2 <- function(x, i) {mean(x[i])}
```
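For reference, `boot()` works by repeatedly calling `statistic(data, indices)` with resampled index vectors; a single such call looks like this (a minimal illustration; `idx` is a hypothetical name):
```{r}
idx <- sample(seq_along(boston$medv), replace = TRUE)   # one bootstrap index vector
mean2(boston$medv, idx)                                 # mean of that one resample
```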
### a
We can estimate $\mu$, the mean of the median home values (`medv`, in \$1000s) across the 506 suburbs of Boston:
```{r}
mean(boston$medv)
mean2(boston$medv, i = 1:length(boston$medv))
```
$\hat{\mu} = \$22,532.81$
### b
We can compute the standard error in the conventional way:
```{r}
std2(boston$medv, i = 1:length(boston$medv))
std(boston$medv)
```
This tells us that, on average, the estimate of the mean will differ from the true population mean by about \$409.
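This is the usual plug-in formula (in units of \$1000s):
$$\widehat{\mbox{SE}}(\hat{\mu}) = \frac{s}{\sqrt{n}} = \frac{\mbox{sd}(\mbox{medv})}{\sqrt{506}} \approx 0.409$$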
### c
We can estimate the standard error using the bootstrap. One approach is to bootstrap the plug-in statistic `std2` itself (note that this shows how the analytical standard-error estimate varies across resamples, rather than giving the bootstrap SE of the mean):
```{r}
set.seed(1237)
boot(boston$medv,
statistic = std2,
R = B)
```
Alternate method: bootstrap the mean itself, in which case the `std. error` reported by `boot()` is the standard deviation of the $B$ replicate means, i.e. the bootstrap estimate of the standard error:
```{r}
set.seed(1237)
boot(boston$medv,
statistic = mean2,
R = B)
```
The bootstrapped standard error of the mean, from the `mean2` run, is 0.399, which is quite close to the analytical estimate found in part b.
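The reported `std. error` is simply the standard deviation of the $B$ replicate means, which can be recovered directly from the stored boot object (a quick check; `b` is a hypothetical name for the result):
```{r}
set.seed(1237)
b <- boot(boston$medv,
          statistic = mean2,
          R = B)
sd(b$t[, 1])   # SD of the B replicate means = bootstrap SE
```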
### d
A 95% CI for $\hat{\mu}$ can be calculated from the bootstrapped standard error:
```{r}
22.53281 + 1.96 * 0.3989197   # upper limit
22.53281 - 1.96 * 0.3989197   # lower limit
```
Conventional 95% CI:
```{r}
t.test(boston$medv)
```
The bootstrapped 95% CI is $(21.75, 23.31)$, whereas the analytical 95% CI is $(21.73, 23.34)$; these are quite close.
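The `boot` package can also compute confidence intervals directly from a boot object via `boot.ci()` (a sketch, reusing the hypothetical `b` object from above):
```{r}
set.seed(1237)
b <- boot(boston$medv, statistic = mean2, R = B)
boot.ci(b, type = c("norm", "perc"))   # normal-approximation and percentile CIs
```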
### e
The median of the suburb median home values, $\hat{\mu}_{med}$:
```{r}
median(boston$medv)
```
### f
We can bootstrap the standard error of $\hat{\mu}_{med}$; unlike for the mean, there is no simple analytical formula for it:
```{r}
set.seed(1237)
median2 <- function(x,i) {median(x[i])}
boot(boston$medv,
statistic = median2,
R = B)
```
The standard error of $\hat{\mu}_{med}$ is 0.370; this is similar in magnitude to the standard error of $\hat{\mu}$ found in part c.
### g
We can estimate the tenth percentile of the median home value, $\hat{\mu}_{0.10}$, with the following code:
```{r}
quantile(boston$medv,
probs = c(0.10))
```
The tenth percentile of median home value is $12,750.00.
### h
We can use the bootstrap to estimate the standard error of $\hat{\mu}_{0.10}$:
```{r}
set.seed(1237)
quantile2 <- function(x,i) {quantile(x[i], probs = c(0.10))}
boot(boston$medv,
statistic = quantile2,
R = B)
```
We find that the bootstrapped standard error of $\hat{\mu}_{0.10}$ is 0.493. In absolute terms this is not much larger than the standard errors of the mean and the median; relative to the size of the estimate, however, it is noticeably greater.
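A back-of-the-envelope comparison of relative standard errors (each SE reported above divided by its point estimate) makes this concrete:
```{r}
0.399 / mean(boston$medv)             # mean: relative SE
0.370 / median(boston$medv)           # median: relative SE
0.493 / quantile(boston$medv, 0.10)   # 10th percentile: relative SE
```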
# Part D
Why do we want to use set.seed(1237) in #9 above?
Setting a particular seed ensures reproducibility of any work that relies on random number generation. If no seed is specified, the current state of `.Random.seed` determines which random numbers are drawn, so output such as the bootstrap estimates above would differ each time the code is run.
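A minimal illustration of this behavior:
```{r}
set.seed(1237); sample(1:10, 3)   # some triple of numbers
set.seed(1237); sample(1:10, 3)   # identical: same seed, same RNG state
sample(1:10, 3)                   # different: the RNG state has advanced
```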
# Session Info
```{r}
sessionInfo()
```
# References
1. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 1st ed., corrected 7th printing. Springer; 2013.