# Analysis {.unnumbered}
```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
```
Here you will find extra information on analyses.
## Hypothesis testing and Confidence Intervals
Comparison is the most common and most basic principle in medical research: a statement about the truth is compared against a reference statement (the null hypothesis).
- $H_0$: Null hypothesis, e.g. cholesterol is comparable between men and women
- $H_1$: Alternative hypothesis, e.g. men and women differ on average in cholesterol
The p-value is the probability of obtaining the observed data in the sample (or something more extreme than the observed data), given that the null hypothesis is true. The
p-value is calculated based on a point estimate (e.g. mean) of the sample. The decision
to reject a null hypothesis based on the p-value depends on a chosen $\alpha$ level.
```{r , echo = FALSE, out.height="90%", fig.align="center"}
include_graphics("Pictures/Hypothesistest.png")
```
For a given dataset and corresponding test statistic, we say the results of the test are
"statistically significant" if the p-value is less than the pre-selected and fixed significance
level, $\alpha$ (often 0.05).
Note that there is a distinction between statistical significance and clinical relevance:
- If the sample size is large enough, even a small difference of 0.1 mmHg blood pressure between groups can be statistically significant, though it is not relevant from a clinical point of view.
- If the sample size is too small, even a sample mean of 150 mmHg may not be statistically significantly different from 130 mmHg, though a 20 mmHg difference is clinically relevant.
A confidence interval (CI) is another way to show the reliability of a point estimate.
The decision to reject or not reject the null hypothesis aligns with whether or not the
CI contains the null value (e.g. $H_0$: mean = 0; if CI does not contain 0 then reject,
otherwise do not reject $H_0$). In other words, the decision made by comparing the test
statistic's p-value to $\alpha$ will be the same as a decision made using a $(1 - \alpha) * 100\%$ CI.
An interpretation of e.g. a 95% CI is "if the sampling and estimation procedure were repeated on many (k) samples, the confidence intervals would encompass the true population parameter in 95% of the k samples" or, more loosely, "we are 95% confident that the true value [e.g. the true mean] lies in the confidence interval".
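As a minimal sketch of this correspondence (using simulated values rather than the course data), the decision based on the p-value and the decision based on the CI always agree:
```{r echo = T, eval=T}
set.seed(42)
chol <- rnorm(30, mean = 5.4, sd = 1)          # hypothetical cholesterol values
tt <- t.test(chol, mu = 5, conf.level = 0.95)  # H0: true mean = 5
tt$p.value                                     # compare against alpha = 0.05
tt$conf.int                                    # does the 95% CI contain 5?
```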
## Comparing groups
### t-Test
The t-test is a statistical procedure used to test for the difference in means between
two independent populations. The samples should come from normal distributions (can
check using e.g. `qqnorm()`) and the variances from each population are assumed to be
equal.
```{r , echo = FALSE, out.height="60%", fig.align="center"}
include_graphics("Pictures/ttest.png")
```
By default, the t-test in R (`t.test()`) applies Welch's correction, which does not require equal variances in the two populations. The null hypothesis is that the difference between the means is 0; the alternative is that this difference is not 0.
Here is an example of how to do this test in R. In a data set on tooth growth, it appears that
OJ recipients in the sample had a larger tooth growth than VC recipients, on
average. We can test if this is true in the larger population using a t-test. Let's set $H_0$ to
be that the two mean tooth lengths are the same and $H_a$ will be that the two mean tooth
lengths are not equal. Let's use a 0.10 significance level, which means the probability we
falsely reject the null is 0.10. Note that the function t.test also provides a confidence
interval - the confidence level specified should be $1 - \alpha$ for the interpretations to align.
```{r echo = T, eval=T}
data("ToothGrowth")
ttooth <- t.test(ToothGrowth$len~ToothGrowth$supp, conf.level = 0.90)
ttooth
```
Here the p-value is less than our chosen significance level so we reject the null.
Therefore, at the 10% significance level, the data provide sufficient evidence that our
alternative hypothesis, that there is a difference between using OJ and VC to grow teeth,
is true.
Now consider the following example. Suppose we believe the mean reaction velocity
in an enzymatic reaction will be higher in cells treated with Puromycin compared to
cells not treated with Puromycin. Let's use the 0.05 significance level, allowing for a
slightly smaller Type I error probability than in the previous example.
```{r echo = T, eval=T}
data("Puromycin")
t.test(rate ~ state, data = Puromycin, alternative = "greater")
```
Here our p-value is greater than $\alpha$ so we do not reject the null hypothesis. We
therefore conclude that, at the 5% significance level, the data do not provide sufficient
evidence that Puromycin increases the mean reaction velocity.
We can extract statistics from the output:
```{r echo = T, eval=T}
mytest <- t.test(rate~state, data = Puromycin, alternative = "greater")
mytest$p.value
mytest$conf.int
```
Can you guess why the CI goes to `Inf` (infinity)?
If samples are matched or paired (e.g. before/after), use argument `paired=TRUE`.
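For example, here is a minimal sketch using R's built-in `sleep` data (two measurements of extra sleep on the same 10 subjects):
```{r echo = T, eval=T}
data("sleep")
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]
t.test(drug1, drug2, paired = TRUE)  # paired t-test on the within-subject differences
```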
### Mann-Whitney U Test
The Mann-Whitney U test (also known as the Wilcoxon Rank Sum test) is a non-parametric test for a location shift between two independent populations without assuming normality. However, the values should be sampled from two populations with very similar distributions. The null hypothesis is that there is no shift in the centers of the distributions of the two populations; the alternative is that there is a shift.
```{r , echo = FALSE, out.height="90%", fig.align="center"}
include_graphics("Pictures/MWUtest.png")
```
```{r echo = T, eval=T}
wtest <- wilcox.test(rate~state, data = Puromycin, conf.int = TRUE)
wtest
wtest$p.value
```
The test statistic W in R is defined to be the sum of the ranks of one of the groups
minus $n_1(n_1 +1)/2$. The procedure to obtain a confidence interval is quite involved, but
thankfully R will provide it to us upon request.
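As a small sketch (assuming, as above, that "treated" is the first level of `state` and therefore the group whose ranks are summed), W can be reproduced by hand:
```{r echo = T, eval=T}
r <- rank(Puromycin$rate)                # ranks of all observations combined
n1 <- sum(Puromycin$state == "treated")  # size of the first group
sum(r[Puromycin$state == "treated"]) - n1 * (n1 + 1) / 2  # should match wtest$statistic
```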
### ANOVA
A two-sample t-test is used to test hypotheses about the means of two normal populations
using two datasets which were sampled independently of one another from the two
respective populations. An analysis of variance (ANOVA) allows for comparing the
means of one variable among more than two populations (each of which is normally
distributed), again under the assumption that the samples are independent.
- If the variance between groups is higher than the variance within groups, then
there is evidence for a difference in means between groups.
- The null hypothesis is that the means of all groups are equal; the alternative is
that at least one group has a different mean from the others.
- ANOVA does not tell us which group is different nor in what direction; visualization and/or post-hoc pairwise t-tests can provide this information.
Here is a small illustration of the analysis of variance:
```{r , echo = FALSE, out.height="90%", fig.align="center"}
include_graphics("Pictures/ANOVA.png")
```
In the left image there is more variability within each of the three groups (boxplots)
than between the three group means ($\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$). In the right image there is more variability between the three group means than within each of the three groups.
As with the previous testing procedures we've seen, we compare the p-value to the
pre-selected significance level. If $p \leq \alpha$ we reject the null and conclude we have evidence
that at least one population (or group) mean is different from the others, while if $p > \alpha$,
we fail to reject the null and conclude we do not have evidence that any of the means
are different.
R will calculate all of the statistics for us! For example, let's test at the 5% significance level the alternative hypothesis that at least one of the mean BMIs for three different educational levels is different from the other two (versus the null of three equal means).
```{r, message=F, echo = F, eval = T}
library(readr)
R_data <- read_csv("Data/R_data2.csv")
```
```{r, message=F, echo = T, eval = F}
library(readr)
R_data <- read_csv("~/R_data2.csv")
```
```{r echo = T, eval=T}
baby_aov1 <- aov(BMI ~ educational_level, data = R_data)
summary(baby_aov1)
```
Here the F-statistic is large enough to get a p-value smaller than 0.05 so we conclude
at least one population mean BMI is different from the other two. But which group is
it? Let's look at a boxplot:
```{r echo = T, eval=T}
boxplot(R_data$BMI ~ R_data$educational_level)
```
This visual information is informative but we can actually make pairwise comparisons
of all the means using Tukey's HSD (for example). We initially used a significance level of 0.05, but will now conduct several tests and use a multiple test procedure that adjusts the family-wise
error rate to 5%:
```{r echo = T, eval=T}
TukeyHSD(baby_aov1)
```
Note that we use the multiple testing procedure to account for the fact that we are
simultaneously performing more than one test or CI which compounds the error rates
of each test. Since all of the CIs contain 0 and none of the p-values are significant at
the 5% level, the initial test result is overturned! We do not actually have evidence that
education level affects average BMI.
Also note that you would not bother with these post-hoc tests/CIs if your initial ANOVA result were not statistically significant.
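As a sketch of an alternative post-hoc approach, pairwise t-tests with adjusted p-values (Bonferroni is used here purely for illustration) give similar information:
```{r echo = T, eval=T}
pairwise.t.test(R_data$BMI, R_data$educational_level, p.adjust.method = "bonferroni")
```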
### Kruskal-Wallis Test
The single-factor ANOVA model for comparing population or treatment means assumed
that for all groups, random samples were drawn from normal populations each having
the same variance. This normality assumption is required for a valid F test, but the
next procedure for testing the equality of the centers of the distributions only requires
that the populations have the same continuous distribution.
The null hypothesis is that all of the group centers are the same and the alternative is
that at least one of the group centers is different. The Kruskal-Wallis test examines the
validity of these hypotheses by working on ranks of the data without assuming the data
come from a specific distribution.
The procedure starts by ranking all the data together: under the null hypothesis, if the centers of the groups are equal, the ranks from all the groups will be intermingled. If the null is false, then some samples will consist mostly of observations having small ranks in the combined sample, whereas others will consist mostly of observations having large ranks.
The Kruskal-Wallis test statistic is a measure of the extent to which the sums of the
ranks within each group deviate from their common expected value, and the null is rejected if the computed value of the statistic indicates too great a discrepancy between
observed and expected rank averages.
Example. The accompanying observations on axial stiffness index resulted from a
study of metal-plate connected trusses in which five different plate lengths (4 in., 6 in.,
8 in., 10 in., and 12 in.) were used.
```{r , echo = FALSE, out.width="90%", fig.align="center"}
include_graphics("Pictures/KWtest.png")
```
Here we read in the data:
```{r, message=F, echo = F, eval = T}
library(readr)
stiffness <- read_csv("Data/stiffness.csv")
```
```{r, message=F, echo = T, eval = F}
library(readr)
stiffness <- read_csv("~/stiffness.csv")
```
If we look at the boxplot of the stiffness values for each length, we see that the group centers are probably different and that the distributions don't look very normal:
```{r, echo = T, eval = T}
boxplot(stiffness~lengths, data = stiffness)
```
The output of the Kruskal-Wallis test confirms this impression: the p-value is quite small, indicating that at least two of the group centers are different:
```{r, echo = T, eval = T}
kruskal.test(stiffness~lengths, data = stiffness)
```
### Chi-squared test
A $\chi^2$ (read: chi-squared) test of independence tests the null hypothesis that rows and columns are independent in an _r x c_ contingency table (r is the number of rows, c the number of columns); the alternative is that they are not independent. The counts must be independent and sampled randomly. We can calculate a chi-squared statistic as follows:
$\chi^2 = \sum \frac{(O_i-E_i)^2}{E_i}$
where $O_i$ is the observed count for cell i in the table, and $E_i$ is the expected count, calculated by multiplying the row and column totals for cell i and dividing by the overall total (for example, for the treatment/event cell in the table below, $E = (100 * 70)/200 = 35$). The p-value for the calculated $\chi^2$ statistic depends on the $\chi^2$ distribution with $(r -1)*(c-1)$ degrees of freedom. R will provide us with the statistic and p-value.
Suppose we have the following table:
| | event | no event | total |
|---------|:-----:|:--------:|:-----:|
|treatment| 20 | 80 | 100|
|placebo | 50 | 50 | 100|
|total | 70 | 130| 200|
and we want to know whether our treatment prevents events, that is, does the occurrence
of an event depend on the type of treatment?
```{r echo = T, eval=T}
mytable <- matrix(c(20, 50, 80, 50), nrow=2)
mytable
cstest <- chisq.test(mytable)
cstest
cstest$p.value
```
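The test object also stores the expected counts, and note that for a 2 x 2 table `chisq.test()` applies Yates' continuity correction by default; `correct = FALSE` turns it off:
```{r echo = T, eval=T}
cstest$expected                       # expected counts E_i under independence
chisq.test(mytable, correct = FALSE)  # the same test without the continuity correction
```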
### Correlations
If we want to know the degree of linear association between 2 variables, we can calculate a correlation. Correlation does not make any a priori assumption about whether one variable depends on the other; it simply quantifies the degree of association, without describing the form of the relationship between the variables. We have 4 general types of association to consider:
```{r , echo = FALSE, out.width="90%", fig.align="center"}
include_graphics("Pictures/Correlation2.png")
```
Pearson's r can only lie in the interval [-1,1] (inclusive), where
- r = 0, no linear correlation
- r > 0, positive linear correlation
- r < 0, negative linear correlation
- r = 1, perfect positive linear correlation
- r = -1, perfect negative linear correlation
Note that correlation does not imply causation. If two variables are highly correlated
you cannot infer that one is causing the other; they could both be varying along with a
third, possibly unknown confounding factor (either causal or not).
For Pearson's r we assume a linear relationship between x and y and that they both follow a normal distribution.
If data are not normally distributed, the degree of association can be determined by
the ranked correlation coefficient, Spearman's $\rho$, which replaces the x's and y's in the
Pearson formula with their ranks.
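A tiny sketch (with made-up numbers) shows that Spearman's $\rho$ is simply Pearson's r computed on the ranks:
```{r echo = T, eval=T}
x <- c(1, 2, 4, 7, 10)
y <- c(2, 1, 5, 6, 20)
cor(x, y, method = "spearman")
cor(rank(x), rank(y))  # identical: Pearson's r on the ranks
```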
`cor.test()` provides p-values for both Pearson and Spearman correlations (and a confidence interval for the Pearson correlation).
Let's look at an example of how BMI and cholesterol are associated:
```{r echo = T, eval=T}
plot(R_data$BMI, R_data$cholesterol)
cor(R_data$BMI, R_data$cholesterol)
cor(R_data$BMI, R_data$cholesterol, method="spearman")
cor.test(R_data$BMI, R_data$cholesterol)
cor.test(R_data$BMI, R_data$cholesterol, method="spearman")
```
Note the warning message (which is not an error!) that indicates if your data have tied
values (e.g. 1, 1, 3, 5) then the p-value is approximated and is not exact. It's nothing
to worry about (especially if you're not a statistician...).
## Regression
### Linear regression
There is a high positive correlation between birth weight and gestational age, but this
says nothing about the predictive power of the variables. We would like to explain how
gestational age influences changes in birth weight.
```{r echo = T, eval=T}
plot(R_data$pregnancy_length_weeks, R_data$birthweight)
cor(R_data$pregnancy_length_weeks, R_data$birthweight)
```
We'll quantify this relationship using linear regression, distinguishing between an independent, or predictor or explanatory, variable (gestational age) and a dependent, or response or outcome, variable (birth weight). Simple linear regression uses the following
model:
$Y_i = \beta_0 + \beta_1*X_i + \varepsilon_i$
where $1 \leq i \leq n$, the model is a straight line, and error is remaining variation which cannot
be explained by the model.
The parameters $\beta_0$ and $\beta_1$ are the intercept and slope of a straight line, respectively. The $\beta_0$ and $\beta_1$ of the one straight line that best fits the data are estimated via the method of least squares. The "best" line is the one that has the lowest sum of squared residuals.
The command to get these estimates in R is `lm`:
```{r echo = T, eval=T}
lm1 <- lm(birthweight ~ pregnancy_length_weeks, data = R_data)
lm1
```
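As a check on what `lm()` does, the least-squares estimates can also be computed directly from the closed-form formulas (a sketch using the same two columns):
```{r echo = T, eval=T}
gest <- R_data$pregnancy_length_weeks
bw <- R_data$birthweight
b1_hat <- cov(gest, bw) / var(gest)       # slope = Sxy / Sxx
b0_hat <- mean(bw) - b1_hat * mean(gest)  # intercept = ybar - slope * xbar
c(intercept = b0_hat, slope = b1_hat)     # should match coef(lm1)
```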
You can predict for given values of x:
```{r echo = T, eval=T}
predict(lm1)[R_data$pregnancy_length_weeks==35]
```
You can also use `predict()` to predict y for x's that are not already in your data:
```{r echo = T, eval=T}
predict(lm1, newdata=data.frame(pregnancy_length_weeks=c(seq(25,50,5))))
```
but watch out for extrapolating (predicting outside the range of your data) - clearly
we can't have negative birth weights!
Now, before we make inference on or, actually, even find and use this line, we must
check that the following assumptions hold, otherwise we will not obtain trustworthy
results:
- relationship between x and y can be described by a straight line
- outcomes y are independent
- variance of residuals is constant across values of x
- residuals follow a normal distribution
To get diagnostic plots in R, we can check a histogram of the data and additionally
plot our model to check that the residuals are normal and homoscedastic (have constant
variance) across the weeks:
```{r echo = T, eval=T}
hist(R_data$birthweight)
par(mfrow=c(2,2))
plot(lm1)
```
The top left plot tells us if our residuals are homoscedastic, and the top right plot
displays a quantile-quantile (QQ) plot to check for normality. Here are examples of bad
QQ plots and heteroscedasticity:
```{r echo = T, eval=T}
set.seed(1234)
par(mfrow=c(2,2))
x <- sort(rnorm(100))
y1 <- sort(rt(100,2))
plot(x, y1, xlim=c(-3,3), xlab="normal", ylab="t, df=2")
abline(0, 1)
y2 <- sort(rexp(100))
plot(x, y2, xlim=c(-3,3), ylim=c(-6,6), xlab="normal", ylab="exponential")
abline(0, 1)
plot(x, x^2-5+rexp(100))
abline(0, 0, col="red")
plot(x, x*rexp(100))
abline(0, 0, col="red")
```
However, the assumptions reasonably hold for our baby data, so we'll go ahead and use
the model fit to make inference on the slope.
Actually, R has already done the inference...we just need to extract it from the model:
```{r echo = T, eval=T}
summary(lm1)
```
R has performed a one-sample t-test on the intercept $b_0$ and slope $b_1$ to determine
if they are each statistically significantly different from 0. The probabilities are quite
small, so we can reject the null hypothesis that they are equal to 0 and conclude that
birthweight significantly increases, on average, by about 400 grams for every additional week
of gestation. The intercept is (usually) unimportant and we don't really care that it is
different from 0. If the p-value for the slope is not small (e.g. greater than 0.05) then
we would say "we do not have enough evidence to reject the null hypothesis that the
slope is 0."
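Confidence intervals for the intercept and slope are available with `confint()` (95% by default):
```{r echo = T, eval=T}
confint(lm1)
```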
Now, how well does the model actually fit our data? How well does x predict y? The square of Pearson's correlation coefficient, $r^2$, is a measure of goodness of fit. It is the proportion of variance in y that can be explained by the model (so, by x). In our example, $r^2$ is:
```{r echo = T, eval=T}
cor(R_data$pregnancy_length_weeks, R_data$birthweight)^2
summary(lm1)$r.squared
```
which means that 96% of the variability in birth weight can be explained by gestational
age.
Simple linear regression models one y on one x. If we have multiple predictor variables,
we use multiple linear regression to determine if the variability in y can be explained by
this set of variables. In addition to the assumptions required for a valid simple linear
regression, we now require that the covariates show no perfect multicollinearity, that is, there is no strong correlation among the multiple x's. The model is
$Y_i = \beta_0 + \beta_1*X_{1i} + ... + \beta_k*X_{ki}+\varepsilon_i$
In R, the addition of an extra variable is quite straightforward:
```{r echo = T, eval=T}
cor(R_data$pregnancy_length_weeks, R_data$BMI)
lm2 <- lm(birthweight ~ pregnancy_length_weeks + BMI, data = R_data)
lm2stats <- summary(lm2)
lm2stats
```
Since the 2 variables are uncorrelated, we can add BMI to the model. We see that it
does not significantly predict birth weight, but gestational age still does. We use the
adjusted $r^2$ to check goodness of fit:
```{r echo = T, eval=T}
lm2stats$adj.r.squared
```
So adding BMI does not help explain any of the variability in birth weight (since $r^2$ was previously already 0.96). This is also confirmed by visualization:
```{r echo = T, eval=T}
plot(R_data$BMI, R_data$birthweight)
```
Note that the summary function returns a lot of information. If, for example, you wanted
to extract only the p-values you could do the following:
```{r echo = T, eval=T}
names(lm2stats)
lm2stats$coef[,4]
```
We can predict birthweights with new data:
```{r echo = T, eval=T}
predict(lm2, newdata=data.frame(pregnancy_length_weeks=32,
BMI=30))
```
But we cannot forget to check assumptions!
```{r echo = T, eval=T}
par(mfrow=c(2,2))
plot(lm2)
par(mfrow=c(1,1))
hist(lm2$residuals)
```
### Logistic regression
There are many research topics for which the dependent variable y is binary (0/1), e.g.
- mortality (dead/alive)
- treatment response (responder/non-responder)
- development of disease (yes/no)
and we want to predict the membership of an individual to one of the two categories
based on a set of predictors.
In this situation we have to work with probabilities, which are numbers between 0 and
1. A value $P(y)$ that is close to 0 means that y is very unlikely to occur, while a value
close to 1 means that y is very likely to occur.
In simple/multiple linear regression a continuous variable y is predicted by continuous/categorical x(s), but if y can only have 2 values (e.g. y = 0 or y = 1), how do we predict the probability that y = 1 given one or more predictors? Could we apply a linear regression model...?
We would try to fit $P(y = 1) = \beta_0 + \beta_1X$, which doesn't work. In linear regression we assume the relationship between x and y is linear, but in the case of binary outcomes this assumption is no longer valid. Results obtained from a linear regression model wouldn't make sense! Predicted probabilities outside the interval [0, 1] are not interpretable.
Instead we're going to use a logistic curve:
$ln(\frac{p}{1-p}) = \beta_0 + \beta_1X$
The function $ln(p/(1-p))$ is called the logit of p, and it is this function of p (rather than y itself) that is linear in x.
This model formulation assures that the predicted probability of an event falls between
0 and 1, unlike a linear regression model.
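A quick sketch of why: inverting the logit maps any value of the linear predictor back into (0, 1):
```{r echo = T, eval=T}
eta <- seq(-6, 6, by = 0.1)     # hypothetical values of beta0 + beta1 * x
p <- exp(eta) / (1 + exp(eta))  # inverse logit, the same as plogis(eta)
range(p)                        # always strictly between 0 and 1
plot(eta, p, type = "l", xlab = "linear predictor", ylab = "P(y = 1)")
```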
Also note that we've made no assumptions about normality or homoscedasticity of the residuals (linearity is now assumed on the logit scale rather than for y itself)!
The values of the intercept and slope are estimated using the maximum likelihood
method, which finds the values of the coefficients that make the observed data most
likely to occur.
Statistical significance of the estimated coefficients is tested with the Wald test, which is based on the $\chi^2$ distribution (though R reports it as a "z" statistic). The goodness of fit is assessed by the deviance, which is based on comparing observed and expected values. Also, the Akaike information criterion (AIC) gives a measure of the quality of the model.
The coefficients are most usefully interpreted with the following:
$e^{\beta} = \frac{\textrm{odds after a unit change}}{\textrm{original odds}}$
and when x is binary, $e^{\beta}$ is the odds ratio from the 2 by 2 contingency table!
In R:
```{r echo = T, eval=T}
R_data$StatusNew <- factor(R_data$Status,
levels = c("normal brain development", "intellectual disability"))
levels(R_data$StatusNew)
lr1 <- glm(StatusNew ~ BMI, family = binomial(logit), data = R_data)
summary(lr1)
```
so $logit(p) = -2.40591+0.08782* BMI$. Thus the log-odds (logit) of having a baby with an intellectual disability when the mother's BMI is 32 is:
```{r echo = T, eval=T}
logit_p1 <- -2.40591+0.08782*32
logit_p1
```
Or, equivalently, using `predict()` (which by default returns the prediction on the logit scale, together with its standard error):
```{r echo = T, eval=T}
predict(lr1, newdata = data.frame(BMI = 32), se.fit = TRUE)
```
To obtain the probability of the event itself, use `type = "response"`:
```{r echo = T, eval=T}
predict(lr1, newdata = data.frame(BMI = 32), se.fit = TRUE, type = "response")
```
The coefficient b1 = 0.08782 can be exponentiated to obtain the odds ratio:
```{r echo = T, eval=T}
summary(lr1)$coef
b1 <- summary(lr1)$coef[2, 1]
b1
exp(b1)
```
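A rough Wald-based 95% confidence interval for this odds ratio can be sketched from the coefficient and its standard error (a profile-likelihood interval would differ slightly):
```{r echo = T, eval=T}
se_b1 <- summary(lr1)$coef[2, 2]  # standard error of b1
exp(b1 + c(-1.96, 1.96) * se_b1)  # approximate 95% CI for the odds ratio
```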
Like multiple linear regression, we can add variables into the model here as well:
```{r echo = T, eval=T}
lr2 <- glm(StatusNew ~ BMI + smoking, family = binomial(logit), data = R_data)
summary(lr2)$coef
```
Thus the probability of having a baby with an intellectual disability when the mother's BMI is 32 and she is a smoker is:
```{r echo = T, eval=T}
mynewdata <- data.frame(BMI = 32, smoking = factor("yes"))
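# note: with type = "response", predict() returns the probability itself (not the logit)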
logit_p2 <- predict(lr2, newdata = mynewdata, type = "response")
logit_p2
```