-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathregression.html
386 lines (352 loc) · 12.3 KB
/
regression.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>The Regressomatic 4000</title>
<script src="//d3js.org/d3.v3.min.js" charset="utf-8"></script>
<link rel="stylesheet" href="regression.css">
<script src="sylvester/sylvester.js"></script>
<script src="distributions.js"></script>
<script src="regression.js"></script>
</head>
<body>
<h1>The Regressomatic 4000</h1>
<p><em>by</em> <a href="http://www.refsmmat.com">Alex Reinhart</a></p>
<p>
Once you have fit a regression model, your work is not finished. There are
a variety of diagnostic plots you can make to ensure the model fits
well. (There are also a range of diagnostic tests, but diagnostic plots
are easy to make and give more useful information.) This page is a brief
overview—not a textbook on diagnostics, but a brief introduction to
each type of plot.
</p>
<p>
The diagnostic plots on this page are interactive. You can click and drag
data points in the scatterplot to see what happens to the diagnostics, so
spend some time playing with the data to see what happens.
</p>
<h2>Why diagnostics?</h2>
<p>
There are a million different ways to fit a regression model. You can add
square or cubic terms, transform variables, weight data points, or drop
outliers. But first you need to know how well your model fits.
</p>
<p>
The classic example of the need to visualize your model fit is Anscombe's
Quartet, four scatterplots of data which have identical means, variances,
regression lines and correlation coefficients:
</p>
<figure>
<img style="width: 90%" src="img/anscombes-quartet.svg">
<figcaption>
Plot from <a
href="https://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg">Wikimedia
Commons</a>, by users Schutz and Avenue.
</figcaption>
</figure>
<p>
Only the top left plot looks like the regression line really fits the
data. Let's use Ancombe's quartet as examples as we look at regression
diagnostics.
</p>
<script>
// Anscombe's quartet, from R's "anscombe" dataset
var ans1 = [[10, 8.04],
[8, 6.95],
[13, 7.58],
[9, 8.81],
[11, 8.33],
[14, 9.96],
[6, 7.24],
[4, 4.26],
[12, 10.84],
[7, 4.82],
[5, 5.68]];
var ans2 = [[10, 9.14],
[8, 8.14],
[13, 8.74],
[9, 8.77],
[11, 9.26],
[14, 8.10],
[6, 6.13],
[4, 3.10],
[12, 9.13],
[7, 7.26],
[5, 4.74]];
var ans3 = [[10, 7.46],
[8, 6.77],
[13, 12.74],
[9, 7.11],
[11, 7.81],
[14, 8.84],
[6, 6.08],
[4, 5.39],
[12, 8.15],
[7, 6.42],
[5, 5.73]];
var ans4 = [[8, 6.58],
[8, 5.76],
[8, 7.71],
[8, 8.84],
[8, 8.47],
[8, 7.04],
[8, 5.25],
[19, 12.50],
[8, 5.56],
[8, 7.91],
[8, 6.89]];
// Plot styles
var opts = {
width: 390,
height: 350,
padding: 40,
ptRadius: 5,
hoverRadius: 8,
ptColor: "#FFA500",
ptBorderColor: "#F00",
hoverColor: "#FF4800"
};
</script>
<h2>Residuals</h2>
<p>
First, a simple regression and ordinary residuals. Click and drag any data
point to move it and see the change in the residuals.
</p>
<table>
<tr>
<th>Regression</th>
<th>Residuals</th>
</tr>
<tr>
<td id="ans1reg"></td>
<td id="ans1resid"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg"), d3.select("#ans1resid"),
deepClone(ans1), opts, [3,16], [13, 3], "residuals",
"X1", "Y1");
</script>
<p>
Those look pretty reasonable. The residuals are a blob: no obvious
structure, spread evenly above and below zero. But suppose instead we try
the second dataset from Anscombe's quartet, which is really
quadratic. Then the residuals look very fishy indeed:
</p>
<table>
<tr>
<th>Regression</th>
<th>Residuals</th>
</tr>
<tr>
<td id="ans2reg"></td>
<td id="ans2resid"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans2reg"), d3.select("#ans2resid"),
deepClone(ans2), opts, [3,16], [13, 3], "residuals",
"X2", "Y2");
</script>
<p>
When the residuals appear curved or otherwise have a strange shape, your
model is missing something. Here we can see there's a quadratic trend left
over after the linear trend is fit, so we should add a quadratic term.
</p>
<h2>Standardized residuals</h2>
<p>
Next, let's <em>standardize</em> the residuals. We'll return to the first
dataset in the quartet. Try fiddling with the data: notice that as one
data point is pulled to become an outlier, the other residuals become
smaller, since the residual variance is increasing.
</p>
<table>
<tr>
<th>Regression</th>
<th>Standardized residuals</th>
</tr>
<tr>
<td id="ans1reg2"></td>
<td id="ans1standard"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg2"), d3.select("#ans1standard"),
deepClone(ans1), opts, [3,16], [13, 3], "rstandard",
"X1", "Y1");
</script>
<h2>Studentized residuals</h2>
<p>
An alternate form of standardization leads to <em>Studentized</em>
residuals. These can be derived directly from the standardized residuals,
so they do not show any new information, but their interpretation is
clearer.
</p>
<p>
A Studentized residual can be seen as a <em>t</em> test of the null
hypothesis that the prediction for the given data point would not change
if the data point were left out of the fit. If the point is an outlier, we
can expect the test to reject the null.
</p>
<p>
Here are the Studentized residuals of the same dataset as before, for
comparison. The dashed lines indicate the rejection region of the
Bonferroni-corrected <em>t</em> test.
</p>
<table>
<tr>
<th>Regression</th>
<th>Studentized residuals</th>
</tr>
<tr>
<td id="ans1reg3"></td>
<td id="ans1student"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg3"), d3.select("#ans1student"),
deepClone(ans1), opts, [3,16], [13, 3], "rstudent",
"X1", "Y1");
</script>
<h2>Cook's distance</h2>
<p>
Next, let's try Cook's distances. The Cook's distance of a data point
measures how much the model parameters would change if that point were
removed, so it is a very intuitive measure of influence. Notice that if
you vertically displace a point near the middle of the line, the distance
won't pass 0.6 no matter how hard you try, while a point at either end of
the line can easily have a distance over 1.
</p>
<table>
<tr>
<th>Regression</th>
<th>Cook's distances</th>
</tr>
<tr>
<td id="ans1reg4"></td>
<td id="ans1cook"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg4"), d3.select("#ans1cook"),
deepClone(ans1), opts, [3,16], [13, 3], "cooks",
"X1", "Y1");
</script>
<p>
Cook's distance is useful when there are major outliers or influential
points. For example, it highlights the outlying point here, which pulls
the regression line disproportionately upwards. The other data points lie
almost exactly on a line, so they <em>individually</em> have very little
influence on the line.
</p>
<table>
<tr>
<th>Regression</th>
<th>Cook's distances</th>
</tr>
<tr>
<td id="ans3reg"></td>
<td id="ans3cook"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans3reg"), d3.select("#ans3cook"),
deepClone(ans3), opts, [3,16], [13, 3], "cooks",
"X3", "Y3");
</script>
<p>
The Cook's distance for the fourth plot in the quartet is even more
extreme—it's essentially infinite for the outlying point. All the
other points share the same X value, so the outlying point completely
determines the regression line, which always passes through it. It's as if
we only had two data points.
</p>
<h2>Leverage</h2>
<p>Next, leverage. Actually I'm not sure what it's good for.</p>
<table>
<tr>
<th>Regression</th>
<th>Leverage</th>
</tr>
<tr>
<td id="ans1reg5"></td>
<td id="ans1lev"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg5"), d3.select("#ans1lev"),
deepClone(ans1), opts, [3,16], [13,3], "leverage",
"X1", "Y1");
</script>
<h2>Quantile-quantile plots</h2>
<p>
Quantile-quantile plots are a bit difficult to get intuition
about. They're designed to check if <em>residuals</em> are normally
distributed. (The marginal distribution of the data doesn't
matter—we're interested only in the distribution of the residuals.)
Of course, once your dataset becomes large enough, normality isn't an
important assumption—the linear regression estimators have
asymptotically normal sampling distributions regardless, so you can do the
standard tests on coefficients.
</p>
<p>
On the X axis of the Q-Q plot lie the expected values of the order
statistics of a normal distribution. On the Y axis are the standardized
residuals. Hence the smallest residual is compared against its expected
value, and so on. If the points deviate from a line, the normal
distribution is not a good fit for the residuals.
</p>
<table>
<tr>
<th>Regression</th>
<th>Q-Q plot</th>
</tr>
<tr>
<td id="ans1reg6"></td>
<td id="ans1qq"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans1reg6"), d3.select("#ans1qq"),
deepClone(ans1), opts, [3,16], [13,3], "qqnorm",
"X1", "Y1");
</script>
<p>
This data looks fairly normal—the sample size is small, so it can't
be expected to lie perfectly on the line. Try dragging data points around
to see what happens.
</p>
<p>
On the other hand, these residuals are hardly normal:
</p>
<table>
<tr>
<th>Regression</th>
<th>Q-Q plot</th>
</tr>
<tr>
<td id="ans3reg2"></td>
<td id="ans3qq"></td>
</tr>
</table>
<script>
regressionPlots(d3.select("#ans3reg2"), d3.select("#ans3qq"),
deepClone(ans3), opts, [3,16], [13,3], "qqnorm",
"X3", "Y3");
</script>
<footer>
<p><a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img
alt="Creative Commons License" style="border-width:0"
src="https://i.creativecommons.org/l/by/4.0/80x15.png" /></a><br /><span
xmlns:dct="http://purl.org/dc/terms/" property="dct:title">The
Regressomatic</span> by <a xmlns:cc="http://creativecommons.org/ns#"
href="http://www.refsmmat.com" property="cc:attributionName"
rel="cc:attributionURL">Alex Reinhart</a> is licensed under a <a
rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative
Commons Attribution 4.0 International License</a>.</p>
<p>The Regressomatic is <a
href="https://github.com/capnrefsmmat/regressomatic">open-source</a>. Its
code is available under the MIT license.</p>
</footer>
</body>
</html>