---
title: "Advanced Statistics and Machine Learning. Case study: cancer death rate"
author: "Allan Kouidri"
date: "August 2022"
output:
  pdf_document: default
---
# Advanced Statistics and Machine Learning
This project uses the cancer_reg.csv dataset (https://data.world/exercises/linear-regression-exercise-1/workspace/file?filename=cancer_reg.csv) to predict "TARGET_deathRate".
Several models will be used, such as ordinary least squares, CART and Random Forest, and we will compare them.
### We first fit an ordinary least squares model explaining the target_deathrate variable with the numerical ones.
```{r echo = T, results = 'hide', message=FALSE, warning=FALSE }
#Importing libraries
library(knitr)
library(missMDA)
library(glmnet)
library(MASS)
library(rpart)
library(rpart.plot)
library(randomForest)
library(VSURF)
```
```{r}
#Importing the dataset
data <-
read.csv(
file = "./cancer_reg.csv",
dec = ".",
header = TRUE
)
```
We first take a quick look at the summary statistics of the data (output table not shown).
```{r echo = T, results = 'hide'}
summary(data)
```
From this summary we can see:
(i) There are two character variables, "binnedinc" and "geography". They will be
removed, since we will work only with numerical variables.
(ii) There are missing values for three variables: pctsomecol18_24, pctemployed16_over
and pctprivatecoveragealone (verified just below).
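As a quick check (a small addition to the original workflow), we can count the missing values per variable to confirm the list above:
```{r}
# Count the missing values per variable; only variables with at least one NA are shown
na_counts <- colSums(is.na(data))
na_counts[na_counts > 0]
```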
We will keep only numerical variables.
```{r}
#Removing variables
data2 <- data
data2$binnedinc <- NULL
data2$geography <- NULL
```
Next, we impute the missing values with imputePCA() from the missMDA package (regularised iterative PCA).
```{r}
data3 <- data2
data3 <- as.data.frame(imputePCA(data2)$completeObs)
cat('number missing values in data2:',sum(is.na(data2)),'\n')
cat('number missing values in data3:',sum(is.na(data3)),'\n')
```
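By default imputePCA() uses ncp = 2 dimensions. As an optional refinement (a sketch, not run here), the number of dimensions could be chosen by cross-validation with estim_ncpPCA() from missMDA:
```{r, eval=FALSE}
# Choose the number of PCA dimensions by cross-validation, then re-impute
ncp <- estim_ncpPCA(data2, ncp.max = 10)$ncp
data3 <- as.data.frame(imputePCA(data2, ncp = ncp)$completeObs)
```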
Let's look for potential outliers using the lm diagnostic plots.
```{r, figures-side, fig.show="hold", out.width="50%"}
model0 = lm(target_deathrate~.,data = data3)
plot(model0)
```
Looking at the Residuals vs Fitted panel of the model0 diagnostics, we have some
potential outliers: points 282, 1221 and 1366.
We will remove them.
```{r}
data4=data3[-c(282,1221,1366),]
```
Let's refit the model after removing the selected observations.
```{r}
model0b = lm(target_deathrate~.,data = data4)
```
More precisely, to detect outliers we should look at the studentized
residuals.
```{r}
Rs=rstudent(model0b)
plot(model0b$fitted.values, Rs)
```
Several points lie above 2 or below -2; we will remove them
from the study.
```{r}
#Getting the list of the points to remove
Rs_df<-(as.data.frame((Rs)))
row.names(Rs_df)[which(Rs_df>2)]
row.names(Rs_df)[which(Rs_df< -2)]
```
```{r}
# Removing the rows identified above (note: a few indices appear twice, which is harmless)
data5=data3[-c(1221,1366,282,2646, 31 , 79 , 116 , 122 ,166, 209 , 250 , 254 , 458 ,
466 , 469 , 472 ,
484 , 495 , 515 , 522 , 537 , 549 , 554 , 562 , 564 , 627 , 666 ,
670 , 690 , 727 , 775 , 780 , 786 , 975 , 979 , 1000 , 1076 , 1174
, 1204 , 1217 , 1221 , 1236 , 1261 , 1276 , 1297 , 1310 , 1316 ,
1390 , 1442 , 1497 , 1513 , 1542 , 1548 , 1856 , 1866 , 1882 , 1884 ,
1897 , 1914 , 1958 , 1962 , 2001 , 2016 , 2027 , 2036 , 2040 ,
2048 , 2079 , 2135 , 2174 , 2176 , 2267 , 2549 , 2563 , 2587 ,
2590 , 2596 , 2598 , 2600 , 2637 , 2673 , 2682 , 2714 , 2726 ,
2727 , 2757 , 2810 , 2812 , 2819 , 2825 , 2842 , 2858 , 3022 ,
3034 , 3036 , 3040 , 34 , 69 , 105 , 119 , 120 , 124 , 176 ,
189 , 256 , 264 , 415 , 476 , 514 , 556 , 616 , 621 , 625 ,
650 , 748 , 783 , 803 , 812 , 845 , 912 , 913 , 920 , 921 ,
925 , 1048 , 1058 , 1059 , 1130 , 1160 , 1195 , 1249 , 1290 ,
1311 , 1345 , 1405 , 1429 , 1445 , 1560 , 1568 , 1580 , 1686 ,
1701 , 1708 , 1777 , 1797 , 1942 , 1965 , 1969 , 2010 , 2018 ,
2051 , 2065 , 2066 , 2307 , 2311 , 2312 , 2318 , 2328 , 2344 ,
2351 , 2353 , 2386 , 2404 , 2427 , 2440 , 2444 , 2546 , 2593 ,
2626 , 2642 , 2659 , 2661 , 2669 , 2674 , 2696 , 2720 , 2734 ,
2741 , 2789 , 2809 , 2822 , 2985 , 2988 ),]
```
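Hard-coding the row list works but is error-prone. A minimal programmatic alternative (a sketch, not used downstream; out_rows and data5_alt are hypothetical names), assuming the row names of data3 still match their positions:
```{r}
# Collect the rows with |studentized residual| > 2 from the names of Rs,
# then drop them together with the three points found earlier
out_rows <- as.integer(names(Rs)[abs(Rs) > 2])
data5_alt <- data3[-unique(c(282, 1221, 1366, out_rows)), ]
nrow(data5_alt)
```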
```{r}
#number of rows removed
nrow(data)-nrow(data5)
```
Let's check again for potential outliers
```{r}
model0c = lm(target_deathrate~.,data = data5)
Rsc=rstudent(model0c)
plot(model0c$fitted.values, Rsc)
```
Based on the plot of the studentized residuals, we no longer see any
potential outliers. We can proceed to the modeling.
We split our dataset into train and test.
```{r}
#Full dataset
set.seed(seed = 1703)
splitvector <-
sample(
x = c("learning", "test"),
size = nrow(x = data3),
replace = TRUE,
prob = c(0.7, 0.3)
)
learningset3 <- data3[splitvector == "learning", ]
testset3 <- data3[splitvector == "test", ]
```
```{r}
#Dataset with outliers removed
set.seed(seed = 1703)
splitvector <-
sample(
x = c("learning", "test"),
size = nrow(x = data5),
replace = TRUE,
prob = c(0.7, 0.3)
)
learningset5 <- data5[splitvector == "learning", ]
testset5 <- data5[splitvector == "test", ]
```
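A quick sanity check on the split sizes (a small addition; roughly 70/30 is expected):
```{r}
cat(nrow(learningset5), 'learning rows /', nrow(testset5), 'test rows\n')
```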
### Ordinary least squares (lm)
We will test OLS models before and after removing potential outliers.
```{r}
model1 = lm(target_deathrate~.,data = learningset3)
p1 = predict(model1, newdata = testset3)
cat('OLS with potential outliers:',sqrt(mean((p1-testset3$target_deathrate)**2)),'\n')
```
```{r}
model1b = lm(target_deathrate~.,data = learningset5)
p1b = predict(model1b, newdata = testset5)
cat('OLS without outliers:',sqrt(mean((p1b-testset5$target_deathrate)**2)),'\n')
```
The RMSE after removing the potential outliers is better, so we keep
learningset5 and testset5 for the subsequent models.
### Next, we perform variable selection, first with a stepwise method and then with a penalized one.
- Stepwise:
```{r}
model2 = step(model1b)  # stepwise selection starting from the outlier-free fit
p2 = predict(model2, newdata = testset5)
```
```{r}
cat('Step default parameters:',sqrt(mean((p2-testset5$target_deathrate)^2)),'\n')
```
```{r}
stepforward=stepAIC(lm(target_deathrate~1,data=learningset5), scope=list(upper=model1b, lower=~1),test='F', direction = 'forward')
```
We now run backward stepwise selection, to check whether it yields the same final set of explanatory variables.
```{r}
stepbackward=stepAIC(model1b, scope = list(upper=model1b,lower=~1),test='F')
```
We can see that the lists of variables selected by the forward and backward procedures are identical.
```{r}
model2c = lm(target_deathrate ~ pctbachdeg25_over + incidencerate + povertypercent +
pctotherrace + pcths18_24 + birthrate + medianagemale + pctprivatecoverage +
pcths25_over + pctunemployed16_over + pctmarriedhouseholds +
percentmarried + pctemployed16_over + pctempprivcoverage +
pctwhite + pctnohs18_24 + medincome + pctpubliccoverage +
pctsomecol18_24 + avganncount + avgdeathsperyear + popest2015 +
medianagefemale,data = learningset5)
p2c= predict(model2c, newdata = testset5)
```
```{r}
cat('Step forward or backward:',sqrt(mean((p2c-testset5$target_deathrate)**2)),'\n')
```
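As an aside (a sketch, not run here; step_bic is a hypothetical name), a BIC-type selection could be obtained by replacing the default AIC penalty k = 2 with k = log(n):
```{r, eval=FALSE}
# BIC-flavoured backward selection
n <- nrow(learningset5)
step_bic <- stepAIC(model1b, direction = "backward", k = log(n), trace = 0)
```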
### Lasso
```{r}
# lasso
x = as.matrix(learningset5[-3])  # predictors (target_deathrate is column 3)
y = as.matrix(learningset5[3])   # response
```
We determine the best value of lambda by cross-validation.
```{r}
tmplasso=cv.glmnet(x,y)
plot(tmplasso)
model3=glmnet(x,y,alpha=1,lambda=tmplasso$lambda.1se)
```
```{r}
p3_a = predict(model3, newx = as.matrix(testset5[, -3]), s = tmplasso$lambda.1se)  # drop the target (column 3), matching x
```
```{r}
cat('LASSO:',sqrt(mean((p3_a-testset5$target_deathrate)^2)),'\n')
```
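For comparison (a sketch; model3_min and p3_min are hypothetical names), the same model could be fitted at lambda.min, the value minimising the CV error, rather than the more parsimonious one-standard-error choice lambda.1se:
```{r}
model3_min <- glmnet(x, y, alpha = 1, lambda = tmplasso$lambda.min)
p3_min <- predict(model3_min, newx = as.matrix(testset5[, -3]))
cat('LASSO (lambda.min):', sqrt(mean((p3_min - testset5$target_deathrate)^2)), '\n')
```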
Here the LASSO is used mainly as a variable selection tool: because the penalty
shrinks the coefficients, once the features are selected we refit an unpenalised OLS model on that subset.
```{r}
#Selected variables
print(model3$beta)
```
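The selected names can also be extracted programmatically (a small sketch; selected is a hypothetical helper name):
```{r}
# Names of the features with non-zero LASSO coefficients
selected <- rownames(model3$beta)[as.matrix(model3$beta)[, 1] != 0]
print(selected)
```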
```{r}
model3b = lm(target_deathrate~ avganncount + avgdeathsperyear + incidencerate +
    povertypercent + studypercap + medianagemale + medianagefemale +
    percentmarried + pctsomecol18_24 + pctbachdeg18_24 + pctwhite +
    pcths18_24 + pcths25_over + pctbachdeg25_over + pctunemployed16_over +
    pctprivatecoverage + pctempprivcoverage + pctotherrace +
    pctmarriedhouseholds + birthrate, data = learningset5)  # model3b avoids overwriting the glmnet fit
p3_b = predict(model3b, newdata = testset5)
cat('OLS on LASSO selected features:',sqrt(mean((p3_b-testset5$target_deathrate)^2)),'\n')
```
### We also fit a CART model, the tree-based building block of random forests.
```{r}
# construction of the maximal tree
model4=rpart(target_deathrate~., data=learningset5,
control = rpart.control(minsplit=2, cp=10^(-15)))
```
We check that we indeed obtained the maximal tree (zero training error).
```{r}
Prediction4max=predict(model4)
err=sum((Prediction4max-learningset5$target_deathrate)^2)
err
```
Since err = 0, we have the maximal tree.
Let's perform the pruning step and the final selection.
The information about the constructed subtrees is in the complexity table:
```{r}
CP= model4$cptable
CP[1:5,]
```
```{r}
#The first thing is to find the smallest CV error
cvmin=min(CP[,4])
cvmin
# and determine which row it corresponds to
r=which(CP[,4]==cvmin)
r
```
Now we construct the final tree.
```{r}
# 1-SE rule: threshold = smallest CV error + one standard error
t = CP[r,4] + CP[r,5]
# keep the simplest (first) subtree whose CV error is below the threshold
z = which(CP[,4] <= t)
z_selected = z[1]
model4_prune = prune(model4, cp = CP[z_selected,1])
```
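Since rpart.plot is already loaded, the pruned tree can also be visualised (a small addition):
```{r}
rpart.plot(model4_prune)
```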
```{r}
p4_b=predict(model4_prune, newdata = testset5)
cat('CART final tree:',sqrt(mean((p4_b-testset5$target_deathrate)^2)),'\n')
```
### Let's use VSURF to identify a subset of relevant variables, and use this subset to construct a CART tree.
```{r echo = FALSE, autodep = TRUE, cache = TRUE, results = 'hide', message=FALSE, warning=FALSE }
# VSURF with default parameters
tmp=VSURF(x,y)
```
```{r, cache = TRUE,autodep = TRUE}
summary(tmp)
```
VSURF performs variable selection using Random Forest.
It is a two-step procedure:
Step 1: elimination of the noisy variables.
-After several runs of RF, variables are sorted by their RF variable
importance (VI).
-Variables with small importance are eliminated, using a threshold given by
a CART model in which the response is the standard deviation of the VI and
the explanatory variable is the VI rank.
```{r, warning=FALSE}
plot(tmp, step = "thres", imp.sd = FALSE, var.names = TRUE)
```
The red line corresponds to the threshold value for the VI. Only the
variables whose averaged VI lies above this level are kept. At this step,
only one variable has been eliminated.
```{r}
# Indices and names of the variables kept after the thresholding step
print(tmp$varselect.thres)
print(colnames(x)[tmp$varselect.thres])
```
However, this does not mean that the remaining variables form the correct subset.
Step 2: variable selection.
First, we identify the subset of explanatory variables that gives the model with
the lowest OOB (out-of-bag) error. This is the selection step for
interpretation.
```{r}
print(tmp$varselect.interp)
print(colnames(x)[tmp$varselect.interp])
```
We now have 18 selected variables.
Finally, there is a testing step for prediction: a variable is kept only if the
decrease in OOB error it brings is larger than the average variation obtained
by adding noisy variables. This may further reduce the number of selected
variables.
```{r}
print(tmp$varselect.pred)
print(colnames(x)[tmp$varselect.pred])
```
We end up with 15 selected variables.
Let's fit a second CART tree using this subset of selected variables.
```{r}
# construction of the maximal tree
model5=rpart(target_deathrate~incidencerate + pctbachdeg25_over + pcths25_over +
medincome + pctemployed16_over + povertypercent +
pctpubliccoveragealone + pctunemployed16_over + pctpubliccoverage
+ pctblack + avgdeathsperyear + pctotherrace + medianagefemale +
pctmarriedhouseholds + percentmarried, data=learningset5,
control = rpart.control(minsplit=2, cp=10^(-15)))
```
We check again that we obtained the maximal tree.
```{r}
Tree5max=predict(model5)
err=sum((Tree5max-learningset5$target_deathrate)^2)
err
```
Since err = 0, we have the maximal tree.
Let's perform the pruning step and the final selection.
The information about the constructed subtrees is in the complexity table:
```{r}
CP= model5$cptable
CP[1:5,]
```
```{r}
#The first thing is to find the smallest CV error
cvmin=min(CP[,4])
cvmin
# and determine which row it corresponds to
r=which(CP[,4]==cvmin)
r
```
Now we construct the final tree.
```{r}
# 1-SE rule: threshold = smallest CV error + one standard error
t = CP[r,4] + CP[r,5]
# keep the simplest (first) subtree whose CV error is below the threshold
z = which(CP[,4] <= t)
z_selected = z[1]
model5_pruned = prune(model5, cp = CP[z_selected,1])
```
```{r}
p5=predict(model5_pruned, newdata = testset5)
cat('CART + VSURF final tree:',sqrt(mean((p5-testset5$target_deathrate)^2)),'\n')
```
### Random Forest
```{r}
model6=randomForest(target_deathrate~.,data=learningset5)
p6=predict(model6, newdata = testset5)
```
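To see which predictors drive the random forest (a small addition), we can plot its variable importance:
```{r}
varImpPlot(model6)
```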
### Summary of the models
```{r}
cat('OLS full dataset (numerical variables):', sqrt(mean((p1 - testset3$target_deathrate)^2)), '\n')
cat('OLS without outliers:', sqrt(mean((p1b - testset5$target_deathrate)^2)), '\n')
cat('Step default parameters:', sqrt(mean((p2 - testset5$target_deathrate)^2)), '\n')
cat('Step forward or backward:', sqrt(mean((p2c - testset5$target_deathrate)^2)), '\n')
cat('LASSO:', sqrt(mean((p3_a - testset5$target_deathrate)^2)), '\n')
cat('OLS on LASSO selected features:', sqrt(mean((p3_b - testset5$target_deathrate)^2)), '\n')
cat('CART final tree:', sqrt(mean((p4_b - testset5$target_deathrate)^2)), '\n')
cat('CART + VSURF final tree:', sqrt(mean((p5 - testset5$target_deathrate)^2)), '\n')
cat('Random Forest:', sqrt(mean((p6 - testset5$target_deathrate)^2)), '\n')
```
According to the RMSE, the best model is the random forest, followed by the forward/backward stepwise model.
To make the comparison more robust, k-fold cross-validation could be performed for each model, as sketched below.
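A minimal sketch of 5-fold cross-validation for the OLS model (not run here; folds and rmse_cv are hypothetical names), to be repeated for each competing model:
```{r, eval=FALSE}
set.seed(1703)
k <- 5
# assign each row of data5 to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(data5)))
rmse_cv <- sapply(1:k, function(i) {
  fit <- lm(target_deathrate ~ ., data = data5[folds != i, ])
  pred <- predict(fit, newdata = data5[folds == i, ])
  sqrt(mean((pred - data5$target_deathrate[folds == i])^2))
})
mean(rmse_cv)
```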