---
title: "A Definitive Guide to Tune and Combine H2O Models in R"
output: github_document
bibliography: references.bib
---
```{r setup, include=FALSE}
# Build website:
# rmarkdown::render("README.Rmd", "github_document", output_file = "./docs/index.md", clean = TRUE)
library("pander")
panderOptions('table.continues', '')
panderOptions('table.continues.affix', '')
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
> Building well-tuned H2O models with **random hyper-parameter search** and combining them using a **stacking** approach
This tutorial shows how to use **random search** [@bergstra-2012] for hyper-parameter tuning in [H2O](https://github.com/h2oai) models and how to combine the well-tuned models using the **stacking / super learning** framework [@ledell-2015].
We focus on generating level-one data for a multinomial classification dataset from a famous [Kaggle](https://www.kaggle.com/) competition, the [Otto Group Product Classification](https://www.kaggle.com/c/otto-group-product-classification-challenge) challenge. The dataset contains 61878 training instances and 144368 test instances described by 93 numerical features. Each instance belongs to one of 9 product categories.
All experiments were conducted on a **64-bit Ubuntu 16.04.1 LTS** machine with an **Intel Core i7-6700HQ 2.60GHz** CPU and **16GB DDR4 RAM**. We use `R` version **3.3.1** and `h2o` package version **3.10.0.9**.
The source code and all output files are available on [GitHub](https://github.com/davpinto/h2o-tutorial).
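As a quick sanity check (assuming `train.csv.zip` and `test.csv.zip` have been downloaded from Kaggle into the `data` folder), the short sketch below reads both files and inspects their dimensions and class distribution:
```{r, eval=FALSE}
## Inspect the raw data files
library("readr")
tr.data <- readr::read_csv("./data/train.csv.zip")
te.data <- readr::read_csv("./data/test.csv.zip")
dim(tr.data)          # 61878 x 95 (93 features + id + target)
dim(te.data)          # 144368 x 94 (93 features + id)
table(tr.data$target) # instance counts for the 9 classes
```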
## Repository Structure
When conducting a large experiment, it is important to use a clear and robust repository structure, such as the following:
```
root
│ README.md
│ project-name.Rproj
│
└── data
│ │ train.csv.zip
│ │ test.csv.zip
│ │ main.R
│ │...
│
└── gbm
│ │ main.R
│ │ gbm_output.csv.zip
│ │ gbm_model
│ │...
│
└── glm
│ │ main.R
│ │ glm_output.csv.zip
│ │ glm_model
│ │...
│
...
```
In the `root` directory we save a `README.md` file describing the experiment, and an [RStudio project](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) if we are using the **RStudio IDE** (strongly recommended). In the `data` folder we save the data files and an `R` script to read them into memory. Then we create a separate folder for each machine learning algorithm, where we store the `R` scripts that run it and the generated outputs, such as predictions and fitted models.
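If you prefer to create this skeleton programmatically, a minimal sketch is shown below (the algorithm folder names beyond `gbm` and `glm` are assumptions that follow the same pattern):
```{r, eval=FALSE}
## Create the repository skeleton: one folder per algorithm plus a data folder
## (folder names beyond gbm/glm are assumptions following the same pattern)
dirs <- c("data", "gbm", "glm", "randomforest", "deeplearning", "naivebayes")
for (d in dirs) {
  dir.create(d, showWarnings = FALSE)
}
```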
## Split Data in k-Folds
The first step is to split the data into folds. We will use k-fold cross-validation for **parameter tuning** and then to generate the **level-one** data used in the **stacking** step. All algorithms use the same fold ids, so we generate them once with the `caret` package and save the results in the `./data/` folder. Here we use `k = 5`.
We fix the random number generator with `set.seed(2020)` for reproducibility.
```{r, eval=FALSE, results='hide'}
## Load required packages
library("readr")
library("caret")
## Read training data
tr.data <- readr::read_csv("./data/train.csv.zip")
y <- factor(tr.data$target, levels = paste("Class", 1:9, sep = "_"))
## Create stratified data folds
nfolds <- 5
set.seed(2020)
folds.id <- caret::createFolds(y, k = nfolds, list = FALSE)
set.seed(2020)
folds.list <- caret::createFolds(y, k = nfolds, list = TRUE)
save("folds.id", "folds.list", file = "./data/cv_folds.rda",
compress = "bzip2")
```
## Import Data to H2O
```{r, eval=FALSE, results='hide'}
## Load required packages
library("h2o")
library("magrittr")
## Instantiate H2O cluster
h2o.init(max_mem_size = '8G', nthreads = 6)
h2o.removeAll()
## Load training and test data
label.name <- 'target'
train.hex <- h2o.importFile(
path = normalizePath("./data/train.csv.zip"),
destination_frame = 'train_hex'
)
train.hex[,label.name] <- h2o.asfactor(train.hex[,label.name])
test.hex <- h2o.importFile(
path = normalizePath("./data/test.csv.zip"),
destination_frame = 'test_hex'
)
input.names <- h2o.colnames(train.hex) %>% setdiff(c('id', label.name))
## Assign data folds
load('./data/cv_folds.rda')
train.hex <- h2o.cbind(train.hex, as.h2o(data.frame('cv' = folds.id),
destination_frame = 'fold_idx'))
h2o.colnames(train.hex)
```
## Tuning GBM
For more details about GBM parameters, take a look at the tutorial [Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/). There is also a great tutorial showing how to build a well-tuned H2O GBM model: the [H2O GBM Tuning Tutorial for R](http://blog.h2o.ai/2016/06/h2o-gbm-tuning-tutorial-for-r/).
### Random Parameter Search
```{r, eval=FALSE, results='hide'}
## Random search for parameter tuning
gbm.params <- list(
max_depth = seq(2, 24, by = 2),
min_rows = seq(10, 150, by = 10), # minimum observations required in a terminal node (leaf)
sample_rate = seq(0.1, 1, by = 0.1), # row sample rate per tree (bootstrap = 0.632)
col_sample_rate = seq(0.1, 1, by = 0.1), # column sample rate per split
col_sample_rate_per_tree = seq(0.1, 1, by = 0.1), # column sample rate per tree
nbins = round(2 ^ seq(2, 6, length = 15)), # number of bins for numerical feature discretization
histogram_type = c("UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin")
)
gbm.grid <- h2o.grid(
algorithm = "gbm", grid_id = "gbm_grid",
x = input.names, y = label.name, training_frame = train.hex,
fold_column = "cv", distribution = "multinomial", ntrees = 500,
learn_rate = 0.1, learn_rate_annealing = 0.995,
stopping_rounds = 2, stopping_metric = 'logloss', stopping_tolerance = 1e-5,
score_each_iteration = FALSE, score_tree_interval = 10,
keep_cross_validation_predictions = TRUE,
seed = 2020, max_runtime_secs = 30 * 60,
search_criteria = list(
strategy = "RandomDiscrete", max_models = 25,
max_runtime_secs = 12 * 60 * 60, seed = 2020
),
hyper_params = gbm.params
)
```
### Select the Best Parameters
```{r, eval=FALSE, results='hide'}
## Get best model
grid.table <- h2o.getGrid("gbm_grid", sort_by = "logloss", decreasing = FALSE)@summary_table
save(grid.table, file = "./gbm/grid_table.rda", compress = "bzip2")
best.gbm <- h2o.getModel(grid.table$model_ids[1])
h2o.logloss(best.gbm@model$cross_validation_metrics)
h2o.saveModel(best.gbm, path = "./gbm", force = TRUE)
file.rename(from = paste("gbm", grid.table$model_ids[1], sep = "/"), to = "gbm/best_model")
best.params <- best.gbm@allparameters
save(best.params, file = "./gbm/best_params.rda", compress = "bzip2")
head(grid.table, 5)
```
```{r, echo=FALSE}
load("./gbm/grid_table.rda")
grid.table$logloss <- as.numeric(grid.table$logloss)
grid.table$logloss <- round(grid.table$logloss, 4)
knitr::kable(head(grid.table, 5))
```
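If you come back to this experiment in a later session, the saved model can be reloaded directly; a short sketch (the path matches the `file.rename` call above):
```{r, eval=FALSE}
## Reload the best GBM model saved above
best.gbm <- h2o.loadModel(normalizePath("./gbm/best_model"))
h2o.logloss(best.gbm@model$cross_validation_metrics)
```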
### Generate Level-one Training Data
```{r, eval=FALSE}
## Get predictions for the training cv folds
var.names <- paste("gbm", 1:h2o.nlevels(train.hex[,label.name]), sep = "_")
gbm.train.hex <- h2o.getFrame(best.gbm@model$cross_validation_holdout_predictions_frame_id$name)
gbm.train.hex[,"predict"] <- NULL
colnames(gbm.train.hex) <- var.names
gbm.train.hex <- h2o.round(gbm.train.hex, 6)
gbm.train.hex <- h2o.cbind(gbm.train.hex, train.hex[,label.name])
write.csv(
as.data.frame(gbm.train.hex),
file = gzfile('./gbm/gbm_levone_train.csv.gz'),
row.names = FALSE
)
```
### Generate Level-one Test Data
```{r, eval=FALSE}
## Get predictions for the test set
gbm.test.hex <- predict(best.gbm, test.hex)
gbm.test.hex[,"predict"] <- NULL
gbm.test.hex <- h2o.round(gbm.test.hex, 6)
colnames(gbm.test.hex) <- var.names # write.csv() ignores 'col.names', so set the names on the frame
write.csv(
as.data.frame(gbm.test.hex),
file = gzfile('./gbm/gbm_levone_test.csv.gz'),
row.names = FALSE
)
```
### Generate Test Predictions
```{r, eval=FALSE}
## Save output for the test set
gbm.out.hex <- h2o.cbind(test.hex[,"id"], gbm.test.hex)
write.csv(
as.data.frame(gbm.out.hex),
file = gzfile('./gbm/gbm_output.csv.gz'),
row.names = FALSE
)
```
**Top 20%** with a single GBM model.
## Tuning RandomForest
...
## Tuning DeepLearning
...
## Tuning GLM
...
## Tuning NaiveBayes
...
## Super Learner
The approach presented here allows you to combine H2O with other powerful machine learning libraries in `R`, such as [XGBoost](https://github.com/dmlc/xgboost/tree/master/R-package), [MXNet](https://github.com/dmlc/mxnet/tree/master/R-package), [FastKNN](https://github.com/davpinto/fastknn), and [caret](https://github.com/topepo/caret), through the level-one data in `.csv` format. You can also use the level-one data with `Python` libraries like [scikit-learn](http://scikit-learn.org/) and [Keras](https://github.com/fchollet/keras).
We recommend the `R` package [h2oEnsemble](https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble) as an alternative to easily build stacked models with H2O algorithms.
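To illustrate how the level-one files can be combined outside `h2oEnsemble`, here is a minimal stacking sketch that fits a multinomial GLM metalearner on the GBM and GLM level-one training data (the `glm_levone_train.csv.gz` file name is an assumption that follows the GBM naming pattern above):
```{r, eval=FALSE}
## Minimal stacking sketch: multinomial GLM metalearner on level-one data
library("h2o")
library("readr")
gbm.lv1 <- readr::read_csv("./gbm/gbm_levone_train.csv.gz")
glm.lv1 <- readr::read_csv("./glm/glm_levone_train.csv.gz") # assumed file name
## Keep a single copy of the target column when binding the predictions
levone <- cbind(gbm.lv1[, setdiff(names(gbm.lv1), "target")], glm.lv1)
levone.hex <- as.h2o(levone, destination_frame = "levone_hex")
levone.hex[,"target"] <- h2o.asfactor(levone.hex[,"target"])
meta.fit <- h2o.glm(
  x = setdiff(h2o.colnames(levone.hex), "target"),
  y = "target",
  training_frame = levone.hex,
  family = "multinomial"
)
```
The corresponding level-one test files can be combined in the same way and scored with `h2o.predict(meta.fit, ...)` to produce the final submission.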
## References