Nina Zumel, John Mount October 2019
These are notes on controlling the cross-validation plan in the R version of vtreat. For notes on the Python version of vtreat, please see here.
By default, R vtreat uses a y-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables. This default will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we’ll show how to replace the cross validation scheme in vtreat, starting with a simple k-way cross validation plan.
library(wrapr)
library(rqdatatable)
## Loading required package: rquery
library(vtreat)
As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:
n_row <- 1000
set.seed(2019)
d <- data.frame(
x = rnorm(n = n_row),
y = rbinom(n = n_row, size = 1, prob = 0.05)
)
summary(d)
## x y
## Min. :-3.23608 Min. :0.000
## 1st Qu.:-0.72730 1st Qu.:0.000
## Median :-0.13212 Median :0.000
## Mean :-0.07818 Mean :0.047
## 3rd Qu.: 0.59856 3rd Qu.:0.000
## Max. : 3.54146 Max. :1.000
First, try preparing this data using vtreat.
#
# create the treatment plan
#
k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
d,
varlist = 'x',
outcomename = 'y',
outcometarget = 1,
ncross = k,
splitFunction = kWayCrossValidation,
verbose = FALSE)
# prepare the training data
prepared_unstratified = treatment_unstratified$crossFrame
Let’s look at the distribution of the target outcome in each of the cross-validation groups:
# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
  d[label_column] <- 0
  for (i in seq_along(cross_plan)) {
    # 'app' holds the row indices this fold's model is applied to
    app <- cross_plan[[i]][['app']]
    d[app, label_column] <- i
  }
  return(d)
}
# label the rows
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)
# print(head(prepared_unstratified))
# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
project(.,
sum %:=% sum(y),
mean %:=% mean(y),
size %:=% n(),
groupby='group')
unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)
knitr::kable(unstratified_summary)
group | sum | mean | size |
---|---|---|---|
2 | 9 | 0.045 | 200 |
3 | 13 | 0.065 | 200 |
4 | 7 | 0.035 | 200 |
1 | 12 | 0.060 | 200 |
5 | 6 | 0.030 | 200 |
# standard deviation of target prevalence per cross-val fold
std_unstratified = sd(unstratified_summary[['mean']])
std_unstratified
## [1] 0.01524795
The target prevalence in the cross validation groups can vary fairly widely with respect to the “true” prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.
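As a rough check on how much spread to expect from an unstratified split, the per-fold prevalence can be approximated as a binomial proportion. This is only a back-of-the-envelope sketch: it treats each fold as an independent sample and ignores that the folds are drawn without replacement from the same 1000 rows. The variable names are illustrative.

```r
p <- 0.05        # "true" target prevalence used to generate the data
n_fold <- 200    # rows per fold (1000 rows split 5 ways)

# approximate standard deviation of the prevalence observed in one fold
se_fold <- sqrt(p * (1 - p) / n_fold)
se_fold  # about 0.015
```

This is in line with the observed standard deviation of 0.01524795 above, which suggests the fold-to-fold variation is ordinary sampling noise rather than anything unusual about this data set.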
In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:
treatment_stratified <- mkCrossFrameCExperiment(
d,
varlist = 'x',
outcomename = 'y',
outcometarget = 1,
ncross = k,
splitFunction = kWayStratifiedY,
verbose = FALSE)
# prepare the training data
prepared_stratified = treatment_stratified$crossFrame
# examine the target prevalence
prepared_stratified = label_rows(prepared_stratified, treatment_stratified$evalSets)
stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)
knitr::kable(stratified_summary)
group | sum | mean | size |
---|---|---|---|
5 | 9 | 0.045 | 200 |
1 | 10 | 0.050 | 200 |
3 | 10 | 0.050 | 200 |
4 | 9 | 0.045 | 200 |
2 | 9 | 0.045 | 200 |
# standard deviation of target prevalence
std_stratified = sd(stratified_summary[['mean']])
std_stratified
## [1] 0.002738613
The target prevalences in the stratified cross-validation groups are much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.
std_unstratified/std_stratified
## [1] 5.567764
If you want to cross-validate under another scheme (for example, stratifying on the prevalences of an input class), you can write your own custom cross-validation scheme and pass it into vtreat in a similar fashion as above. Your cross-validation scheme must have the same signature as vtreat’s kWayCrossValidation.
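A minimal sketch of such a custom scheme follows. It assumes, as we understand vtreat's kWayCrossValidation, that a split function takes the arguments (nRows, nSplits, dframe, y) and returns a list of folds, each a list with integer index vectors named train and app. The helper make_class_stratified_split and the segment labels are hypothetical names introduced for illustration; they are not part of vtreat.

```r
# Hypothetical helper (not part of vtreat): build a split function that
# stratifies folds on a supplied vector of class labels, one per row.
make_class_stratified_split <- function(strata) {
  force(strata)
  # Assumed to match the argument list of vtreat's kWayCrossValidation.
  function(nRows, nSplits, dframe, y) {
    stopifnot(length(strata) == nRows)
    # Sort rows by class, shuffling within each class, then deal them
    # out to folds round-robin so each class is spread evenly.
    ord <- order(as.character(strata), runif(nRows))
    fold <- integer(nRows)
    fold[ord] <- (seq_len(nRows) - 1L) %% nSplits + 1L
    lapply(seq_len(nSplits), function(i) {
      list(train = which(fold != i), app = which(fold == i))
    })
  }
}

# Small demonstration on made-up labels: 90 rows of "a", 10 rows of "b".
set.seed(2019)
segment <- rep(c("a", "b"), times = c(90, 10))
split_fn <- make_class_stratified_split(segment)
plan <- split_fn(nRows = 100, nSplits = 5, dframe = NULL, y = NULL)

# number of rare-class rows landing in each fold's application set
sapply(plan, function(s) sum(segment[s$app] == "b"))
# 2 2 2 2 2
```

The closure captures the stratifying labels directly, so the sketch does not depend on which columns vtreat hands to the split function in dframe. With real data it could be passed as splitFunction = make_class_stratified_split(d$segment), where segment would be an input column; the example data in this note has no such column.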
Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
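One way to reuse a plan in a later modeling step is sketched below: each row is scored by a model fit only on that fold's training rows. The sketch is self-contained and builds a stand-in plan in base R. With vtreat, the plan would instead be an object such as treatment_stratified$evalSets, assuming each element carries a train index vector alongside the app vector used earlier. The data frame d_demo and the glm model are illustrative choices only.

```r
set.seed(2019)
n <- 200
d_demo <- data.frame(x = rnorm(n))
d_demo$y <- rbinom(n, size = 1, prob = plogis(d_demo$x - 1))

# Stand-in cross-validation plan: a list of folds, each with 'train'
# and 'app' row indices (the format assumed for vtreat's evalSets).
k <- 5
fold <- sample(rep_len(seq_len(k), n))
plan <- lapply(seq_len(k), function(i) {
  list(train = which(fold != i), app = which(fold == i))
})

# Out-of-fold predictions: fit on 'train', predict on 'app'.
pred <- rep(NA_real_, n)
for (split in plan) {
  model <- glm(y ~ x,
               data = d_demo[split$train, , drop = FALSE],
               family = binomial())
  pred[split$app] <- predict(model,
                             newdata = d_demo[split$app, , drop = FALSE],
                             type = "response")
}
summary(pred)
```

Because the same folds drive both steps, a row's cross-frame values and its out-of-fold prediction are both built without using that row's own outcome.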
More notes on controlling vtreat cross-validation can be found here.
Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as seen in vtreat). We have some notes on this here.