buildEvalSets {vtreat} | R Documentation |
Build set carve-up for out-of-sample evaluation.
Description
Return a carve-up of seq_len(nRows). Very useful for any sort of nested model situation (such as data prep, stacking, or super-learning).
Usage
buildEvalSets(
  nRows,
  ...,
  dframe = NULL,
  y = NULL,
  splitFunction = NULL,
  nSplits = 3
)
Arguments
nRows | scalar, >= 1, number of rows to sample from. |
... | no additional arguments; declared to force named binding of later arguments. |
dframe | (optional) original data.frame, passed to user splitFunction. |
y | (optional) numeric vector, outcome variable (possibly to stratify on), passed to user splitFunction. |
splitFunction | (optional) function taking arguments nRows, nSplits, dframe, and y; returning a user-desired split. |
nSplits | integer, target number of splits. |
Details
The return value also carries an attribute "splitmethod" that describes how the split was performed. attr(returnValue,'splitmethod') is one of: 'notsplit' (data was not split; corner cases like single-row data sets), 'oneway' (leave-one-out holdout), 'kwaycross' (a simple partition), 'userfunction' (the user-supplied function was actually used), or a user-specified attribute. Any user-desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not apply unless you see that 'userfunction' has been used.
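As a minimal sketch of checking this attribute, the helper below reports whether a plan carries one of the built-in split labels listed above. The helper name usedBuiltInSplit is hypothetical and not part of vtreat; the call to buildEvalSets only runs if vtreat is installed.

```r
# Hypothetical helper (not part of vtreat): TRUE when the plan's
# "splitmethod" attribute is one of the built-in labels named above.
usedBuiltInSplit <- function(plan) {
  isTRUE(attr(plan, "splitmethod") %in% c("notsplit", "oneway", "kwaycross"))
}

# Only runs when vtreat is available.
if (requireNamespace("vtreat", quietly = TRUE)) {
  plan <- vtreat::buildEvalSets(200)
  print(attr(plan, "splitmethod"))
  print(usedBuiltInSplit(plan))
}
```

If usedBuiltInSplit returns TRUE after a splitFunction was supplied, the user function was not used and any invariants it was meant to enforce should not be assumed.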
The intent is that the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws, or returns an unacceptable carve-up, then vtreat::buildEvalSets returns its own eval set plan. The signature of splitFunction should be splitFunction(nRows,nSplits,dframe,y) where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original data.frame (useful for any group control variables), and y is a numeric vector representing outcome (useful for outcome stratification).
Note that buildEvalSets may not always return a partition, for example with one-row data frames, or when the user split function chooses to make rows eligible for application a different number of times.
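The splitFunction contract described above can be illustrated with a small grouped split. This is a sketch under stated assumptions, not vtreat's own implementation: the function name groupedSplit, the data frame dGrouped, and the column name group are all hypothetical. It returns NULL on cases it does not handle, leaving the fallback to buildEvalSets.

```r
# Hypothetical user split function: keeps all rows that share
# dframe$group in the same application set.
# The column name "group" is an assumption made for this sketch.
groupedSplit <- function(nRows, nSplits, dframe, y) {
  # Decline the cases we do not handle by returning NULL.
  if (is.null(dframe) || is.null(dframe$group) || nrow(dframe) != nRows) {
    return(NULL)
  }
  groups <- unique(dframe$group)
  if (nSplits < 2 || length(groups) < nSplits) {
    return(NULL)
  }
  # Assign each group to a fold at random; every fold gets a group.
  foldOfGroup <- sample(rep_len(seq_len(nSplits), length(groups)))
  foldOfRow <- foldOfGroup[match(dframe$group, groups)]
  # One train/app pair per fold, in the list-of-lists shape.
  lapply(seq_len(nSplits), function(k) {
    list(train = which(foldOfRow != k), app = which(foldOfRow == k))
  })
}

# Example data (hypothetical): 4 groups of 3 rows each.
dGrouped <- data.frame(group = rep(c("a", "b", "c", "d"), each = 3),
                       y = rnorm(12))

# Only runs when vtreat is available.
if (requireNamespace("vtreat", quietly = TRUE)) {
  plan <- vtreat::buildEvalSets(nrow(dGrouped), dframe = dGrouped,
                                y = dGrouped$y,
                                splitFunction = groupedSplit)
  print(attr(plan, "splitmethod"))
}
```

Checking the printed "splitmethod" value is the way to confirm that the grouped plan, rather than a built-in fallback, was actually used.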
Value
list of lists where the app portion of the sub-lists is a disjoint carve-up of seq_len(nRows) and each list has a train portion disjoint from app.
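The structure described above can be checked directly. The following is a minimal base R sketch; the helper name checkPlan and the hand-built examplePlan are hypothetical and only mimic the list-of-lists shape with train and app elements.

```r
# Hypothetical checker (not part of vtreat) for the two properties
# stated above: app sets carve up seq_len(nRows), and each train set
# is disjoint from its own app set.
checkPlan <- function(plan, nRows) {
  appRows <- unlist(lapply(plan, function(ei) ei$app))
  appIsPartition <- identical(sort(as.integer(appRows)), seq_len(nRows))
  trainDisjointFromApp <- all(vapply(
    plan,
    function(ei) length(intersect(ei$train, ei$app)) == 0,
    logical(1)
  ))
  c(appIsPartition = appIsPartition,
    trainDisjointFromApp = trainDisjointFromApp)
}

# Hand-built two-way plan over 4 rows, in the same shape.
examplePlan <- list(
  list(train = 3:4, app = 1:2),
  list(train = 1:2, app = 3:4)
)
print(checkPlan(examplePlan, 4))
```

Corner cases such as single-row data, where the documentation notes the data is not split, may not satisfy both checks.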
See Also
kWayCrossValidation, kWayStratifiedY, and makekWayCrossValidationGroupedByColumn
Examples
# basic use
library(vtreat)
buildEvalSets(200)
# longer example
# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData, applicationData) {
  model <- lm(y ~ x, data = trainData)
  predict(model, newdata = applicationData)
}
simulateOutOfSampleTrainEval <- function(d, fitApplyFn) {
  eSets <- buildEvalSets(nrow(d))
  # fit on each train set, predict on the matching application set
  evals <- lapply(eSets,
                  function(ei) { fitApplyFn(d[ei$train, ], d[ei$app, ]) })
  # place each prediction back in its original row position
  pred <- numeric(nrow(d))
  for (eii in seq_along(eSets)) {
    pred[eSets[[eii]]$app] <- evals[[eii]]
  }
  pred
}
# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x = rnorm(5), y = rnorm(5),
                outOfSampleEst = NA, inSampleEst = NA)
# fit model on all data
d$inSampleEst <- fitModelAndApply(d,d)
# compute in-sample R^2 (above zero, falsely shows a
# relation until we adjust for degrees of freedom)
1-sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
d$outOfSampleEst <- simulateOutOfSampleTrainEval(d,fitModelAndApply)
# compute out-of-sample R^2 (not positive,
# evidence of no relation)
1-sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)