split_data {qgcomp} | R Documentation |
Perform sample splitting
Description
This is a convenience function that splits the input data into two independent sets,
optionally accounting for single-level clustering. The two sets can be used with
qgcomp.partials to get "partial" positive/negative effect estimates from the original
data, where sample splitting is necessary to get valid confidence intervals and p-values.
Sample splitting is also useful for any sort of exploratory model selection, where the
training data are used to select the model and the validation data are used to generate
the final estimates. This process should not be iterative: do not "check" the results in
the validation data and then re-fit, as this invalidates inference in the validation set.
For example, you could use the training data to select non-linear terms for the model and
then re-fit in the validation data to get unbiased estimates.
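The select-then-estimate workflow described above can be sketched in base R. This is only an illustration on simulated data, with ordinary lm fits standing in for qgcomp models. The variable names and the AIC-based selection rule are assumptions made for the example, not part of the package.

```r
# Illustrative only: base R on simulated data, not the qgcomp implementation.
set.seed(42)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.5 * dat$x + rnorm(n)

# Split once: 40% of rows for model selection, the rest for estimation.
trainidx <- sort(sample(seq_len(n), size = round(0.4 * n)))
valididx <- setdiff(seq_len(n), trainidx)
traindata <- dat[trainidx, ]
validdata <- dat[valididx, ]

# Step 1 (training data only): compare candidate model forms.
candidates <- list(
  linear    = y ~ x,
  quadratic = y ~ x + I(x^2)
)
train_aic <- sapply(candidates, function(f) AIC(lm(f, data = traindata)))
chosen <- names(which.min(train_aic))

# Step 2 (validation data only): fit the chosen form once and report it.
# Do not go back to step 1 after looking at this fit.
final_fit <- lm(candidates[[chosen]], data = validdata)
summary(final_fit)$coefficients
```

The key design point is that the validation rows are touched exactly once, after the model form has been fixed.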
Usage
split_data(data, cluster = NULL, prop.train = 0.4)
Arguments
data |
A data.frame for use in qgcomp fitting |
cluster |
NULL (default) or character value naming a cluster identifier in the data. This is to prevent observations from a single cluster being in both the training and validation data, which reduces the effectiveness of sample splitting. |
prop.train |
proportion of the original dataset (or proportion of the clusters identified via the 'cluster' parameter) that is used in the training data (default=0.4) |
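To make the interaction between cluster and prop.train concrete, here is a minimal base R sketch of cluster-level splitting. The helper cluster_split_sketch and the toy data are hypothetical and written for this illustration. The package's actual implementation may differ in its details, such as how the number of training clusters is rounded.

```r
# Hypothetical helper for illustration; not the split_data() source code.
cluster_split_sketch <- function(data, cluster, prop.train = 0.4) {
  ids <- unique(data[[cluster]])
  # Sample whole clusters, so prop.train is a proportion of clusters, not rows.
  n_train <- round(prop.train * length(ids))
  train_ids <- ids[sample.int(length(ids), size = n_train)]
  in_train <- data[[cluster]] %in% train_ids
  trainidx <- which(in_train)
  valididx <- which(!in_train)
  list(
    trainidx = trainidx,
    valididx = valididx,
    traindata = data[trainidx, , drop = FALSE],
    validdata = data[valididx, , drop = FALSE]
  )
}

set.seed(1)
# Toy data: 10 clusters with 3 observations each.
toy <- data.frame(id = rep(1:10, each = 3), x = rnorm(30))
spl <- cluster_split_sketch(toy, cluster = "id", prop.train = 0.4)

# No cluster identifier appears on both sides of the split.
intersect(spl$traindata$id, spl$validdata$id)
# Number of clusters assigned to the training data.
length(unique(spl$traindata$id))
```

Because whole clusters are sampled, the share of rows in the training data matches prop.train only when clusters are of equal size, as they are in this toy example.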
Value
A list of the following type: list( trainidx = trainidx, valididx = valididx, traindata = traindata, validdata = validdata )
e.g. if you call spl = split_data(dat) with the default prop.train, then spl$traindata
will contain a 40% sample from the original data, spl$validdata will contain the other
60%, and spl$trainidx and spl$valididx will contain integer indexes that track the row
numbers (from the original data dat) of the training and validation samples.
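The relationship between the four list elements can be illustrated with a hand-built stand-in that uses the same element names. The object below is constructed directly with sample() rather than returned by split_data, and the data frame dat is a toy example. The checks show the properties described above and should not be read as guarantees about the package's output.

```r
# A hand-built stand-in with the same four element names (assumption:
# indexes are row positions in the original data, as described above).
set.seed(7)
dat <- data.frame(a = 1:10, b = letters[1:10])
trainidx <- sort(sample(nrow(dat), 4))
valididx <- setdiff(seq_len(nrow(dat)), trainidx)
spl <- list(
  trainidx = trainidx,
  valididx = valididx,
  traindata = dat[trainidx, ],
  validdata = dat[valididx, ]
)

# The two index vectors do not overlap and together cover every row.
no_overlap <- length(intersect(spl$trainidx, spl$valididx)) == 0
covers_all <- setequal(c(spl$trainidx, spl$valididx), seq_len(nrow(dat)))

# Subsetting the original data by index recovers each sample.
same_train <- identical(dat[spl$trainidx, ], spl$traindata)
same_valid <- identical(dat[spl$valididx, ], spl$validdata)

# The indexes can also be used to label rows of the original data.
dat$split <- ifelse(seq_len(nrow(dat)) %in% spl$trainidx, "train", "valid")
table(dat$split)
```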
Examples
data(metals)
set.seed(1231124)
spl = split_data(metals)
Xnm <- c(
  'arsenic','barium','cadmium','calcium','chromium','copper',
  'iron','lead','magnesium','manganese','mercury','selenium','silver',
  'sodium','zinc'
)
dim(spl$traindata) # 181 observations = 40% of total
dim(spl$validdata) # 271 observations = 60% of total
splitres <- qgcomp.partials(fun="qgcomp.glm.noboot", f=y~., q=4,
  traindata=spl$traindata, validdata=spl$validdata, expnms=Xnm)
splitres
# also used to compare linear vs. non-linear fits (useful if you have enough data)
set.seed(1231)
spl = split_data(metals, prop.train=.5)
lin = qgcomp.glm.boot(f=y~., q=4, expnms=Xnm, B=5, data=spl$traindata)
nlin1 = qgcomp.glm.boot(f=y~. + I(manganese^2) + I(calcium^2), expnms=Xnm, deg=2,
  q=4, B=5, data=spl$traindata)
nlin2 = qgcomp.glm.boot(f=y~. + I(arsenic^2) + I(cadmium^2), expnms=Xnm, deg=2,
  q=4, B=5, data=spl$traindata)
AIC(lin);AIC(nlin1);AIC(nlin2)
# linear has lowest training AIC, so base final fit off that (and bootstrap not needed)
qgcomp.glm.noboot(f=y~., q=4, expnms=Xnm, data=spl$validdata)