cv.svy {surveyCV} | R Documentation |
CV for survey data
Description
This is a cross-validation function designed for survey samples taken using an SRS, stratified, clustered, or clustered-and-stratified sampling design. It returns survey CV estimates of the mean loss for each model (MSE for linear models, or binary cross-entropy for logistic models).
Usage
cv.svy(
Data,
formulae,
nfolds = 5,
strataID = NULL,
clusterID = NULL,
nest = FALSE,
fpcID = NULL,
method = c("linear", "logistic"),
weightsID = NULL,
useSvyForFolds = TRUE,
useSvyForFits = TRUE,
useSvyForLoss = TRUE,
na.rm = FALSE
)
Arguments
Data |
Dataframe of dataset to be used for CV |
formulae |
Vector of formulas (as strings) for the GLMs to be compared in cross validation |
nfolds |
Number of folds to be used during cross validation, defaults to 5 |
strataID |
String of the variable name used to stratify during sampling, must be the same as in the dataset used |
clusterID |
String of the variable name used to cluster during sampling, must be the same as in the dataset used |
nest |
Specify nest = TRUE if clusters are nested within strata, defaults to FALSE |
fpcID |
String of the variable name used for finite population corrections, must be the same as in the dataset used; see svydesign in the survey package for details |
method |
String, must be either "linear" or "logistic"; determines the type of model fit during cross validation, defaults to "linear" |
weightsID |
String of the variable name in the dataset that contains sampling weights |
useSvyForFolds |
Specify useSvyForFolds = TRUE (default) to take svydesign into account when making folds; should not be set FALSE except for running simulations to understand the properties of surveyCV |
useSvyForFits |
Specify useSvyForFits = TRUE (default) to take svydesign into account when fitting models on training sets; should not be set FALSE except for running simulations to understand the properties of surveyCV |
useSvyForLoss |
Specify useSvyForLoss = TRUE (default) to take svydesign into account when calculating loss over test sets; should not be set FALSE except for running simulations to understand the properties of surveyCV |
na.rm |
Whether to drop cases with missing values when taking 'svymean' of test losses |
Details
If you have already created a svydesign object or fitted a svyglm, you will probably prefer the convenience wrapper functions cv.svydesign or cv.svyglm.
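As a minimal sketch of the wrapper workflow, the following builds a design object once and passes it on. The argument names design_object and glm_object are assumptions taken from memory of the wrappers' help pages; check ?cv.svydesign and ?cv.svyglm for the exact signatures.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

# Build the design object once, as for any other survey analysis
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Compare several formulas using the existing design
# (argument name design_object is an assumption; see ?cv.svydesign)
cv_des <- cv.svydesign(design_object = dstrat,
                       formulae = c("api00~ell", "api00~ell+meals"),
                       nfolds = 5)

# Cross-validate a single model that has already been fitted
# (argument name glm_object is an assumption; see ?cv.svyglm)
fit <- svyglm(api00 ~ ell + meals, design = dstrat)
cv_glm <- cv.svyglm(glm_object = fit, nfolds = 5)
```

With the wrappers, the strata, cluster, weight, and fpc information is read from the design object, so it does not need to be repeated as ID strings.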
For models other than linear or logistic regression, you can use folds.svy or folds.svydesign to generate CV fold IDs that respect any stratification or clustering in the survey design.
You can then carry out K-fold CV as usual, taking care to also use the survey design features and survey weights when fitting models in each training set and also when evaluating models against each test set.
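The manual procedure described above might look roughly like the sketch below. It assumes folds.svy takes the arguments Data, nfolds, strataID, and clusterID and returns one fold ID per row (see ?folds.svy). A svyglm fit stands in for whatever model you actually want to evaluate, and the column names fold and sqerr are illustrative choices.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

set.seed(1)  # fold assignment is random
K <- 5
dat <- apiclus1

# Fold IDs that keep each sampled cluster (dnum) within a single fold
# (assumed signature; see ?folds.svy)
dat$fold <- folds.svy(dat, nfolds = K, clusterID = "dnum")

# Define the design on the full sample so that subsets keep the design info
des <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = dat)

dat$sqerr <- NA_real_
for (k in seq_len(K)) {
  test <- dat$fold == k
  # Fit on the training folds, using the survey design and weights
  fit <- svyglm(api00 ~ ell + meals, design = subset(des, fold != k))
  # Predict for the held-out fold and record the squared errors
  pred <- as.numeric(predict(fit, newdata = dat[test, ]))
  dat$sqerr[test] <- (dat$api00[test] - pred)^2
}

# Design-based estimate of the mean test loss and its standard error
des <- update(des, sqerr = dat$sqerr)
cv_mse <- svymean(~sqerr, des)
cv_mse
```

For a different model class, replace the svyglm call and the loss calculation, keeping the weighted fit on each training set and the final svymean over the test losses.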
Value
Object of class svystat, which is a named vector of survey CV estimates of the mean loss (MSE for linear models, or binary cross-entropy for logistic models) for each model, with names ".Model_1", ".Model_2", etc. corresponding to the models provided in formulae; and with a var attribute giving the variances.
See surveysummary for details.
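As a brief sketch of working with the returned object, the standard svystat accessors from the survey package (coef and SE) can be used to pull out the estimates and their standard errors. The object name res is illustrative.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

set.seed(1)  # folds are assigned at random
res <- cv.svy(apistrat, c("api00~ell", "api00~ell+meals"),
              nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")

est <- coef(res)        # named vector of CV mean-loss estimates
se  <- SE(res)          # standard errors of those estimates
v   <- attr(res, "var") # the underlying variances

# Name of the model with the lowest estimated loss
names(which.min(est))
```

Because fold assignment is random, the estimates vary from run to run unless a seed is set, and differences between models are best judged relative to the standard errors.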
See Also
cv.svydesign for a wrapper to use with a svydesign object, or cv.svyglm for a wrapper to use with a svyglm object
Examples
# Compare CV MSEs and their SEs under 3 linear models
# for a stratified sample and a one-stage cluster sample,
# using data from the `survey` package
library(survey)
data("api", package = "survey")
# stratified sample
cv.svy(apistrat, c("api00~ell",
"api00~ell+meals",
"api00~ell+meals+mobility"),
nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
# one-stage cluster sample
cv.svy(apiclus1, c("api00~ell",
"api00~ell+meals",
"api00~ell+meals+mobility"),
nfolds = 5, clusterID = "dnum", weightsID = "pw", fpcID = "fpc")
# Compare CV MSEs and their SEs under 3 linear models
# for a stratified cluster sample with clusters nested within strata
data(NSFG_data)
library(splines)
cv.svy(NSFG_data, c("income ~ ns(age, df = 2)",
"income ~ ns(age, df = 3)",
"income ~ ns(age, df = 4)"),
nfolds = 4,
strataID = "strata", clusterID = "SECU",
nest = TRUE, weightsID = "wgt")
# Logistic regression example, using the same stratified cluster sample;
# instead of CV MSE, we calculate CV binary cross-entropy loss,
# where (as with MSE) lower values indicate better fitting models
# (NOTE: na.rm=TRUE is not usually ideal;
# it's used below purely for convenience, to keep the example short,
# but a thorough analysis would look for better ways to handle the missing data)
cv.svy(NSFG_data, c("KnowPreg ~ ns(age, df = 1)",
"KnowPreg ~ ns(age, df = 2)",
"KnowPreg ~ ns(age, df = 3)"),
method = "logistic", nfolds = 4,
strataID = "strata", clusterID = "SECU",
nest = TRUE, weightsID = "wgt",
na.rm = TRUE)