resampleData {cSEM} | R Documentation |
Resample data
Description
Resample data from a data set using common resampling methods.
For bootstrap or jackknife resampling, package users usually do not need to
call this function directly and can use resamplecSEMResults() instead.
Usage
resampleData(
.object = NULL,
.data = NULL,
.resample_method = c("bootstrap", "jackknife", "permutation",
"cross-validation"),
.cv_folds = 10,
.id = NULL,
.R = 499,
.seed = NULL
)
Arguments
.object |
An R object of class cSEMResults resulting from a call to |
.data |
A |
.resample_method |
Character string. The resampling method to use. One of: "bootstrap", "jackknife", "permutation", or "cross-validation". Defaults to "bootstrap". |
.cv_folds |
Integer. The number of cross-validation folds to use. Setting
|
.id |
Character string or integer. A character string giving the name or
an integer of the position of the column of |
.R |
Integer. The number of bootstrap runs, permutation runs
or cross-validation repetitions to use. Defaults to |
.seed |
Integer or |
Details
The function resampleData() is general purpose. It simply resamples data
from a data set according to the resampling method provided
via the .resample_method argument and returns a list of resamples.
Currently, bootstrap, jackknife, permutation, and cross-validation
(both leave-one-out (LOOCV) and k-fold cross-validation) are implemented.
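As a rough illustration of what "a list of resamples" means for the bootstrap, the following base R sketch draws rows with replacement. The helper `bootstrap_sketch()` and the toy data are hypothetical and only mirror the idea; this is not the package's implementation.

```r
# Hypothetical helper (not part of cSEM): draw R bootstrap samples from a
# data frame by sampling row indices with replacement.
bootstrap_sketch <- function(data, R = 499) {
  n <- nrow(data)
  lapply(seq_len(R), function(i) {
    data[sample.int(n, size = n, replace = TRUE), , drop = FALSE]
  })
}

set.seed(1)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
boot <- bootstrap_sketch(toy, R = 5)

length(boot)    # number of resamples
dim(boot[[1]])  # each resample has the dimensions of the original data
```

Each list element is a full-size data set, so memory use grows linearly with the number of runs.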
The user may provide the data set to resample either explicitly via the .data
argument or implicitly by providing a cSEMResults object to .object,
in which case the original data used in the call that created the
cSEMResults object is used for resampling.
If both a cSEMResults object and a data set via .data are provided,
the former is ignored.
As csem() accepts a single data set, a list of data sets, as well as data sets
that contain a column name used to split the data into groups,
the cSEMResults object may contain multiple data sets.
In this case, resampling is done by data set or group. Note that, depending
on the number of data sets/groups provided, this computation may be slower
as resampling will be repeated for each data set/group.
To split data provided via the .data argument into groups, the column name or
the column index of the column containing the group levels to split the data
must be given to .id. If data that contains grouping is taken from
a cSEMResults object, .id is taken from the object information. Hence,
providing .id is redundant in this case and therefore ignored.
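The idea of identifying the grouping column either by name or by position can be sketched in base R as follows. The helper `split_by_id()` is hypothetical, and whether the grouping column is kept in the resulting data sets is an assumption of this sketch, not a statement about the package.

```r
# Hypothetical helper (not part of cSEM): split a data frame into groups,
# accepting either a column name or a column position, similar to .id.
split_by_id <- function(data, id) {
  # Resolve a position to a column name so both inputs behave the same
  id_name <- if (is.character(id)) id else names(data)[id]
  split(data, data[[id_name]])
}

toy <- data.frame(x1 = 1:6, group = rep(c("a", "b"), each = 3))

by_name <- split_by_id(toy, "group")
by_pos  <- split_by_id(toy, 2)

identical(by_name, by_pos)  # both routes give the same split
names(by_name)              # one list element per group level
```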
The number of bootstrap or permutation runs as well as the number of
cross-validation repetitions is given by .R. The default is 499
but should be increased in real applications. See e.g.,
Hesterberg (2015), p.380 for recommendations concerning
the bootstrap. For jackknife .R is ignored as it is based on the N
leave-one-out data sets.
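Why .R has no role for the jackknife can be seen from a minimal base R sketch: the number of resamples is fixed by the number of rows. The helper `jackknife_sketch()` is hypothetical and only illustrates the leave-one-out construction.

```r
# Hypothetical helper (not part of cSEM): build the N leave-one-out data sets.
jackknife_sketch <- function(data) {
  lapply(seq_len(nrow(data)), function(i) data[-i, , drop = FALSE])
}

toy  <- data.frame(x1 = 1:5, x2 = 6:10)
jack <- jackknife_sketch(toy)

length(jack)                      # one resample per observation
vapply(jack, nrow, integer(1))    # each resample omits exactly one row
```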
Choosing .resample_method = "permutation" for ungrouped data causes an error
as permutation will simply reorder the observations, which is usually not
meaningful. If a list of data is provided,
each list element is assumed to represent the observations belonging to one
group. In this case, data is pooled and group adherence permuted.
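Pooling the data and permuting group adherence can be sketched in base R as below. The helper `permute_groups_sketch()` and the column name `group` are hypothetical; the sketch only illustrates the principle that group sizes are preserved while membership is shuffled.

```r
# Hypothetical helper (not part of cSEM): pool a named list of data frames
# and reassign the group labels at random, keeping group sizes fixed.
permute_groups_sketch <- function(data_list, R = 499) {
  pooled <- do.call(rbind, data_list)
  labels <- rep(names(data_list), times = vapply(data_list, nrow, integer(1)))
  lapply(seq_len(R), function(i) {
    cbind(pooled, group = sample(labels))  # shuffled group membership
  })
}

set.seed(1)
groups <- list(
  a = data.frame(x1 = rnorm(4)),
  b = data.frame(x1 = rnorm(6))
)
perm <- permute_groups_sketch(groups, R = 3)

length(perm)            # number of permutation runs
table(perm[[1]]$group)  # group sizes are unchanged
```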
For cross-validation the number of folds (k) defaults to 10. It may be
changed via the .cv_folds argument. Setting k = 2 (not 1!) splits
the data into a single training and test data set. Setting k = N (where N
is the number of observations) produces leave-one-out cross-validation samples.
Note: 1.) At least 2 folds are required (k > 1); 2.) k cannot be larger than N;
3.) If N/k is not an integer the last fold will have fewer observations.
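The constraints on k can be illustrated with a small base R sketch. The helper `kfold_sketch()` is hypothetical, and the way it distributes the remainder when N/k is not an integer (earlier folds receive one extra row) is an assumption of the sketch that may differ from the package.

```r
# Hypothetical helper (not part of cSEM): shuffle the rows and cut them
# into k folds; requires 1 < k <= N.
kfold_sketch <- function(data, k = 10) {
  n <- nrow(data)
  stopifnot(k > 1, k <= n)
  fold_id  <- sort(rep_len(seq_len(k), n))     # fold sizes differ by at most 1
  shuffled <- data[sample.int(n), , drop = FALSE]
  unname(split(shuffled, fold_id))
}

set.seed(1)
toy <- data.frame(x1 = rnorm(10), x2 = rnorm(10))

folds <- kfold_sketch(toy, k = 3)
vapply(folds, nrow, integer(1))   # unequal fold sizes since 10/3 is not an integer

loocv <- kfold_sketch(toy, k = nrow(toy))
length(loocv)                     # k = N gives one observation per fold
```

Each fold would serve once as the test set, with the remaining folds combined as training data.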
Random number generation (RNG) uses the L'Ecuyer-CMRG RNG stream as implemented in the future.apply package (Bengtsson 2018). See ?future_lapply for details. By default a random seed is chosen.
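The effect of fixing a seed with this generator can be sketched in base R. The helper `draw_with_seed()` is hypothetical; it only shows that the L'Ecuyer-CMRG generator gives reproducible draws for a given seed and does not reproduce the per-element stream handling done by future.apply.

```r
# Hypothetical helper: draw random integers under the L'Ecuyer-CMRG
# generator with a fixed seed, restoring the previous generator afterwards.
draw_with_seed <- function(seed, n = 5) {
  old_kind <- RNGkind()[1]
  on.exit(RNGkind(old_kind))
  set.seed(seed, kind = "L'Ecuyer-CMRG")
  sample.int(100, n, replace = TRUE)
}

a <- draw_with_seed(2364)
b <- draw_with_seed(2364)

identical(a, b)  # the same seed reproduces the same draw
```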
Value
The structure of the output depends on the type of input and the resampling method:
- Bootstrap
If a matrix or data.frame without grouping variable is provided (i.e., .id = NULL), the result is a list of length .R (default 499). Each element of that list is a bootstrap (re)sample. If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), resampling is done by group. Hence, the result is a list of length equal to the number of groups with each list element containing .R bootstrap samples based on the N_g observations of group g.
- Jackknife
If a matrix or data.frame without grouping variable is provided (.id = NULL), the result is a list of length equal to the number of observations/rows (N) of the data set provided. Each element of that list is a jackknife (re)sample. If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), resampling is done by group. Hence, the result is a list of length equal to the number of group levels with each list element containing N jackknife samples based on the N_g observations of group g.
- Permutation
If a matrix or data.frame without grouping variable is provided, an error is returned as permutation will simply reorder the observations. If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data of one group), group membership is permuted. Hence, the result is a list of length .R where each element of that list is a permutation (re)sample.
- Cross-validation
If a matrix or data.frame without grouping variable is provided, a list of length .R is returned. Each list element contains a list containing the k splits/folds subsequently used as test and training data sets. If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), cross-validation is repeated .R times for each group. Hence, the result is a list of length equal to the number of groups, each containing .R list elements (the repetitions) which in turn contain the k splits/folds.
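The nested shape described for grouped input (groups on the outer level, resamples on the inner level) can be pictured with a base R sketch. The helper `grouped_bootstrap_sketch()` is hypothetical and only imitates the shape of the output, not the package's code.

```r
# Hypothetical helper (not part of cSEM): bootstrap within each group,
# returning a list of groups that each hold R resamples.
grouped_bootstrap_sketch <- function(data, id, R = 499) {
  lapply(split(data, data[[id]]), function(d) {
    lapply(seq_len(R), function(i) {
      d[sample.int(nrow(d), replace = TRUE), , drop = FALSE]
    })
  })
}

set.seed(1)
toy <- data.frame(
  x1    = rnorm(10),
  group = rep(c("female", "male"), times = c(4, 6))
)
res <- grouped_bootstrap_sketch(toy, "group", R = 3)

names(res)              # outer level: one element per group
lengths(res)            # inner level: R resamples per group
nrow(res$female[[1]])   # each resample has N_g rows
```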
References
Bengtsson H (2018).
future.apply: Apply Function to Elements in Parallel using Futures.
R package version 1.0.1, https://CRAN.R-project.org/package=future.apply.
Hesterberg TC (2015).
“What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum.”
The American Statistician, 69(4), 371–386.
doi:10.1080/00031305.2015.1089789.
See Also
csem(), cSEMResults, resamplecSEMResults()
Examples
# ===========================================================================
# Using the raw data
# ===========================================================================
### Bootstrap (default) -----------------------------------------------------
res_boot1 <- resampleData(.data = satisfaction)
str(res_boot1, max.level = 3, list.len = 3)
## To replicate a bootstrap draw use .seed:
res_boot1a <- resampleData(.data = satisfaction, .seed = 2364)
res_boot1b <- resampleData(.data = satisfaction, .seed = 2364)
identical(res_boot1a, res_boot1b) # TRUE
### Jackknife ---------------------------------------------------------------
res_jack <- resampleData(.data = satisfaction, .resample_method = "jackknife")
str(res_jack, max.level = 3, list.len = 3)
### Cross-validation --------------------------------------------------------
## Create dataset for illustration:
dat <- data.frame(
  "x1" = rnorm(100),
  "x2" = rnorm(100),
  "group" = sample(c("male", "female"), size = 100, replace = TRUE),
  stringsAsFactors = FALSE)
## 10-fold cross-validation (repeated 100 times)
cv_10a <- resampleData(.data = dat, .resample_method = "cross-validation",
                       .R = 100)
str(cv_10a, max.level = 3, list.len = 3)
# Cross-validation can be done by group if a group identifier is provided:
cv_10 <- resampleData(.data = dat, .resample_method = "cross-validation",
                      .id = "group", .R = 100)
## Leave-one-out-cross-validation (repeated 50 times)
cv_loocv <- resampleData(.data = dat[, -3],
                         .resample_method = "cross-validation",
                         .cv_folds = nrow(dat),
                         .R = 50)
str(cv_loocv, max.level = 2, list.len = 3)
### Permutation -------------------------------------------------------------
res_perm <- resampleData(.data = dat, .resample_method = "permutation",
                         .id = "group")
str(res_perm, max.level = 2, list.len = 3)
# Forgetting to set .id causes an error
## Not run:
res_perm <- resampleData(.data = dat, .resample_method = "permutation")
## End(Not run)
# ===========================================================================
# Using a cSEMResults object
# ===========================================================================
model <- "
# Structural model
QUAL ~ EXPE
EXPE ~ IMAG
SAT ~ IMAG + EXPE + QUAL + VAL
LOY ~ IMAG + SAT
VAL ~ EXPE + QUAL
# Measurement model
EXPE =~ expe1 + expe2 + expe3 + expe4 + expe5
IMAG =~ imag1 + imag2 + imag3 + imag4 + imag5
LOY =~ loy1 + loy2 + loy3 + loy4
QUAL =~ qual1 + qual2 + qual3 + qual4 + qual5
SAT =~ sat1 + sat2 + sat3 + sat4
VAL =~ val1 + val2 + val3 + val4
"
a <- csem(satisfaction, model)
# Create bootstrap and jackknife samples
res_boot <- resampleData(a, .resample_method = "bootstrap", .R = 499)
res_jack <- resampleData(a, .resample_method = "jackknife")
# Since `satisfaction` is the dataset used the following approaches yield
# identical results.
res_boot_data <- resampleData(.data = satisfaction, .seed = 2364)
res_boot_object <- resampleData(a, .seed = 2364)
identical(res_boot_data, res_boot_object) # TRUE