resampleData {cSEM}  R Documentation 
Resample data from a data set using common resampling methods.
For bootstrap or jackknife resampling, package users usually do not need to call this function directly; use resamplecSEMResults() instead.
resampleData(
  .object          = NULL,
  .data            = NULL,
  .resample_method = c("bootstrap", "jackknife", "permutation", "crossvalidation"),
  .cv_folds        = 10,
  .id              = NULL,
  .R               = 499,
  .seed            = NULL
)
.object 
An R object of class cSEMResults resulting from a call to csem().
.data 
A data.frame, a matrix, or a list of such data sets (one list element per group).
.resample_method 
Character string. The resampling method to use. One of: "bootstrap", "jackknife", "permutation", or "crossvalidation". Defaults to "bootstrap". 
.cv_folds 
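Integer. The number of cross-validation folds to use. Setting .cv_folds to N (the number of observations) produces leave-one-out cross-validation samples. Defaults to 10.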
.id 
Character string or integer. A character string giving the name or an integer giving the position of the column of .data used to split the data into groups. Defaults to NULL.
.R 
Integer. The number of bootstrap runs, permutation runs, or cross-validation repetitions to use. Defaults to 499.
.seed 
Integer or NULL. The random seed to use. Defaults to NULL, in which case a random seed is chosen.
The function resampleData() is general purpose. It simply resamples data from a data set according to the resampling method provided via the .resample_method argument and returns a list of resamples. Currently, bootstrap, jackknife, permutation, and cross-validation (both leave-one-out (LOOCV) and k-fold cross-validation) are implemented.
The user may provide the data set to resample either explicitly via the .data argument or implicitly by providing a cSEMResults object to .object, in which case the original data used in the call that created the cSEMResults object is used for resampling. If both a cSEMResults object and a data set via .data are provided, the former is ignored.
As csem() accepts a single data set, a list of data sets, as well as data sets that contain a column used to split the data into groups, the cSEMResults object may contain multiple data sets. In this case, resampling is done by data set or group. Note that, depending on the number of data sets/groups provided, this computation may be slower, as resampling is repeated for each data set/group.
To split data provided via the .data argument into groups, the name or the index of the column containing the group levels must be given to .id. If data that contains grouping is taken from a cSEMResults object, .id is taken from the object information. Hence, providing .id is redundant in this case and is therefore ignored.
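The grouping step described above can be pictured with a short base R sketch. It is only an illustration and not the package's internal code: the data set dat, the variable id, and the choice to drop the grouping column after splitting are all assumptions made for this example.

```r
# Hypothetical illustration data; `group` plays the role of the .id column.
dat <- data.frame(
  x1    = c(0.1, 0.4, -0.3, 1.2, 0.8, -0.5),
  x2    = c(1.0, 0.2, 0.7, -0.9, 0.3, 0.6),
  group = c("male", "female", "male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

id <- "group"  # could equally be the column position, e.g. id <- 3

# Resolve a column position to a column name so both forms are handled alike.
id_name <- if (is.numeric(id)) names(dat)[id] else id

# Split the rows by group level; dropping the grouping column afterwards is
# an assumption of this sketch.
data_by_group <- split(dat[, names(dat) != id_name, drop = FALSE], dat[[id_name]])

names(data_by_group)         # the group levels
sapply(data_by_group, nrow)  # number of observations per group (N_g)
```

Resampling would then be applied to each element of data_by_group in turn, which is why the run time grows with the number of groups.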
The number of bootstrap or permutation runs as well as the number of cross-validation repetitions is given by .R. The default is 499 but should be increased in real applications. See, e.g., Hesterberg (2015), p. 380 for recommendations concerning the bootstrap. For the jackknife, .R is ignored as it is based on the N leave-one-out data sets.
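To make the role of .R concrete, here is a minimal base R sketch of the two schemes. It is not taken from the package source; dat, N, and R are hypothetical names, with R standing in for .R.

```r
set.seed(2364)

# Hypothetical data set with N = 5 observations.
dat <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
N   <- nrow(dat)
R   <- 3  # number of bootstrap runs; plays the role of .R

# Bootstrap: R resamples, each drawing N rows with replacement.
boot <- lapply(seq_len(R), function(r) {
  dat[sample(N, size = N, replace = TRUE), , drop = FALSE]
})

# Jackknife: exactly N resamples, each leaving out one row, so R is not used.
jack <- lapply(seq_len(N), function(i) dat[-i, , drop = FALSE])

length(boot)        # equals R
length(jack)        # equals N
sapply(jack, nrow)  # every jackknife sample has N - 1 rows
```

The number of jackknife samples is fixed by the data, which is why only the bootstrap count can be raised for real applications.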
Choosing .resample_method = "permutation" for ungrouped data causes an error, as permutation would simply reorder the observations, which is usually not meaningful. If a list of data is provided, each list element is assumed to represent the observations belonging to one group. In this case, the data is pooled and group membership is permuted.
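Pooling and permuting group membership can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: data_list, pooled, labels, and the group column attached to each resample are hypothetical.

```r
set.seed(2364)

# Hypothetical list of data sets, one element per group.
data_list <- list(
  female = data.frame(x1 = rnorm(4), x2 = rnorm(4)),
  male   = data.frame(x1 = rnorm(6), x2 = rnorm(6))
)

# Pool the observations and record the original group labels.
pooled <- do.call(rbind, data_list)
labels <- rep(names(data_list), times = sapply(data_list, nrow))

R <- 3  # number of permutation runs; plays the role of .R

# Each run keeps the pooled rows fixed and shuffles the group labels, so
# group sizes are preserved while membership is reassigned at random.
perm <- lapply(seq_len(R), function(r) {
  out <- pooled
  out$group <- sample(labels)
  out
})

table(perm[[1]]$group)  # same group sizes as in the original data
```

With a single ungrouped data set there are no labels to shuffle, which is the reason the ungrouped case is rejected.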
For cross-validation the number of folds (k) defaults to 10. It may be changed via the .cv_folds argument. Setting k = 2 (not 1!) splits the data into a single training and test data set. Setting k = N (where N is the number of observations) produces leave-one-out cross-validation samples. Note: 1.) At least 2 folds are required (k > 1); 2.) k cannot be larger than N; 3.) If N/k is not an integer, the last fold will have fewer observations.
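The three constraints on k can be illustrated with a small base R helper. The helper make_folds() is hypothetical, and the way it distributes leftover observations (fold sizes differ by at most one, later folds being the smaller ones) is an assumption of this sketch, not necessarily how the package assigns rows to folds.

```r
# Hypothetical helper: assign N row indices to k folds at random.
make_folds <- function(N, k) {
  stopifnot(k > 1, k <= N)  # at least 2 folds, and no more folds than rows
  fold_id <- sort(rep(seq_len(k), length.out = N))
  split(sample(N), fold_id)
}

set.seed(2364)

folds_3  <- make_folds(N = 10, k = 3)   # N / k is not an integer
folds_2  <- make_folds(N = 10, k = 2)   # single training/test split
folds_lo <- make_folds(N = 10, k = 10)  # leave-one-out: one row per fold

lengths(folds_3)   # 4, 3, 3: the last fold is smaller than the first
lengths(folds_2)   # 5, 5
lengths(folds_lo)  # ten folds of size 1

# Each fold serves once as the test set; the remaining rows form the training set.
cv_3 <- lapply(folds_3, function(test_idx) {
  list(test = test_idx, train = setdiff(seq_len(10), test_idx))
})
```

Repeating the whole procedure .R times with fresh random assignments gives the repeated cross-validation described above.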
Random number generation (RNG) uses the L'Ecuyer-CMRG RNG stream as implemented in the future.apply package (Bengtsson 2018). See ?future_lapply for details. By default a random seed is chosen.
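The effect of fixing a seed can be shown in base R alone. The sketch below only demonstrates that the L'Ecuyer-CMRG generator yields identical draws for identical seeds; it does not reproduce how future.apply derives separate streams for each iteration, and the names draw_a and draw_b are hypothetical.

```r
# Switch to the L'Ecuyer-CMRG generator and remember the previous settings.
old_kind <- RNGkind("L'Ecuyer-CMRG")

set.seed(2364)
draw_a <- sample(100, size = 5, replace = TRUE)

set.seed(2364)
draw_b <- sample(100, size = 5, replace = TRUE)

identical(draw_a, draw_b)  # TRUE: same seed, same resample indices

# Restore the generator that was active before.
do.call(RNGkind, as.list(old_kind))
```

In the same spirit, passing the same value to .seed in two calls is what makes their resamples comparable.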
The structure of the output depends on the type of input and the resampling method:
Bootstrap: If a matrix or data.frame without grouping variable is provided (i.e., .id = NULL), the result is a list of length .R (default 499). Each element of that list is a bootstrap (re)sample.
If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), resampling is done by group. Hence, the result is a list of length equal to the number of groups, with each list element containing .R bootstrap samples based on the N_g observations of group g.
Jackknife: If a matrix or data.frame without grouping variable is provided (.id = NULL), the result is a list of length equal to the number of observations/rows (N) of the data set provided. Each element of that list is a jackknife (re)sample.
If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), resampling is done by group. Hence, the result is a list of length equal to the number of group levels, with each list element containing N_g jackknife samples based on the N_g observations of group g.
Permutation: If a matrix or data.frame without grouping variable is provided, an error is raised, as permutation would simply reorder the observations.
If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), group membership is permuted. Hence, the result is a list of length .R where each element of that list is a permutation (re)sample.
Cross-validation: If a matrix or data.frame without grouping variable is provided, a list of length .R is returned. Each list element contains a list of the k splits/folds subsequently used as test and training data sets.
If a grouping variable is specified or a list of data is provided (where each list element is assumed to contain data for one group), cross-validation is repeated .R times for each group. Hence, the result is a list of length equal to the number of groups, each containing .R list elements (the repetitions), which in turn contain the k splits/folds.
Bengtsson H (2018).
future.apply: Apply Function to Elements in Parallel using Futures.
R package version 1.0.1, https://CRAN.R-project.org/package=future.apply.
Hesterberg TC (2015).
“What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum.”
The American Statistician, 69(4), 371–386.
doi: 10.1080/00031305.2015.1089789.
csem(), cSEMResults, resamplecSEMResults()
library(cSEM)  # provides resampleData(), csem(), and the satisfaction data set

# ===========================================================================
# Using the raw data
# ===========================================================================
### Bootstrap (default)

res_boot1 <- resampleData(.data = satisfaction)
str(res_boot1, max.level = 3, list.len = 3)

## To replicate a bootstrap draw use .seed:
res_boot1a <- resampleData(.data = satisfaction, .seed = 2364)
res_boot1b <- resampleData(.data = satisfaction, .seed = 2364)

# Both calls use the same seed, so the resamples match
identical(res_boot1a, res_boot1b) # TRUE

### Jackknife

res_jack <- resampleData(.data = satisfaction, .resample_method = "jackknife")
str(res_jack, max.level = 3, list.len = 3)

### Cross-validation

## Create dataset for illustration:
dat <- data.frame(
  "x1"    = rnorm(100),
  "x2"    = rnorm(100),
  "group" = sample(c("male", "female"), size = 100, replace = TRUE),
  stringsAsFactors = FALSE)

## 10-fold cross-validation (repeated 100 times)
cv_10a <- resampleData(.data = dat, .resample_method = "crossvalidation",
                       .R = 100)
str(cv_10a, max.level = 3, list.len = 3)

# Cross-validation can be done by group if a group identifier is provided:
cv_10 <- resampleData(.data = dat, .resample_method = "crossvalidation",
                      .id = "group", .R = 100)

## Leave-one-out cross-validation (repeated 50 times);
## dat[, -3] drops the grouping column
cv_loocv <- resampleData(.data = dat[, -3],
                         .resample_method = "crossvalidation",
                         .cv_folds = nrow(dat),
                         .R = 50)
str(cv_loocv, max.level = 2, list.len = 3)

### Permutation

res_perm <- resampleData(.data = dat, .resample_method = "permutation",
                         .id = "group")
str(res_perm, max.level = 2, list.len = 3)

# Forgetting to set .id causes an error
## Not run:
# res_perm <- resampleData(.data = dat, .resample_method = "permutation")
## End(Not run)

# ===========================================================================
# Using a cSEMResults object
# ===========================================================================

model <- "
# Structural model
QUAL ~ EXPE
EXPE ~ IMAG
SAT  ~ IMAG + EXPE + QUAL + VAL
LOY  ~ IMAG + SAT
VAL  ~ EXPE + QUAL

# Measurement model
EXPE =~ expe1 + expe2 + expe3 + expe4 + expe5
IMAG =~ imag1 + imag2 + imag3 + imag4 + imag5
LOY  =~ loy1 + loy2 + loy3 + loy4
QUAL =~ qual1 + qual2 + qual3 + qual4 + qual5
SAT  =~ sat1 + sat2 + sat3 + sat4
VAL  =~ val1 + val2 + val3 + val4
"
a <- csem(satisfaction, model)

# Create bootstrap and jackknife samples
res_boot <- resampleData(a, .resample_method = "bootstrap", .R = 499)
res_jack <- resampleData(a, .resample_method = "jackknife")

# Since `satisfaction` is the dataset used the following approaches yield
# identical results.
res_boot_data   <- resampleData(.data = satisfaction, .seed = 2364)
res_boot_object <- resampleData(a, .seed = 2364)

identical(res_boot_data, res_boot_object) # TRUE