sampleVADIR {sampleVADIR} | R Documentation |
Draw stratified samples from VADIR database
Description
Core function used to pull a stratified sample from VADIR based on a variety of parameters.
Usage
sampleVADIR(
  data,
  n = 4500,
  vars = "all",
  rankDat = "rankDat",
  payRanks = 4,
  post911 = TRUE,
  dischargedAfter = FALSE,
  until = NULL,
  ageDischarge = TRUE,
  ageEnlist = FALSE,
  ageNow = FALSE,
  yearsServed = FALSE,
  dateformat = "%m/%d/%Y",
  params = NULL,
  formats = "default",
  typos = list(),
  rmDeviates = FALSE,
  timeCats = FALSE,
  saveData = TRUE,
  onlyIDs = FALSE,
  oversample = FALSE,
  exclude = FALSE,
  seed = NULL
)
Arguments
data |
VADIR dataset |
n |
Total desired sample size |
vars |
Character vector indicating which variables to use in stratification |
rankDat |
Dataset linking ranks to pay grade, or character string
indicating where to pull that dataset from. Recommended to leave as
|
payRanks |
Number of pay grades to use when converting rank variable. Only options are either 4 or 7. |
post911 |
Logical. Determines whether to only consider individuals deployed after 9/11/2001 |
dischargedAfter |
Character string indicating what date to restrict
sampling to based on discharge date. Can set to |
until |
Upper limit on the date when service started. |
ageDischarge |
Logical. Determines whether to use age at discharge as a stratum. |
ageEnlist |
Logical. Determines whether to use age at enlistment as a stratum. |
ageNow |
Logical. Determines whether to use current age as a stratum. |
yearsServed |
Logical. Determines whether to use total years served as a stratum. |
dateformat |
Character string indicating the expected date format. Should be automatically detected. |
params |
Optional list of parameters to override defaults in function. Creates an easy way to interface with the function if performing the stratification multiple times. Allows the user to avoid writing the same arguments multiple times. |
formats |
Should be |
typos |
List containing typos to be fixed, as well as what they should
be changed to. Leave at |
rmDeviates |
Logical. Determines whether rows with unexpected response
values are removed. If |
timeCats |
Logical or numeric. Determines whether the time-related
variables should be treated as categorical variables. If |
saveData |
Logical. Determines whether to save the full dataset in the output. Specifically, returns the full dataset of candidates, which may omit people who were removed from consideration due to errors or unexpected responses. |
onlyIDs |
Logical. Determines whether to only return ID values for selected individuals rather than a full dataset. |
oversample |
Logical. Determines whether to oversample or undersample based on limitations due to available proportions of strata in subsample. |
exclude |
Logical. Determines whether to exclude people missing a zip
code, as well as people with |
seed |
Numeric value indicating the seed to set for the stratification procedure. Allows for reproducible results. |
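The params argument described above lends itself to reuse across several runs. The following is a minimal sketch, assuming params accepts a plain named list whose names match the arguments in Usage (as the Examples section suggests). It uses only base R's utils::modifyList() to derive a variant of a parameter list. The names base_params and small_params are hypothetical, and the sampleVADIR() calls are left commented out because they require the package and a VADIR dataset.

```r
# Base parameter list; names mirror sampleVADIR() arguments from the Usage section.
base_params <- list(
  n = 7000,
  payRanks = 4,
  post911 = FALSE,
  oversample = TRUE
)

# Derive a variant without retyping the shared settings.
# modifyList() replaces only the named elements and keeps the rest.
small_params <- utils::modifyList(base_params, list(n = 4500, payRanks = 7))

# Hypothetical calls, not run here (they need the package and a VADIR dataset):
# out_a <- sampleVADIR(VADIR_fake, params = base_params, seed = 19)
# out_b <- sampleVADIR(VADIR_fake, params = small_params, seed = 19)

str(small_params)
```

Keeping one base list and deriving variants from it makes it easier to see which settings differ between stratification runs.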
Details
Performs stratification separately for males and females, where males and females are sampled at a 1:1 ratio, regardless of population ratio.
With a large dataset (which is typical for VADIR), setting any of the date-related variables to TRUE can drastically increase computation time. The relevant arguments are ageDischarge, ageEnlist, ageNow, and yearsServed.
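To illustrate the 1:1 sampling of males and females described above, here is a rough base R sketch on made-up data. It is not the package's implementation: the toy population, the column names (id, sex, branch), the helper sample_one_sex(), and the proportional allocation rule are all assumptions chosen for illustration.

```r
set.seed(19)  # fixed seed so the sketch is reproducible

# Hypothetical toy population: 70% male, 30% female, one stratifying variable.
pop <- data.frame(
  id     = 1:1000,
  sex    = rep(c("M", "F"), times = c(700, 300)),
  branch = sample(c("Army", "Navy", "Air Force"), 1000, replace = TRUE)
)

# Draw roughly n_sex rows from one sex, allocating proportionally across strata.
sample_one_sex <- function(d, n_sex) {
  counts <- table(d$branch)
  alloc  <- setNames(as.integer(round(n_sex * counts / sum(counts))), names(counts))
  rows <- lapply(names(alloc), function(s) {
    idx <- which(d$branch == s)
    # Never request more rows than the stratum contains.
    idx[sample.int(length(idx), min(alloc[[s]], length(idx)))]
  })
  d[unlist(rows), ]
}

n <- 200
# Each sex receives half of the total sample, regardless of its population share.
out <- lapply(split(pop, pop$sex), sample_one_sex, n_sex = n / 2)
sapply(out, nrow)
```

Because each stratum's allocation is rounded, the per-sex totals can differ from n / 2 by a row or so. In this sketch a stratum with too few rows is simply capped at its size rather than compensated for elsewhere.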
Value
A list containing the males and females who were sampled from VADIR.
Examples
params <- list(
  n = 7000,
  vars = c('PN_Sex_CD', 'PN_BRTH_DT', 'SVC_CD', 'PNL_CAT_CD', 'RANK_CD',
           'PNL_TERM_DT', 'PNL_BGN_DT', 'OMB_RACE_CD',
           'OMB_ETHNC_NAT_ORIG_CD', 'POST_911_DPLY_IND_CD'),
  rankDat = 'rankDat',
  payRanks = 4,
  post911 = FALSE,
  until = NULL,
  dischargedAfter = FALSE,
  ageDischarge = TRUE,
  ageEnlist = FALSE,
  ageNow = FALSE,
  yearsServed = FALSE,
  dateformat = '%m/%d/%Y',
  formats = 'default',
  rmDeviates = FALSE,
  timeCats = TRUE,
  saveData = TRUE,
  onlyIDs = FALSE,
  oversample = TRUE,
  exclude = FALSE,
  typos = list()
)
out <- sampleVADIR(VADIR_fake, params = params, seed = 19)