repMean {eatRep} | R Documentation |
Replication methods (JK1, JK2 and BRR) for descriptive statistics.
Description
Compute totals, means, adjusted means, mean differences, variances and standard deviations
with standard errors in random or clustered or complex samples. Variance estimation in complex cluster
designs based on Jackknife (JK1, JK2) or Balanced Repeated Replicates (BRR) procedure. Moreover, analyses
can be customized for multiple or nested imputed variables, applying the combination rules of Rubin (1987)
for imputed data and Rubin (2003) for nested imputed data. Conceptually, the function combines replication
methods and methods for multiple imputed data. Trend estimation as usual in large-scale assessments is supported as well.
Technically, this is a wrapper for the svymean
and svyvar
functions of the survey
package.
Usage
repMean (datL, ID, wgt = NULL, type = c("none", "JK2", "JK1", "BRR", "Fay"), PSU = NULL,
repInd = NULL, jkfac=NULL, repWgt = NULL, nest=NULL, imp=NULL, groups = NULL,
group.splits = length(groups), group.differences.by = NULL,
cross.differences = FALSE, crossDiffSE = c("wec", "rep","old"),
adjust = NULL, useEffectLiteR = FALSE, nBoot = 100, group.delimiter = "_",
trend = NULL, linkErr = NULL, dependent, na.rm = FALSE, doCheck = TRUE,
engine = c("survey", "BIFIEsurvey"), scale = 1, rscales = 1, mse=TRUE,
rho=NULL, hetero=TRUE, se_type = c("HC3", "HC0", "HC1", "HC2", "CR0", "CR2"),
clusters = NULL, crossDiffSE.engine= c("lavaan", "lm"),
stochasticGroupSizes = FALSE, verbose = TRUE, progress = TRUE)
Arguments
datL |
Data frame in the long format (i.e. each line represents one ID unit in one imputation of one nest) containing all variables for analysis. |
ID |
Variable name or column number of student identifier (ID) variable. ID variable must not contain any missing values. |
wgt |
Optional: Variable name or column number of weighting variable. If no weighting variable is specified, all cases will be equally weighted. |
type |
Defines the replication method for cluster replicates which is to be applied. Depending on |
PSU |
Variable name or column number of variable indicating the primary sampling unit (PSU). When a jackknife procedure is applied,
the PSU is the jackknife zone variable. If |
repInd |
Variable name or column number of variable indicating replicate ID. In a jackknife procedure, this is the jackknife replicate
variable. If |
jkfac |
Only applies if |
repWgt |
Normally, replicate weights are created by |
nest |
Optional: name or column number of the nesting variable. Only applies in nested multiple imputed data sets. |
imp |
Optional: name or column number of the imputation variable. Only applies in multiple imputed data sets. |
groups |
Optional: vector of names or column numbers of one or more grouping variables. |
group.splits |
Optional: If groups are defined, |
group.differences.by |
Optional: Specifies variable group differences should be computed for. The corresponding variable must be included in
the |
cross.differences |
Either a list of vectors, specifying the pairs of levels for which cross-level differences should be computed.
Alternatively, if |
crossDiffSE |
Method for standard error estimation for cross level differences, where groups are dependent.
|
adjust |
Variable name or column number of variable(s) for which adjusted means should be computed. Non-numeric variables (factors) will be converted to 0/1 dichotomous variables. |
useEffectLiteR |
Logical: use the |
nBoot |
Without replicates (i.e., for completely random samples), the |
group.delimiter |
Character string which separates the group names in the output frame. |
trend |
Optional: name or column number of the trend variable. Note: Trend variable must have exact two levels. Levels for grouping variables must be equal in both 'sub populations' partitioned by the trend variable. |
linkErr |
Optional: Either the name or column number of the linking error variable. If |
dependent |
Variable name or column number of the dependent variable. |
na.rm |
Logical: Should cases with missing values be dropped? |
doCheck |
Logical: Check the data for consistency before analysis? If |
engine |
Which package should be used for estimation? |
scale |
scaling constant for variance, for details, see help page of |
rscales |
scaling constant for variance, for details, see help page of |
mse |
Logical: If |
rho |
Shrinkage factor for weights in Fay's method. If |
hetero |
Logical: Assume heteroscedastic variance for weighted effect coding? |
se_type |
The sort of standard error sought for cross level differences. Only applies if |
clusters |
Optional: Variable name or column number of cluster variable. Only necessary if weighted effecting coding
should be performed using heteroscedastic variances. See the help page of |
crossDiffSE.engine |
Software implementation used for estimating cross-level differences. Choices are either |
stochasticGroupSizes |
Logical: Assume stochastic group sizes for using weighted effect coding in cross-level differences? Note: To date,
only |
verbose |
Logical: Show analysis information on console? |
progress |
Logical: Show progress bar on console? |
Details
Function first creates replicate weights based on PSU and repInd variables (if defined) according to JK2 or BRR procedure
as implemented in WesVar. According to multiple imputed data sets, a workbook with several analyses is created.
The function afterwards serves as a wrapper for svymean
called by svyby
implemented in
the ‘survey’ package. The results of the several analyses are then pooled according to Rubin's rule.
Value
A list of data frames in the long format. The output can be summarized using the report
function.
The first element of the list is a list with either one (no trend analyses) or two (trend analyses)
data frames with at least six columns each. For each subpopulation denoted by the groups
statement, each parameter (i.e., mean, variance, or group differences) and each coefficient (i.e., the estimate
and the corresponding standard error) the corresponding value is given.
group |
Denotes the group an analysis belongs to. If no groups were specified and/or analysis for the whole sample were requested, the value of ‘group’ is ‘wholeGroup’. |
depVar |
Denotes the name of the dependent variable in the analysis. |
modus |
Denotes the mode of the analysis. For example, if a JK2 analysis without sampling weights was conducted, ‘modus’ takes the value ‘jk2.unweighted’. If a analysis without any replicates but with sampling weights was conducted, ‘modus’ takes the value ‘weighted’. |
parameter |
Denotes the parameter of the regression model for which the corresponding value is given further. Amongst others, the ‘parameter’ column takes the values ‘mean’, ‘sd’, ‘var’ and ‘meanGroupDiff’ if group differences were requested. |
coefficient |
Denotes the coefficient for which the corresponding value is given further. Takes the values ‘est’ (estimate) and ‘se’ (standard error of the estimate). |
value |
The value of the parameter estimate in the corresponding group. |
If groups were specified, further columns which are denoted by the group names are added to the data frame.
References
te Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health. 62, 163–167.
Sachse, K. A. & Haag, N. (2017). Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Applied Measurement in Education, 30, (2), 102-116. http://dx.doi.org/10.1080/08957347.2017.1283315
Weirich, S., Hecht, M., Becker, B. et al. Comparing group means with the total mean in random samples, surveys, and large-scale assessments: A tutorial and software illustration. Behav Res (2021). https://doi.org/10.3758/s13428-021-01553-1
Examples
data(lsa)
### Example 1: only means, SD and variances for each country
### We only consider domain 'reading'
rd <- lsa[which(lsa[,"domain"] == "reading"),]
### We only consider the first "nest".
rdN1 <- rd[which(rd[,"nest"] == 1),]
### First, we only consider year 2010
rdN1y10<- rdN1[which(rdN1[,"year"] == 2010),]
### mean estimation
means1 <- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
repInd = "jkrep", imp="imp", groups = "country", dependent = "score",
na.rm=FALSE, doCheck=TRUE, engine = "BIFIEsurvey")
### reporting function: the function does not know which content domain is being considered,
### so the user may add new columns in the output using the 'add' argument
res1 <- report(means1, add = list(domain = "reading"))
### Example 1a: Additionally to example 1, we decide to estimate whether
### each country's mean differ significantly from the overall mean as well
### as from the individual means of the other contries
means1a<- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
repInd = "jkrep", imp="imp", groups = "country", group.splits = 0:1,
group.differences.by = "country", cross.differences = TRUE,
dependent = "score", na.rm=FALSE, doCheck=TRUE, hetero=FALSE)
res1a <- report(means1a, add = list(domain = "reading"))
### See that the means of 'countryA' and 'countryB' significantly differ from the overall mean.
print(res1a[intersect(which(res1a[,"comparison"] == "crossDiff"),
which(res1a[,"parameter"] == "mean")),], digits = 3)
### Example 2: Sex differences by country. Assume equally weighted cases by omitting
### 'wgt' argument.
means2 <- repMean(datL = rdN1y10, ID="idstud", type = "JK2", PSU = "jkzone",
repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2,
group.differences.by="sex", dependent = "score", na.rm=FALSE, doCheck=TRUE,
cross.differences =TRUE, crossDiffSE.engine= "lm")
res2 <- report(means2,add = list(domain = "reading"))
### Example 2a: Additionally to example 2, we decide to estimate whether
### each country's mean differ significantly from the overall mean. (Note: by default,
### such cross level differences are estimated using 'weighted effect coding'. Use the
### 'crossDiffSE' argument to choose alternative methods.) Moreover, we estimate whether
### each country's sex difference differ significantly from the sex difference in the
### whole population.
means2a<- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2,
group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)),
dependent = "score", na.rm=FALSE, doCheck=TRUE,
crossDiffSE.engine= "lm", clusters = "idclass")
res2a <- report(means2a,add = list(domain = "reading"), trendDiffs = TRUE)
### Third example: like example 2a, but using nested imputations of dependent variable,
### and additionally estimating trend: use 'rd' instead of 'rdN1y10'
### assume equally weighted cases by omitting 'wgt' argument
### ignoring jackknife by omitting 'type', 'PSU' and 'repInd' argument
means3T<- repMean(datL = rd, ID="idstud", imp="imp", nest="nest",
groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex",
cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE,
doCheck=TRUE, trend = "year", linkErr = "leScore",
crossDiffSE = "wec", crossDiffSE.engine= "lavaan")
res3T <- report(means3T, add = list(domain = "reading"))
### Example 3a: like example 3, but providing linking errors in an additional data.frame
### This is optional for two measurement occasions but mandatory if the analysis contains
### more than two measurement occasions
linkErr<- data.frame ( trendLevel1 = 2010, trendLevel2 = 2015, depVar = "score",
parameter = "mean", unique(lsa[,c("domain", "leScore")]),
stringsAsFactors = FALSE)
colnames(linkErr) <- car::recode(colnames(linkErr), "'leScore'='linkingError'")
### note that the linking errors for the specified domain have to be chosen via
### subsetting
means3a<- repMean(datL = rd, ID="idstud", imp="imp", nest="nest",
groups = c("country", "sex"),
group.splits = 0:2, group.differences.by="sex",
cross.differences = list(c(0,1), c(0,2)),
dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year",
linkErr = linkErr[which(linkErr[,"domain"] == "reading"),],
crossDiffSE = "wec", crossDiffSE.engine= "lavaan")
res3a <- report(means3a, add = list(domain = "reading"))
### Fourth example: using a loop do analyse 'reading' and 'listening' comprehension
### in one function call. Again with group and cross differences and trends, and
### trend differences
### we use weights but omit jackknife analysis by omitting 'type', 'PSU' and 'repInd'
### argument
means4T<- by ( data = lsa, INDICES = lsa[,"domain"], FUN = function (sub.dat) {
repMean(datL = sub.dat, ID="idstud", wgt="wgt", imp="imp", nest="nest",
groups = c("country", "sex"), group.splits = 0:2,
group.differences.by="sex",
cross.differences = list(c(0,1), c(0,2)), dependent = "score",
na.rm=FALSE, doCheck=TRUE,
trend = "year", linkErr = "leScore", crossDiffSE.engine= "lm") })
ret4T <- do.call("rbind", lapply(names(means4T), FUN = function ( domain ) {
report(means4T[[domain]], trendDiffs = TRUE, add = list(domain = domain))}))
### Fifth example: compute adjusted means, also with trend estimation
### Note: all covariates must be numeric or 0/1 dichotomous
rdN1[,"mignum"] <- as.numeric(rdN1[,"mig"])
rdN1[,"sexnum"] <- car::recode(rdN1[,"sex"], "'male'=0; 'female'=1", as.numeric=TRUE,
as.factor=FALSE)
means5 <- repMean(datL = rdN1, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
repInd = "jkrep", imp="imp", groups = "country",
adjust = c("sexnum", "ses", "mignum"), useEffectLiteR = FALSE,
dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year",
linkErr = "leScore")
res5 <- report(means5, add = list(domain = "reading"))
## Not run:
############################################################################################
# Example 6: R code for running the PISA 2015 science example to compare group means #
# with the total mean using weighted effect coding #
############################################################################################
# Warning: large PISA data set requires at least 16 GB free working memory (RAM):
### define necessary directories (note: writing permissions required)
folder <- tempdir()
### download PISA 2015 zipped student questionnaire data (420 MB) to a folder with
### writing permissions
download.file(url = "https://webfs.oecd.org/pisa/PUF_SPSS_COMBINED_CMB_STU_QQQ.zip",
destfile = file.path(folder, "pisa2015.zip"))
### unzip PISA 2015 student questionnaire data (1.5 GB) to temporary folder
zip::unzip(zipfile = file.path(folder, "pisa2015.zip"), files= "CY6_MS_CMB_STU_QQQ.sav",
exdir=folder)
### read data
pisa <- foreign::read.spss(file.path (folder, "CY6_MS_CMB_STU_QQQ.sav"),
to.data.frame=TRUE, use.value.labels = FALSE, use.missings = TRUE)
# dependent variables
measure.vars <- paste0("PV", 1:10, "SCIE")
### choose desired variables and reshape into the long format
# 'CNTSTUID' = individual student identifier
# 'CNT' = country identifier
# 'SENWT' = senate weight (assume a population of 5000 in each country)
# 'W_FSTUWT' = final student weight
# 'OECD' = dummy variable indicating which country is part of the OECD
# 'W_FSTURWT' (1 to 80) = balanced repeated replicate weights
# 'PV1SCIE' to 'PV10SCIE' = 10 plausible values of (latent) science performance
pisaLong <- reshape2::melt(pisa, id.vars = c("CNTSTUID", "CNT", "SENWT", "W_FSTUWT",
"OECD", paste0("W_FSTURWT", 1:80)),
measure.vars = measure.vars, value.name = "value", variable.name="imp",
na.rm=TRUE)
### choose OECD countries
oecd <- pisaLong[which(pisaLong[,"OECD"] == 1),]
### analyze data
### analysis takes approximately 30 minutes on an Intel i5-6500 machine with 32 GB RAM
means <- repMean( datL = oecd, # data.frame in the long format
ID = "CNTSTUID", # student identifier
dependent = "value", # the dependent variable in the data
groups = "CNT", # the grouping variable
wgt = "SENWT", # (optional) weighting variable. We use senate
# weights (assume a population of 5000 in each
# country)
type = "Fay", # type of replication method. Corresponding to
# the PISA sampling method, we use "Fay"
rho = 0.5, # shrinkage factor for weights in Fay's method
scale = NULL, # scaling constant for variance, set to NULL
# according to PISA's sampling method
rscales = NULL, # scaling constant for variance, set to NULL
# according to PISA's sampling method
repWgt = paste0("W_FSTURWT", 1:80), # the replicate weights,
# provided by the OECD
imp = "imp", # the imputation variable
mse = FALSE, # if TRUE, compute variances based on sum of
# squares around the point estimate, rather
# than the mean of the replicates.
group.splits = 0:1, # defining the 'levels' for which means should
# be computed. 0:1 implies that means for the
# whole sample (level 0) as well as for groups
# (level 1) are computed
cross.differences = TRUE, # defines whether (and which) cross level mean
# differences should be computed. TRUE means
# that all cross level mean differences are
# computed
crossDiffSE = "wec", # method for standard errors of mean
# differences
crossDiffSE.engine = "lm", # software implementation for standard
# errors of mean differences
hetero = TRUE, # assume heteroscedastic group variances
stochasticGroupSizes = FALSE # assume fixed group sizes
)
### call a reporting function to generate user-friendly output
results <- report(means, exclude = c("Ncases", "NcasesValid", "var", "sd"))
## End(Not run)