R: Nonparametric MLE deconvolution under heteroskedastic normal...

deconv_npmle {eivtools}

R Documentation

Nonparametric MLE deconvolution under heteroskedastic normal error

Description

Implements a version of the Rabe-Hesketh et al. (2003) algorithm for computing the nonparametric MLE of a univariate latent variable distribution from observed error-prone measures. Allows for normal heteroskedastic measurement error with variance that depends on the latent variable, such as with estimates of latent ability from item response theory models.

Usage

deconv_npmle(W, csem,
gridspec = list(fixed=FALSE, xmin=-5, xmax=5, numpoints=2000),
lambda = 0.0005, lltol = 1e-7, psmall = 0.00005,
discrete = FALSE, quietly = FALSE)

Arguments

`W`	Vector of observed measures, where `W[i]` is assumed to be an error-prone measure of a latent variable `X[i]`
`csem`	A function of a single variable `x` that returns the conditional standard deviation of `W` given `X=x`. It needs to be able to accept a vector input and return a vector of the same length as the input.
`gridspec`	A named list specifying the grid over which the NPMLE will be computed. It must have a logical component `fixed` indicating if the NPMLE grid is fixed (TRUE), or if the final grid is chosen by optimization over a candidate grid (FALSE). The default is FALSE. The remaining components of `gridspec` specify the grid in one of two mutually exclusive ways. In the first way, `gridspec` must contain elements `xmin` providing the minimum grid value, `xmax` providing the maximum grid value, and `numpoints` providing the desired number of points. In this case, the grid will be `numpoints` equally-spaced values ranging from `xmin` to `xmax`. In the second way, `gridspec` must contain an element `grid`, a numeric vector providing the actual desired grid. It must be an arbitrary sequence of increasing numeric values with no missing values.
`lambda`	Step size, on probability scale, in Rabe-Hesketh et al. (2003) algorithm. See reference for details.
`lltol`	Algorithm stops when the improvement to the log likelihood does not exceed `lltol.`
`psmall`	If a mass point in the estimated latent distribution evolves to have probability less than `psmall,` it gets dropped.
`discrete`	Not currently implemented.
`quietly`	If `FALSE` (the default), prints iteration-by-iteration progress.

Details

The assumed model is W = X + U where the conditional distribution of U given X = x is assumed to be normal with mean zero and standard deviation csem(x). The function uses W to estimate a discrete latent distribution for X that maximizes the likelihood of the observed data. The function optimizes the mass points (among a grid of candidate values) and the associated probabilities.

In the special case of homoskedastic error, the function csem must be defined such that when passed a vector of length n, it returns a vector of length n where each element is a common constant.

The default values of xmin and xmin in gridspec are generally appropriate only for a latent variable on a standardized scale with mean zero and variance one, and should be set to appropriate values given the scale of W.

Value

A list with elements

gridspec: Information about the initial grid
.history: Iteration-by-iteration evolution of the estimated distribution, if gridspec$fixed is FALSE. Otherwise it is an empty list
px: A dataframe providing the final NPMLE distribution. There are as many rows as there are mass points in the estimated distribution; fields described below
reliability: An estimate of the reliability of W, equal to the estimated variance of X divided by the sample variance of W
simex_varfuncs: A dataframe with as many rows as there are unique values of W, providing estimated plug-in variance functions to use for SIMEX data generation with latent heteroskedastic error as described in Lockwood and McCaffrey (forthcoming); see references. Fields described below

The fields of px are:

x: Location of mass point
csem: Value of function csem at mass point
p: probability at mass point
ll: log likelihood at solution
ex: Estimate of mean of latent distribution
varx: Estimate of variance of latent distribution

The fields of simex_varfuncs are:

W: Unique observed values w of W
gW: The square of csem evaluated at W = w
gEXW: The square of csem evaluated at E[X | W=w], the conditional mean of X given W=w
EgXW: The conditional mean of the square of csem of X given W=w, equal to E[g(X) | W=w]

Author(s)

J.R. Lockwood jrlockwood@ets.org

References

Lockwood J.R. and McCaffrey D.F. (2014). “Correcting for test score measurement error in ANCOVA models for estimating treatment effects,” Journal of Educational and Behavioral Statistics 39(1):22-52.

Lockwood J.R. and McCaffrey D.F. (2017). “Simulation-extrapolation with latent heteroskedastic variance,” Psychometrika 82(3):717-736.

Rabe-Hesketh S., Pickles A. and Skrondal A. (2003). “Correcting for covariate measurement error in logistic regression using nonparametric maximum likelihood estimation,” Statistical Modelling 3:215-232.

Examples

data(testscores)

## get the unique values of the lag 1 math score and CSEM
## values and approximate the CSEM function using approxfun()
tmp <- unique(testscores[,c("math_lag1","math_lag1_csem")])
print(tmp <- tmp[order(tmp$math_lag1),])

.csem <- approxfun(tmp$math_lag1, tmp$math_lag1_csem, rule=2:2)
plot(tmp$math_lag1, tmp$math_lag1_csem)
lines(tmp$math_lag1, .csem(tmp$math_lag1), col="blue")

## get NPMLE distribution of latent lag 1 math achievement
m <- deconv_npmle(W        = testscores$math_lag1,
                  csem     = .csem,
                  gridspec = list(fixed = FALSE,
                                  xmin = min(testscores$math_lag1),
                                  xmax = max(testscores$math_lag1),
                                  numpoints = 10000),
                  quietly  = TRUE)

print(m$px)

## estimated mean is approximately the mean of W, but
## the estimated variance is less than the variance of W,
## as it should be
print(c(empirical = mean(testscores$math_lag1),
        estimated = m$px$ex[1]))

print(c(empirical = var(testscores$math_lag1),
        estimated = m$px$varx[1]))

## estimated reliability of W:
print(m$reliability)

## if implementing SIMEX, simex_varfuncs provides plug-in
## options to use for the heteroskedastic error variance
## of each observed W
print(m$simex_varfuncs)

## simple "value-added" estimates of school effects on math,
## adjusting for measurement error in the lag 1 math score.
testscores$schoolid <- factor(testscores$schoolid)

meiv <- eivreg(math ~ math_lag1 + sped + frl + schoolid,
         data = testscores,
         reliability = c(math_lag1 = m$reliability),
         contrasts = list(schoolid = "contr.sum"))

print(summary(meiv))

## alternative deconvolution with fixed grid
m <- deconv_npmle(W        = testscores$math_lag1,
                  csem     = .csem,
                  gridspec = list(fixed = TRUE,
                                  xmin = min(testscores$math_lag1),
                                  xmax = max(testscores$math_lag1),
                                  numpoints = 40),
                  quietly  = TRUE)

print(m$px)

[Package eivtools version 0.1-8 Index]