impute_lm {simputation} | R Documentation
(Robust) Linear Regression Imputation
Description
Regression imputation methods, including linear regression, robust linear
regression with M-estimators, and regularized (lasso/elastic net/ridge)
regression.
Usage
impute_lm(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  ...
)

impute_rlm(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  ...
)

impute_en(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  family = c("gaussian", "poisson"),
  s = 0.01,
  ...
)
Arguments
dat
  A data frame containing the variables to be imputed and their predictors.

formula
  A formula specifying the variables to impute and the imputation model
  (see 'Model specification' below).

add_residual
  Add residuals to the imputed values? One of "none" (default),
  "observed" (draw from the observed residuals), or "normal" (draw from a
  normal distribution with standard deviation estimated from the residuals).

na_action
  Function determining how missing values in the training data are handled;
  defaults to na.omit, so records with missings in the predicted variable or
  the predictors are excluded from model estimation.

...
  Further arguments passed to the underlying estimator: lm for impute_lm,
  MASS::rlm for impute_rlm, and glmnet::glmnet for impute_en.

family
  Response type for elasticnet / lasso regression. For "gaussian" the imputed
  variable is treated as a general numeric variable; for "poisson" as a
  nonnegative count.

s
  The penalty parameter value at which predictions are computed for
  lasso/elasticnet regression (passed on when predicting from the fitted
  glmnet model).
Value
dat, but imputed where possible.
Model specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand side of the formula object lists the variable or variables to
be imputed. The right-hand side, excluding the optional GROUPING_VARIABLES,
is the model specification for the underlying predictor.
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by is also supported. If groups are defined both
in the formula and via dplyr::group_by, the data is grouped by the union of
the grouping variables. Any missing value in one of the grouping variables
results in an error.
Grouping is ignored for impute_const.
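As an illustration of the grouping syntax, the following sketch fits a
separate linear model per Species, once via the formula and once via
dplyr::group_by. It assumes the irisNA data frame constructed in the Examples
section below; the choice of predictor is illustrative only.

# Formula-based grouping: estimate and impute separately per Species
i_grp <- impute_lm(irisNA, Sepal.Length ~ Petal.Width | Species)

# Equivalent grouping via dplyr::group_by (requires dplyr)
library(dplyr)
i_grp2 <- irisNA %>%
  group_by(Species) %>%
  impute_lm(Sepal.Length ~ Petal.Width)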
Methodology
Linear regression model imputation with impute_lm can be used to impute
numerical variables based on numerical and/or categorical predictors.
Several common imputation methods, including ratio and (group) mean
imputation, can be expressed this way. See lm for details on possible model
specifications.
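For instance, the special cases mentioned above can be written directly as
impute_lm calls. This is a minimal sketch, again assuming the irisNA data
frame from the Examples section; the variable choices are illustrative.

# (Group) mean imputation: intercept-only model, optionally per group
i_mean  <- impute_lm(irisNA, Sepal.Length ~ 1)
i_gmean <- impute_lm(irisNA, Sepal.Length ~ 1 | Species)

# Ratio-type imputation: no-intercept regression on a single predictor
i_ratio <- impute_lm(irisNA, Sepal.Length ~ Petal.Length - 1)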
Robust linear regression through M-estimation with impute_rlm can be used to
impute numerical variables employing numerical and/or categorical predictors.
In M-estimation, minimization of the sum of squared residuals is replaced by
minimization of an alternative convex function of the residuals that
decreases the influence of outliers. See, e.g., Huber (2011).
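A minimal sketch of robust imputation, again assuming irisNA from the
Examples section; the extra maxit argument is only meant to show that further
arguments are passed on to MASS::rlm.

# Robust (M-estimation) imputation; '...' is forwarded to MASS::rlm
i_rob <- impute_rlm(irisNA, Sepal.Length ~ Petal.Length + Species, maxit = 50)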
Lasso/elastic net/ridge regression imputation with impute_en can be used to
impute numerical variables employing numerical and/or categorical predictors.
For this method, the regression coefficients are found by minimizing the sum
of squared residuals augmented with a penalty term that depends on the size
of the coefficients. For lasso regression (Tibshirani, 1996), the penalty
term is the sum of the absolute values of the coefficients. For ridge
regression (Hoerl and Kennard, 1970), the penalty term is the sum of the
squares of the coefficients. Elastic net regression (Zou and Hastie, 2005)
allows switching between lasso and ridge by penalizing with a weighted sum of
the absolute-value and squared-coefficient terms.
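A minimal sketch of penalized-regression imputation, again assuming irisNA
from the Examples section. The underlying glmnet fit needs at least two
predictor variables, and the value of s below is an arbitrary illustrative
choice.

# Elastic net / lasso imputation; 's' is the penalty value used for prediction
i_en <- impute_en(irisNA, Sepal.Length ~ Petal.Length + Petal.Width, s = 0.01)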
References
Huber, P.J., 2011. Robust statistics (pp. 1248-1251). Springer Berlin Heidelberg.
Hoerl, A.E. and Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), pp.55-67.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp.267-288.
Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), pp.301-320.
See Also
Getting started with simputation.

Other imputation: impute_cart(), impute_hotdeck, impute()
Examples
data(iris)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"] <- NA
# impute a single variable (Sepal.Length)
i1 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species)
# impute both Sepal.Length and Sepal.Width, using robust linear regression
i2 <- impute_rlm(irisNA, Sepal.Length + Sepal.Width ~ Species + Petal.Length)
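# Additional illustrative sketch (not from the original help page):
# add random noise drawn from the observed residuals to the imputed values
i3 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species,
                add_residual = "observed")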