impute_hotdeck {simputation}    R Documentation

Hot deck imputation

Description

Hot-deck imputation methods include random and sequential hot deck, k-nearest neighbours imputation and predictive mean matching.

Usage

impute_rhd(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  prob,
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)

impute_shd(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  order = c("locf", "nocb"),
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)

impute_pmm(
  dat,
  formula,
  predictor = impute_lm,
  pool = c("complete", "univariate", "multivariate"),
  ...
)

impute_knn(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  k = 5,
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)

Arguments

dat

[data.frame], with variables to be imputed and their predictors.

formula

[formula] imputation model description (see Details below).

pool

[character] Specify the donor pool when backend="simputation":

  • "complete". Only records for which the variables on the left-hand-side of the model formula are complete are used as donors. If a record has multiple missings, all imputations are taken from a single donor.

  • "univariate". Imputed variables are treated one by one and independently so the order of variable imputation is unimportant. If a record has multiple missings, separate donors are drawn for each missing value.

  • "multivariate". A donor pool is created for each missing data pattern. If a record has multiple missings, all imputations are taken from a single donor.
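The difference between the pool options is easiest to see on a record with several missing values. The following is a minimal sketch, assuming the default "simputation" backend; the toy data frame and its column names are invented for illustration.

```r
library(simputation)

# Toy data (hypothetical): in every complete record y equals x + 10, so a
# pair (x, y) copied from a single donor is easy to recognise.
dat <- data.frame(x = c(1, 2, 3, 4, 5, NA),
                  y = c(11, 12, 13, 14, 15, NA))

set.seed(42)

# pool = "complete": both missing values in row 6 are taken from one donor.
one_donor <- impute_rhd(dat, x + y ~ 1, pool = "complete")

# pool = "univariate": x and y are imputed independently, so the two
# values in row 6 may come from different donors.
per_variable <- impute_rhd(dat, x + y ~ 1, pool = "univariate")
```

With pool="complete" the imputed pair in row 6 should match one of the five observed pairs; with pool="univariate" that is not guaranteed.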

prob

[numeric] Sampling probability weights (passed through to sample). Must be of length nrow(dat).

backend

[character] Choose the backend for imputation. If backend="VIM" the variables used to sort the data (in case of sequential hot deck) may not coincide with imputed variables.

...

Further arguments, passed to

  • order for impute_shd when backend="simputation".

  • VIM::hotdeck for impute_shd and impute_rhd when backend="VIM".

  • VIM::kNN for impute_knn when backend="VIM".

  • The predictor function for impute_pmm.

order

[character] Last Observation Carried Forward ("locf") or Next Observation Carried Backward ("nocb"). Only for backend="simputation".

predictor

[function] Imputation function to use for the predictive part of predictive mean matching. Any of the impute_ functions of this package can be used, although it makes no sense to use a hot-deck imputation here.

k

[numeric] Number of nearest neighbours to draw the donor from.

Model specification

Formulas are of the form

IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

The left-hand-side of the formula object lists the variable or variables to be imputed. The interpretation of the independent variables on the right-hand-side depends on the imputation method.

If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.

Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
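As an illustration of the formula layout, the sketch below imputes one variable by random hot deck and restricts donors to records in the same group. It assumes the default "simputation" backend and uses the built-in iris data with a few values blanked out for the purpose of the example.

```r
library(simputation)

dat <- iris
dat[c(1, 51, 101), "Sepal.Length"] <- NA   # one missing value per species

set.seed(1)

# IMPUTED_VARIABLES ~ MODEL_SPECIFICATION | GROUPING_VARIABLES
# Left-hand side: the variable to impute.
# Right-hand side: `1`, i.e. no further model terms.
# After `|`: donors are drawn separately within each Species.
out <- impute_rhd(dat, Sepal.Length ~ 1 | Species)
```

Because imputation happens independently for each group, the value imputed in row 1 should come from another setosa record.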

Methodology

Random hot deck imputation with impute_rhd can be applied to numeric, categorical or mixed data. A missing value is copied from a sampled record. Optionally, donors are sampled within groups or with non-uniform sampling probabilities. See Andridge and Little (2010) for an overview of hot deck imputation methods.
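The idea behind random hot deck can be sketched in a few lines of base R. This is not the package's implementation; random_hot_deck is a hypothetical helper written only to show the sampling step for a single variable.

```r
# Hypothetical helper, not part of simputation: fill each missing value of x
# with a value sampled (with replacement) from the observed values of x.
random_hot_deck <- function(x, prob = NULL) {
  miss <- is.na(x)
  donors <- which(!miss)
  if (!any(miss) || length(donors) == 0L) return(x)
  # Optional sampling weights, restricted to the donor records.
  w <- if (is.null(prob)) NULL else prob[donors]
  # Sample positions within `donors`, which is safe even with a single donor.
  pick <- donors[sample.int(length(donors), size = sum(miss),
                            replace = TRUE, prob = w)]
  x[miss] <- x[pick]
  x
}

set.seed(1)
x <- c(2.5, NA, 4.1, NA, 3.3)
g <- c("a", "a", "a", "b", "b")

y <- random_hot_deck(x)                           # donors from all records
y_grouped <- ave(x, g, FUN = random_hot_deck)     # donors from the same group
```

In the grouped call the only observed value in group "b" is 3.3, so the missing value in that group can only receive 3.3.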

Sequential hot deck imputation with impute_shd can be applied to numeric, categorical, or mixed data. The dataset is sorted using the ‘predictor variables’. Missing values or combinations thereof are copied from the previous record where the value(s) are available in the case of LOCF, and from the next such record in the case of NOCB.
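The sort-then-carry logic can be illustrated with base R. The helpers locf and nocb below are hypothetical and only sketch the principle for one variable; in this sketch, values with no earlier (or later) donor stay missing, which need not match what the package does.

```r
# Hypothetical helpers, not part of simputation.
locf <- function(x) {
  obs <- !is.na(x)
  idx <- cumsum(obs)        # position of the most recent observed value
  idx[idx == 0L] <- NA      # nothing observed yet: leave missing
  x[obs][idx]
}
nocb <- function(x) rev(locf(rev(x)))

dat <- data.frame(sorter = c(3, 1, 2, 5, 4),
                  value  = c(NA, 10, NA, 50, 40))

ord <- order(dat$sorter)    # sort on the 'predictor variable'
sorted <- dat$value[ord]    # 10, NA, NA, 40, 50

imputed_locf <- dat
imputed_locf$value[ord] <- locf(sorted)   # write back in original row order

imputed_nocb <- dat
imputed_nocb$value[ord] <- nocb(sorted)
```

After sorting, both missing values sit between the records with values 10 and 40, so LOCF fills them with 10 and NOCB fills them with 40.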

Predictive mean matching with impute_pmm can be applied to numeric data. Missing values or combinations thereof are first imputed using a predictive model. Next, these predictions are replaced with observed (combinations of) values nearest to the prediction. The nearest value is the observed value with the smallest absolute deviation from the prediction.
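The two steps, predict and then match, can be sketched in base R for a single numeric variable. This is an illustration under simplifying assumptions (a linear model as predictor, complete predictor values), not the package's implementation; pmm_sketch is a hypothetical helper.

```r
# Hypothetical helper, not part of simputation.
pmm_sketch <- function(dat, target, predictor) {
  miss <- is.na(dat[[target]])
  # Step 1: fit a predictive model on the observed records and predict
  # the target for records where it is missing.
  fit <- lm(reformulate(predictor, response = target),
            data = dat[!miss, , drop = FALSE])
  pred <- predict(fit, newdata = dat[miss, , drop = FALSE])
  # Step 2: replace each prediction by the observed value with the
  # smallest absolute deviation from it.
  observed <- dat[[target]][!miss]
  nearest <- vapply(pred, function(p) which.min(abs(observed - p)), integer(1))
  dat[[target]][miss] <- observed[nearest]
  dat
}

dat <- iris
dat[c(1, 51, 101), "Sepal.Length"] <- NA
out <- pmm_sketch(dat, target = "Sepal.Length", predictor = "Petal.Length")
```

Every imputed value is therefore a value that actually occurs in the observed data, unlike a plain model-based prediction.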

K-nearest neighbour imputation with impute_knn can be applied to numeric, categorical, or mixed data. For each record containing missing values, the k most similar completed records are determined based on Gower's (1971) similarity coefficient. From these records the actual donor is sampled.
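The procedure can be sketched in base R. The sketch uses a simplified Gower distance (one minus the similarity): range-scaled absolute differences for numeric variables and simple mismatch for categorical ones, averaged over variables. It assumes complete predictors with non-zero range; knn_sketch is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper, not part of simputation.
knn_sketch <- function(dat, target, predictors, k = 5) {
  miss <- is.na(dat[[target]])
  donors <- which(!miss)
  if (length(donors) == 0L) return(dat)
  for (i in which(miss)) {
    # Per-variable dissimilarity (each in [0, 1]) between record i and donors.
    d <- vapply(predictors, function(v) {
      x <- dat[[v]]
      if (is.numeric(x)) {
        abs(x[donors] - x[i]) / diff(range(x))
      } else {
        as.numeric(as.character(x[donors]) != as.character(x[i]))
      }
    }, numeric(length(donors)))
    d <- matrix(d, nrow = length(donors))
    gower_distance <- rowMeans(d)
    # The k most similar donors, then one of them sampled as actual donor.
    nearest <- donors[order(gower_distance)[seq_len(min(k, length(donors)))]]
    pick <- nearest[sample.int(length(nearest), 1L)]
    dat[[target]][i] <- dat[[target]][pick]
  }
  dat
}

set.seed(1)
dat <- iris
dat[c(1, 51, 101), "Species"] <- NA
out <- knn_sketch(dat, "Species",
                  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                  k = 5)
```

Sampling among the k nearest records, rather than always taking the single nearest one, keeps some randomness in the imputed values.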

Using the VIM backend

The VIM package has efficient implementations of several popular imputation methods. In particular, its random and sequential hot deck implementation is faster and more memory-efficient than that of the current package. Moreover, VIM offers more fine-grained control over the imputation process than simputation.

If you have the VIM package installed, it can be used by setting backend="VIM" for functions supporting this option. Alternatively, one can set options(simputation.hdbackend="VIM") so it becomes the default.

simputation maps the call to the corresponding function in the VIM package: impute_rhd and impute_shd are mapped to VIM::hotdeck, and impute_knn is mapped to VIM::kNN.

By default, VIM's imputation functions add indicator variables to the original data to trace what values have been imputed. This is switched off by default for consistency with the rest of the simputation package, but it may be turned on again by setting imp_var=TRUE.
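A short sketch of the three ways of working with the VIM backend described in this section: per call, as a session default, and with VIM's indicator variables switched back on. It only runs the VIM calls when VIM is installed; the data and the choice of predictors are made up for illustration.

```r
library(simputation)

dat <- iris
dat[c(1, 51, 101), "Species"] <- NA

if (requireNamespace("VIM", quietly = TRUE)) {
  # Select the VIM backend for a single call.
  out_call <- impute_knn(dat, Species ~ Sepal.Length + Petal.Length,
                         k = 5, backend = "VIM")

  # Make VIM the default backend, then restore the previous setting.
  old <- options(simputation.hdbackend = "VIM")
  out_default <- impute_knn(dat, Species ~ Sepal.Length + Petal.Length, k = 5)
  options(old)

  # Ask VIM to keep its imputation indicator variables.
  out_flagged <- impute_knn(dat, Species ~ Sepal.Length + Petal.Length,
                            k = 5, backend = "VIM", imp_var = TRUE)
}
```

With imp_var=TRUE the result is expected to contain extra columns marking which values were imputed, so it has more columns than the input.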

References

Andridge, R.R. and Little, R.J., 2010. A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), pp.40-64.

Gower, J.C., 1971. A general coefficient of similarity and some of its properties. Biometrics, 27(4), pp.857-871.

See Also

Other imputation: impute_cart(), impute_lm(), impute()


[Package simputation version 0.2.8 Index]