R: k-Nearest Neighbour Imputation

kNN {VIM}

R Documentation

k-Nearest Neighbour Imputation

Description

k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continous variables.

Usage

kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE,
  methodStand = "range",
  ordFun = medianSamp
)

Arguments

`data`	data.frame or matrix
`variable`	variables where missing values should be imputed
`metric`	metric to be used for calculating the distances between
`k`	number of Nearest Neighbours used
`dist_var`	names or variables to be used for distance calculation
`weights`	weights for the variables for distance calculation. If `weights = "auto"` weights will be selected based on variable importance from random forest regression, using function `ranger::ranger()`. Weights are calculated for each variable seperately.
`numFun`	function for aggregating the k Nearest Neighbours in the case of a numerical variable
`catFun`	function for aggregating the k Nearest Neighbours in the case of a categorical variable
`makeNA`	list of length equal to the number of variables, with values, that should be converted to NA for each variable
`NAcond`	list of length equal to the number of variables, with a condition for imputing a NA
`impNA`	TRUE/FALSE whether NA should be imputed
`donorcond`	list of length equal to the number of variables, with a donorcond condition as character string. e.g. a list element can be ">5" or c(">5","<10). If the list element for a variable is NULL no condition will be applied for this variable.
`mixed`	names of mixed variables
`mixed.constant`	vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability
`trace`	TRUE/FALSE if additional information about the imputation process should be printed
`imp_var`	TRUE/FALSE if a TRUE/FALSE variables for each imputed variable should be created show the imputation status
`imp_suffix`	suffix for the TRUE/FALSE variables showing the imputation status
`addRF`	TRUE/FALSE each variable will be modelled using random forest regression (`ranger::ranger()`) and used as additional distance variable.
`onlyRF`	TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables.
`addRandom`	TRUE/FALSE if an additional random variable should be added for distance calculation
`useImputedDist`	TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables.
`weightDist`	TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step
`methodStand`	either "range" or "iqr" to be used in the standardization of numeric vaiables in the gower distance
`ordFun`	function for aggregating the k Nearest Neighbours in the case of a ordered factor variable

Value

the imputed data set.

Author(s)

Alexander Kowarik, Statistik Austria

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples


data(sleep)
kNN(sleep)
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist=TRUE)

[Package VIM version 6.2.2 Index]