kNN {VIM} | R Documentation |
k-Nearest Neighbour Imputation
Description
k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continuous variables.
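To give a feel for the kind of distance involved, here is a minimal base R sketch of a Gower-type distance between two rows. It is an illustration only and not VIM's implementation: the helper gower_sketch() and the toy data are hypothetical, it covers only numeric and categorical columns (no ordered or semi-continuous handling), it assumes no missing values in the compared rows, and it weights all variables equally.

```r
# Illustrative Gower-type distance between rows i and j of a data.frame.
# Hypothetical helper for explanation; not part of VIM.
gower_sketch <- function(data, i, j) {
  contrib <- vapply(names(data), function(v) {
    x <- data[[v]]
    if (is.numeric(x)) {
      # Numeric: absolute difference scaled by the observed range
      r <- diff(range(x, na.rm = TRUE))
      if (r == 0) 0 else abs(x[i] - x[j]) / r
    } else {
      # Categorical: 0 if the categories match, 1 otherwise
      as.numeric(as.character(x[i]) != as.character(x[j]))
    }
  }, numeric(1))
  # Unweighted average of the per-variable contributions
  mean(contrib)
}

# Hypothetical toy data
toy <- data.frame(
  height = c(150, 160, 200),
  colour = factor(c("red", "red", "blue"))
)

d12 <- gower_sketch(toy, 1, 2)  # similar rows
d13 <- gower_sketch(toy, 1, 3)  # dissimilar rows
```

Each variable contributes a value between 0 and 1, so variables on very different scales can be combined into a single distance.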
Usage
kNN(
data,
variable = colnames(data),
metric = NULL,
k = 5,
dist_var = colnames(data),
weights = NULL,
numFun = median,
catFun = maxCat,
makeNA = NULL,
NAcond = NULL,
impNA = TRUE,
donorcond = NULL,
mixed = vector(),
mixed.constant = NULL,
trace = FALSE,
imp_var = TRUE,
imp_suffix = "imp",
addRF = FALSE,
onlyRF = FALSE,
addRandom = FALSE,
useImputedDist = TRUE,
weightDist = FALSE,
methodStand = "range",
ordFun = medianSamp
)
Arguments
data |
data.frame or matrix |
variable |
variables where missing values should be imputed |
metric |
metric to be used for calculating the distances between observations |
k |
number of Nearest Neighbours used |
dist_var |
names of variables to be used for distance calculation |
weights |
weights for the variables for distance calculation.
If |
numFun |
function for aggregating the k Nearest Neighbours in the case of a numerical variable |
catFun |
function for aggregating the k Nearest Neighbours in the case of a categorical variable |
makeNA |
list of length equal to the number of variables, with values, that should be converted to NA for each variable |
NAcond |
list of length equal to the number of variables, with a condition for imputing an NA |
impNA |
TRUE/FALSE whether NA should be imputed |
donorcond |
list of length equal to the number of variables, with a donor condition given as a character string, e.g. a list element can be ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable. |
mixed |
names of mixed variables |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
trace |
TRUE/FALSE if additional information about the imputation process should be printed |
imp_var |
TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
addRF |
TRUE/FALSE each variable will be modelled using random forest regression ( |
onlyRF |
TRUE/FALSE; if TRUE, only the additional distance variables created from random forest regression will be used as distance variables. |
addRandom |
TRUE/FALSE if an additional random variable should be added for distance calculation |
useImputedDist |
TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables. |
weightDist |
TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step |
methodStand |
either "range" or "iqr" to be used in the standardization of numeric variables in the Gower distance |
ordFun |
function for aggregating the k Nearest Neighbours in the case of an ordered factor variable |
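The roles of k, numFun and catFun can be pictured with a small base R sketch of the aggregation step. This is a simplified illustration with made-up distances and donor values, not VIM's internal code; in particular, the most-frequent-category rule below only resembles maxCat in spirit and ignores how ties are resolved.

```r
# Hypothetical toy values: distances from one incomplete row to five donors
donor_dist  <- c(0.10, 0.40, 0.05, 0.30, 0.20)
donor_value <- c(10, 50, 12, 40, 20)        # numeric variable to impute
donor_class <- c("a", "b", "a", "b", "b")   # categorical variable to impute

k <- 3
# Indices of the k donors with the smallest distance
nearest <- order(donor_dist)[seq_len(k)]

# Numeric case: aggregate the k donor values with the median
# (the idea behind numFun = median)
num_imputed <- median(donor_value[nearest])

# Categorical case: take the most frequent category among the k donors
# (tie handling is deliberately left out of this sketch)
cat_imputed <- names(which.max(table(donor_class[nearest])))
```

Swapping median for another function in this sketch corresponds to passing a different numFun, and a larger k draws on more, but more distant, donors.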
Value
the imputed data set.
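As a hedged illustration of what the returned object looks like, the sketch below imputes two variables of the sleep data and inspects the result. It assumes the VIM package is installed and that the indicator columns are named by joining the variable name and imp_suffix with an underscore (for example NonD_imp); the choice of distance variables is arbitrary and only meant as an example.

```r
if (requireNamespace("VIM", quietly = TRUE)) {
  library(VIM)
  data(sleep, package = "VIM")

  # Impute only NonD and Dream; use fully observed columns for the distances
  imputed <- kNN(
    sleep,
    variable = c("NonD", "Dream"),
    dist_var = c("BodyWgt", "BrainWgt", "Pred", "Exp", "Danger"),
    k = 5
  )

  # The imputed variables should no longer contain NA
  print(colSums(is.na(imputed[, c("NonD", "Dream")])))

  # Logical indicator column flagging which cells were imputed
  print(table(imputed$NonD_imp))
}
```

Setting imp_var = FALSE would return the data without the indicator columns.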
Author(s)
Alexander Kowarik, Statistik Austria
References
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
See Also
Other imputation methods: hotdeck(), impPCA(), irmi(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat()
Examples
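library(VIM)
data(sleep, package = "VIM")
# Impute all variables using the defaults (k = 5, median and maxCat aggregation)
kNN(sleep)
# Aggregate numeric donors with a distance-weighted mean; weightedMean() is from laeken
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist = TRUE)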