R: Missing Value Imputation with kNN for High-Dimensional Data

knncatimputeLarge {scrime}

R Documentation

Missing Value Imputation with kNN for High-Dimensional Data

Description

Imputes missing values in a high-dimensional matrix composed of categorical variables using k Nearest Neighbors.

Usage

knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL,
   nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), 
   n.num = 100, use.weights = TRUE, verbose = FALSE)

Arguments

`data`	a numeric matrix consisting of integers between 1 and `n_cat`, where `n_cat` is the maximum number of levels the categorical variables can take. If `mat.na` is specified, `data` is assumed to contain only non-missing data, and the rows of `data` are used to impute the missing values in `mat.na`. Otherwise, `data` is also allowed to contain missing values, and the missing values in the rows of `data` are imputed by employing the rows of `data` showing no missing values. Each row of `data` represents one of the objects that should be used to identify the `k` nearest neighbors, i.e., if the `k` nearest variables should be used to replace the missing values, then each row must represent one of the variables. If the `k` nearest observations should be used to impute the missing values, then each row must correspond to one of the observations.
`mat.na`	a numeric matrix containing missing values. Must have the same number of columns as `data`. All non-missing values must be integers between 1 and `n_cat`. If `NULL`, `data` is assumed to also contain the rows with missing values.
`fac`	a numeric or character vector of length `nrow(data)` specifying the values of a factor used to split `data` into subsets. If, e.g., the values of `fac` are given by the chromosomes to which the SNPs represented by the rows of `data` belong, then `k` nearest neighbors is applied chromosomewise to the missing values in `mat.na` (or `data`). If `NULL`, no such splitting is done. Must be specified if `fac.na` is specified.
`fac.na`	a numeric or character vector of length `nrow(mat.na)` specifying the values of a factor by which `mat.na` is split into subsets. Each possible value of `fac.na` must occur at least `nn` times in `fac`. Must be specified if both `fac` and `mat.na` are specified. If both `fac` and `fac.na` are `NULL`, then no splitting is done.
`nn`	an integer specifying `k`, i.e., the number of nearest neighbors used to impute the missing values.
`distance`	character string naming the distance measure used in `k` Nearest Neighbors. Must be either `"smc"` (default), `"cohen"`, `"snp1norm"` (which denotes the Manhattan distance for SNPs), or `"pcc"`.
`n.num`	an integer giving the number of rows of `mat.na` considered simultaneously when replacing the missing values in `mat.na`.
`use.weights`	should weighted `k` nearest neighbors be used to impute the missing values? If `TRUE`, the votes of the nearest neighbors are weighted by the reciprocal of their distances to the variable (or observation) whose missing values are imputed.
`verbose`	should more information about the progress of the imputation be printed?
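
The interplay of `nn`, `distance`, and `use.weights` can be illustrated with a small base R sketch. This is not the code used by scrime: the helper names `mismatch_dist` and `impute_row` are hypothetical, the distance is assumed to be the proportion of mismatching entries over the observed columns, and the guard against zero distances is an assumption of the sketch.

```r
# Illustrative sketch only; not the scrime implementation.
# Assumed distance: proportion of mismatching entries between two rows,
# computed over the columns in which the target row is observed.
mismatch_dist <- function(target, candidate) {
  obs <- !is.na(target)
  mean(target[obs] != candidate[obs])
}

# Impute the missing entries of `target` from the complete rows of `data`.
impute_row <- function(target, data, nn = 3, use.weights = TRUE) {
  d <- apply(data, 1, mismatch_dist, target = target)
  nearest <- order(d)[seq_len(nn)]
  # Votes weighted by reciprocal distance; the small constant avoids a
  # division by zero (an assumption of this sketch).
  w <- if (use.weights) 1 / pmax(d[nearest], 1e-8) else rep(1, nn)
  for (j in which(is.na(target))) {
    votes <- tapply(w, data[nearest, j], sum)
    target[j] <- as.integer(names(votes)[which.max(votes)])
  }
  target
}

data <- rbind(c(1, 2, 3, 2, 2),   # distance 0.25 to target
              c(3, 1, 1, 1, 3),   # distance 0.75
              c(1, 1, 1, 2, 3),   # distance 0.75
              c(3, 1, 1, 3, 1))   # distance 1
target <- c(1, 2, 3, 1, NA)

weighted   <- impute_row(target, data, nn = 3, use.weights = TRUE)
unweighted <- impute_row(target, data, nn = 3, use.weights = FALSE)
```

In this toy data the closest row votes for category 2 with weight 4, while the two more distant neighbors vote for category 3 with a combined weight of about 2.67, so the weighted vote and the unweighted majority vote disagree.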

Value

If mat.na = NULL, then a matrix of the same size as data in which the missing values have been replaced. If mat.na has been specified, then a matrix of the same size as mat.na in which the missing values have been replaced.

Note

While in knncatimpute all variables/rows are considered when replacing missing values, knncatimputeLarge only considers the rows with no missing values when searching for the k nearest neighbors.
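
Which rows act as candidate neighbors under this restriction can be inspected with base R. The following is a minimal sketch on made-up data; the object names `donors` and `to.fill` are hypothetical, and the code only mirrors the split described in this note, not the package internals.

```r
set.seed(1)
mat <- matrix(sample(3, 60, TRUE), nrow = 10)
mat[cbind(c(2, 5, 5), c(1, 3, 6))] <- NA      # rows 2 and 5 get missing values

complete <- complete.cases(mat)               # TRUE for rows without any NA
donors   <- mat[complete, , drop = FALSE]     # rows searched for neighbors
to.fill  <- mat[!complete, , drop = FALSE]    # rows whose NAs are to be imputed
```

The two pieces correspond to the roles that `data` and `mat.na` play when `mat.na` is specified.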

Author(s)

Holger Schwender, holger.schwender@udo.edu

References

Schwender, H. and Ickstadt, K. (2008). Imputing Missing Genotypes with k Nearest Neighbors. Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.

Examples

## Not run: 
# Generate a data set consisting of 100 columns and 2000 rows (actually,
# knncatimputeLarge is made for much larger data sets), where the values
# are randomly drawn from the integers 1, 2, and 3.
# Afterwards, randomly set 20 of the values to NA.

mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA

# Apply knncatimputeLarge to mat to remove the missing values.

mat2 <- knncatimputeLarge(mat)
sum(is.na(mat))
sum(is.na(mat2))

# Now assume that the first 100 rows belong to SNPs from chromosome 1,
# the second 100 rows to SNPs from chromosome 2, and so on.

chromosome <- rep(1:20, each = 100)

# Apply knncatimputeLarge to mat chromosomewise, i.e. only consider
# the SNPs that belong to the same chromosome when replacing missing
# genotypes.

mat4 <- knncatimputeLarge(mat, fac = chromosome)


## End(Not run)
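
When `mat.na` is supplied together with `fac` and `fac.na`, each value of `fac.na` must occur at least `nn` times in `fac`. A minimal base R sketch of such a check before calling knncatimputeLarge follows; the values in `fac.na` are made up for illustration, and the check itself is not part of scrime.

```r
nn     <- 3
fac    <- rep(1:20, each = 100)   # chromosome of each complete row
fac.na <- c(1, 1, 7, 20)          # chromosome of each row with NAs (made up)

# Number of complete rows available for each chromosome found in fac.na.
n.available <- vapply(unique(fac.na), function(v) sum(fac == v), integer(1))
enough <- all(n.available >= nn)
```

If `enough` is `FALSE`, at least one subset has fewer than `nn` candidate rows and the requirement on `fac.na` is not met.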

[Package scrime version 1.3.5 Index]