Imputation {plasma}    R Documentation

Imputation

Description

Functions to impute missing data in omics data sets.

Usage

meanModeImputer(X)
samplingImputer(X)

Arguments

X

A numeric matrix, where the columns represent independent observations (patients or samples) and the rows represent measured features (genes, proteins, clinical variables, etc.).

Details

We recommend imputing small amounts of missing data in the input data sets when using the plasma package. The underlying issue is that the PLS models we use for individual omics data sets will not be able to make predictions on a sample if even one data point is missing. As a result, if a sample is missing at least one data point in every omics data set, then it will be impossible to use that sample at all.
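To gauge how much data is missing before imputing, one can count the missing entries per sample and per feature. The following is a minimal sketch in base R; the matrix X is made-up toy data (not part of the plasma package), and it assumes that samples are stored in columns, as described under Arguments.

```r
# Toy matrix (hypothetical data): 3 features in rows, 4 samples in columns.
X <- matrix(c(1.2, NA,  3.1,
              0.7, 2.2, 2.9,
              NA,  2.0, 3.3,
              1.0, 2.1, NA),
            nrow = 3,
            dimnames = list(paste0("feature", 1:3), paste0("sample", 1:4)))

# Number of missing values in each sample (column) and each feature (row).
missingPerSample  <- colSums(is.na(X))
missingPerFeature <- rowSums(is.na(X))

# Samples with at least one missing value; a model that needs complete
# data could not make a prediction for these from this data set alone.
incomplete <- names(missingPerSample)[missingPerSample > 0]

# Overall fraction of missing entries in the matrix.
fractionMissing <- mean(is.na(X))
```

If the overall fraction is small, simple imputation is likely to be adequate; a feature or sample that is mostly missing may be better removed than imputed.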

For a range of available imputation methods and R packages, consult the CRAN Task View on Missing Data. We also recommend the R-miss-tastic web site on missing data. Their simulations suggest that, for purposes of producing predictive models from omics data, the choice of imputation method is not particularly important. For that reason, we have only implemented two simple imputation methods in the plasma package:

  1. The meanModeImputer function will replace any missing data by the mean value of the observed data if there are more than five distinct values; otherwise, it will replace missing data by the mode. This approach works relatively well for both continuous data and for binary or small categorical data.

  2. The samplingImputer function replaces missing values by sampling randomly from the observed data distribution.
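The two rules described above can be sketched in base R as follows. This is an illustration only, not the plasma source code: the helper names sketchMeanMode, sketchSampling, and imputeRows are hypothetical, and the sketch assumes that imputation is done separately within each feature (row), that ties for the mode go to the first value encountered, and that sampling is done with replacement.

```r
# Hypothetical helpers for illustration; NOT the plasma implementation.

# Mean/mode rule for a single feature vector x.
sketchMeanMode <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)        # nothing to impute from
  distinct <- unique(observed)
  if (length(distinct) > 5) {
    fill <- mean(observed)                    # more than five values: use the mean
  } else {
    counts <- tabulate(match(observed, distinct))
    fill <- distinct[which.max(counts)]       # otherwise: most frequent value
  }
  x[is.na(x)] <- fill
  x
}

# Sampling rule: draw replacements from the observed values of x.
sketchSampling <- function(x) {
  missing <- is.na(x)
  observed <- x[!missing]
  if (length(observed) == 0 || !any(missing)) return(x)
  # Sample indices rather than values, so a single observed number
  # is not misread by sample() as an upper limit.
  pick <- sample.int(length(observed), sum(missing), replace = TRUE)
  x[missing] <- observed[pick]
  x
}

# Apply a per-feature rule to every row, keeping dimensions and names.
imputeRows <- function(X, f) {
  out <- X
  for (i in seq_len(nrow(X))) out[i, ] <- f(X[i, ])
  out
}

set.seed(1)  # the sampling rule is random; fix the seed for repeatability
X <- rbind(binary     = c(0, 1, 1, NA, 1, 0, 1, 1),
           continuous = c(1.5, 2.5, NA, 4.0, 5.5, 6.0, 7.5, 8.0))
meanMode <- imputeRows(X, sketchMeanMode)
sampled  <- imputeRows(X, sketchSampling)
```

In this toy example, the binary row has only two distinct values, so its gap is filled with the mode, while the continuous row has seven distinct values, so its gap is filled with the mean of the observed entries.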

Value

Both functions return a numeric matrix of the same size and with the same row and column names as the input matrix.

Author(s)

Kevin R. Coombes krc@silicovore.com, Kyoko Yamaguchi kyoko.yamaguchi@osumc.edu

Examples

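# Load the ESCA example data
loadESCAdata()
# Impute each data set by sampling from the observed values
imputed <- with(plasmaEnv, lapply(assemble, samplingImputer))
# Impute each data set using the mean or the mode
imputed <- with(plasmaEnv, lapply(assemble, meanModeImputer))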

[Package plasma version 1.1.3 Index]