R: Missing Value Imputation in Parallel

misspi {misspi}

R Documentation

Missing Value Imputation in Parallel

Description

Performs missing value imputation using embarrassingly parallel computation. Its advantages include:

Fast implementation, especially for high-dimensional datasets.
A variety of machine learning models accepted as methods through a user-friendly interface.
Support for multiple initialization methods.
Early stopping, which prevents unnecessary iterations.

Usage

misspi(
  x,
  ncore = NULL,
  init.method = "rf",
  method = "rf",
  earlystopping = TRUE,
  ntree = 100,
  init.ntree = 100,
  viselect = NULL,
  lgb.params = NULL,
  lgb.params0 = NULL,
  model.train = NULL,
  pmm = TRUE,
  nn = 3,
  intcol = NULL,
  maxiter = 10,
  rdiff.thre = 0.01,
  verbose = TRUE,
  progress = TRUE,
  nlassofold = 5,
  isis = FALSE,
  char = " * ",
  iteration = TRUE,
  ndecimal = NULL,
  ...
)

Arguments

`x`	a numeric matrix to impute; all missing values should be coded as NA.
`ncore`	number of cores to use; defaults to the number of cores detected.
`init.method`	method used to fill in the missing values before imputation. Supports "rf" for random forest imputation (the default), "mean" for mean imputation, and "median" for median imputation.
`method`	name of the imputation method. Supports "rf" for random forest, "lgb" for lightgbm, "lasso" for LASSO, or "customize" to supply your own method.
`earlystopping`	a Boolean that indicates whether to stop the algorithm once the relative difference stops decreasing; TRUE by default.
`ntree`	number of trees to use for imputation when method is "rf" or "gbm".
`init.ntree`	number of trees to use for initialization when init.method is "rf".
`viselect`	if not NULL, the number of variables to work on, chosen as those with the highest variable importance from the random forest initialization. This only takes effect when init.method is "rf" and method is "rf" or "gbm".
`lgb.params`	parameters for customizing the lightgbm models; used when method is "rf" or "gbm".
`lgb.params0`	parameters for customizing the random forest initialization; used when init.method is "rf".
`model.train`	machine learning model used for customized imputation. Only invoked when method = "customize". The model must accept a formula of the form y ~ x for fitting, where y and x are matrices, and must support prediction through the "predict" method. Additional parameters for the model can be passed through the ... argument.
`pmm`	a Boolean that indicates whether to use predictive mean matching.
`nn`	number of neighbors to use for prediction when predictive mean matching is invoked (pmm is TRUE).
`intcol`	a vector of indices of columns that are known to be integer-valued; these columns are rounded to integers in every iteration.
`maxiter`	maximum number of iterations for imputation.
`rdiff.thre`	relative difference threshold for determining the imputation convergence.
`verbose`	a Boolean that indicates whether to print the intermediate steps.
`progress`	a Boolean that indicates whether to show the progress bar.
`nlassofold`	number of folds for cross validation when the method is "lasso".
`isis`	a Boolean that indicates whether to use iterative sure independence screening (ISIS) when the method is "lasso"; recommended for ultra-high-dimensional data.
`char`	a character string used to draw the progress bar; Unicode characters are also accepted when written as escapes such as "\u2654". For example, u03c, u213c for pi, u2694 for swords, u2605 for star, u2654 for king, u26a1 for thunder, u2708 for plane.
`iteration`	a Boolean that indicates whether to use the iterative algorithm.
`ndecimal`	number of decimals to round for the result, with NULL meaning no intervention.
`...`	other arguments to be passed to the method.
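
The pmm and nn arguments refer to predictive mean matching. As a rough illustration of the general technique (not misspi's internal code), the sketch below imputes each missing entry by finding the nn observed cases whose model predictions are closest to the prediction for the missing case and borrowing the observed value of one of them at random. The helper name pmm_sketch and the toy vectors are hypothetical, and whether misspi draws a donor at random or combines donors differently is an assumption here.

```r
# Illustrative predictive mean matching for a single column.
# pmm_sketch is a hypothetical helper, not part of misspi.
pmm_sketch <- function(pred.obs, y.obs, pred.mis, nn = 3) {
  vapply(pred.mis, function(p) {
    # indices of the nn observed cases whose predictions are closest to p
    donors <- order(abs(pred.obs - p))[seq_len(min(nn, length(y.obs)))]
    # pick one donor at random and borrow its observed value
    y.obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}

set.seed(1)
y.obs    <- c(1.0, 2.0, 3.0, 4.0, 5.0)  # observed values of the column
pred.obs <- c(1.1, 1.9, 3.2, 3.8, 5.1)  # model predictions for the observed rows
pred.mis <- c(2.1, 4.9)                 # model predictions for the rows to impute
imputed  <- pmm_sketch(pred.obs, y.obs, pred.mis, nn = 3)
imputed
```

Because every imputed value is copied from an observed case, the result always stays within the range of values actually seen in the column.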

Value

A list that contains the imputed values, the time consumed, and the number of iterations.

`x.imputed`	the imputed matrix.

`time.elapsed`	time consumed by the algorithm.

`niter`	number of iterations used by the algorithm.
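
The number of iterations reported in niter depends on maxiter, rdiff.thre and earlystopping. The exact relative difference that misspi computes is not given on this page; the sketch below assumes a missForest-style measure (squared change between successive imputed matrices divided by the squared size of the new one) purely to illustrate how such a stopping rule can work. The names rel_diff and should_stop are hypothetical.

```r
# Assumed relative difference between two successive imputed matrices
# (missForest-style; misspi's own formula may differ).
rel_diff <- function(x.new, x.old) {
  sum((x.new - x.old)^2) / sum(x.new^2)
}

# Stop when the difference falls below the threshold, or, with early
# stopping enabled, when it has stopped decreasing.
should_stop <- function(rdiff, rdiff.prev, rdiff.thre = 0.01, earlystopping = TRUE) {
  converged <- rdiff < rdiff.thre
  stalled   <- earlystopping && is.finite(rdiff.prev) && rdiff >= rdiff.prev
  converged || stalled
}

x.old <- matrix(c(1, 2.0, 3, 4.0), nrow = 2)
x.new <- matrix(c(1, 2.2, 3, 4.1), nrow = 2)
rd <- rel_diff(x.new, x.old)

stop.converged <- should_stop(rd, Inf)       # below the 0.01 threshold
stop.stalled   <- should_stop(0.05, 0.03)    # difference went up: early stop
stop.continue  <- should_stop(0.05, 0.08)    # still decreasing, above threshold
```

Under this assumed rule, setting earlystopping = FALSE means the loop ends only when the threshold is met or maxiter is reached.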

Author(s)

Zhongli Jiang jiang548@purdue.edu

References

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 20(1), 40-49.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30.

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5), 849-911.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288.

Breiman, L. (2001). Random forests. Machine learning, 45, 5-32.

Examples



library(misspi)

# Quick example 1
# Load a small dataset
data(iris)
# Keep numerical columns
num.col <- which(sapply(iris, is.numeric))
iris.numeric <- as.matrix(iris[, num.col])
set.seed(0)
iris.miss <- missar(iris.numeric, 0.3, 1)
iris.impute <- misspi(iris.miss)
iris.impute

# Quick example 2
# Load a high-dimensional dataset
data(toxicity, package = "misspi")
set.seed(0)
toxicity.miss <- missar(toxicity, 0.4, 0.2)
toxicity.impute <- misspi(toxicity.miss)
toxicity.impute

# Change cores
iris.impute.5core <- misspi(iris.miss, ncore = 5)

# Change initialization and maximum iterations (maxiter = 0 means no iterations in this example)
iris.impute.mean.0iter <- misspi(iris.miss, init.method = "mean", maxiter = 0)

# Use a fun character for the progress bar
iris.impute.king <- misspi(iris.miss, char = " \u2654")


# Use variable selection
toxicity.impute.vi <- misspi(toxicity.miss, viselect = 128)


# Use different machine learning algorithms as the method
# (model.train is only invoked when method = "customize")
# Linear model
iris.impute.lm <- misspi(iris.miss, method = "customize", model.train = lm)

# From external packages
# Support Vector Machine (SVM)

library(e1071)
iris.impute.svm.radial <- misspi(iris.miss, method = "customize", model.train = svm)


# Neural Networks

library(neuralnet)
iris.impute.nn <- misspi(iris.miss, method = "customize", model.train = neuralnet)
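
Since the examples above mask known values before imputing, one natural follow-up is to check how close the imputed values are to the originals. The sketch below is one possible way to do that, using only base R: masked_rmse is a hypothetical helper, the random masking is a base R stand-in rather than missar, and column-mean imputation serves as a baseline so the block runs without misspi installed.

```r
# Hypothetical helper: RMSE over the cells that were artificially masked.
masked_rmse <- function(truth, miss, imputed) {
  masked <- is.na(miss) & !is.na(truth)
  sqrt(mean((imputed[masked] - truth[masked])^2))
}

truth <- as.matrix(iris[, 1:4])          # the four numeric iris columns
set.seed(0)
miss <- truth
miss[sample(length(miss), 60)] <- NA     # mask 60 cells at random

# Baseline: fill each column's missing cells with that column's mean.
baseline <- miss
for (j in seq_len(ncol(baseline))) {
  baseline[is.na(baseline[, j]), j] <- mean(baseline[, j], na.rm = TRUE)
}
baseline.rmse <- masked_rmse(truth, miss, baseline)
baseline.rmse

# With misspi installed, the same helper can be applied to its output
# (not run here):
# masked_rmse(truth, miss, misspi(miss)$x.imputed)
```

Comparing the two RMSE values gives a quick sense of how much a model-based imputation improves on the simple baseline for a given dataset.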

[Package misspi version 0.1.0 Index]