
varselbest {clusterMI}

R Documentation

Variable selection for specifying conditional imputation models

Description

varselbest performs variable selection from an incomplete dataset (see Bar-Hen and Audigier (2022) <doi:10.1080/00949655.2022.2070621>) in order to specify the imputation models to use for FCS imputation methods.

Usage

varselbest(
  data.na = NULL,
  res.imputedata = NULL,
  listvar = NULL,
  nb.clust = NULL,
  nnodes = 1,
  sizeblock = 5,
  method.select = "knockoff",
  B = 200,
  r = 0.3,
  graph = TRUE,
  printflag = TRUE,
  path.outfile = NULL,
  mar = c(2, 4, 2, 0.5) + 0.1,
  cex.names = 0.7,
  modelNames = NULL
)

Arguments

`data.na`	a dataframe with only numeric variables
`res.imputedata`	an output from `imputedata`
`listvar`	a character vector indicating for which subset of incomplete variables variable selection must be performed. By default all column names.
`nb.clust`	the number of clusters used for imputation
`nnodes`	number of CPU cores for parallel computing. By default, nnodes = 1
`sizeblock`	an integer indicating the number of variables sampled at each iteration
`method.select`	a single string indicating the variable selection method applied on each subset of variables
`B`	number of iterations, by default B = 200
`r`	a numerical vector (or a single real number) indicating the threshold used for each variable in listvar. Each value of r should be between 0 and 1. See details.
`graph`	a boolean. If TRUE, two graphics are plotted per variable in `listvar`: one reporting the variable importance measure of each explanatory variable and one reporting the influence of the number of iterations (B) on the importance measures
`printflag`	a boolean. If TRUE, a message is printed at each iteration. Use printflag = FALSE for silent selection.
`path.outfile`	a vector of strings indicating the path for redirection of print messages. Default value is NULL, meaning that silent imputation is performed. Otherwise, print messages are saved in the files path.outfile/output.txt. One file per node is generated.
`mar`	a numerical vector of the form c(bottom, left, top, right). Only used if graph = TRUE
`cex.names`	expansion factor for axis names (bar labels) (only used if graph = TRUE)
`modelNames`	a vector of character strings indicating the models to be fitted in the EM phase of clustering

Details

varselbest performs variable selection on random subsets of variables and then combines the results to recover which explanatory variables are related to the response. More precisely, the algorithm is outlined as follows. Consider a random subset of sizeblock variables among the p variables. When sizeblock is small, this subset is low dimensional, so missing values can be handled by a standard imputation method for clustered individuals. Any variable selection scheme can then be applied (lasso, stepwise and knockoff are available through the method.select argument). By resampling a subset of size sizeblock among the p variables B times, we can count how many times a variable is considered significantly related to the response and how many times it is not. A threshold (r) must then be defined to conclude whether a given variable is significantly related to the response.
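
The resampling scheme described above can be illustrated with a small base R sketch. It is not the clusterMI implementation: it runs on simulated complete data (so no imputation step is involved), it replaces the lasso, stepwise and knockoff selectors with a simple linear-model p-value rule, and it assumes the importance of a variable is the share of draws in which it was selected, compared with r using `>=`. The names `X`, `y`, `n_drawn` and `n_selected` are hypothetical.

```r
# Toy sketch of selection on random blocks; NOT the clusterMI code.
# Assumptions: complete simulated data, p-value rule as a stand-in selector.
set.seed(1)
n <- 200
p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, "x1"] - 1.5 * X[, "x2"] + rnorm(n)  # only x1 and x2 matter

B <- 200         # number of random blocks
sizeblock <- 3   # number of candidate predictors per block
r <- 0.3         # threshold on the selection proportion

n_drawn <- n_selected <- setNames(numeric(p), colnames(X))

for (b in seq_len(B)) {
  # draw a low-dimensional block of candidate predictors
  block <- sample(colnames(X), sizeblock)
  fit <- lm(y ~ ., data = data.frame(y = y, X[, block, drop = FALSE]))
  pval <- summary(fit)$coefficients[block, "Pr(>|t|)"]
  # record how often each variable was drawn and how often it was selected
  n_drawn[block] <- n_drawn[block] + 1
  sel <- block[pval < 0.05]
  n_selected[sel] <- n_selected[sel] + 1
}

# share of draws in which each variable was selected, then thresholding
proportion <- n_selected / n_drawn
selected <- names(proportion)[proportion >= r]
```

In this sketch a larger r makes the final selection stricter, while a larger B gives each variable more draws and therefore a more stable proportion.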

Value

a list of four objects

`predictormatrix`	a numeric matrix containing 0 and 1 specifying on each line the set of predictors to be used for each target column of the incomplete dataset.
`res.varsel`	a list giving details on the variable selection procedure (only required for checking convergence by the `chooseB` function)
`proportion`	a numeric matrix of proportions indicating on each line the variable importance of each predictor
`call`	the matching call
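
As a rough illustration of how `proportion`, `r` and `predictormatrix` relate, the sketch below thresholds a small proportion matrix row by row. The numbers are made up for illustration, and the `>=` comparison and the zeroing of a variable's own column are assumptions, not taken from the package source.

```r
# Made-up proportions: one row per target variable, one column per predictor.
proportion <- matrix(c(0,    0.8, 0.1,
                       0.45, 0,   0.25),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("alco", "malic"),
                                     c("alco", "malic", "ash")))

# One threshold per target variable (r may also be a single number)
r <- c(alco = 0.3, malic = 0.4)

# A vector with one value per row is recycled down the columns,
# so each row is compared with its own threshold
predictormatrix <- (proportion >= r[rownames(proportion)]) * 1

# A variable is never used as its own predictor (assumption)
targets <- rownames(predictormatrix)
predictormatrix[cbind(targets, targets)] <- 0

predictormatrix
```

Under these assumptions, a row of `predictormatrix` lists with 1 the predictors retained in the imputation model of the corresponding target.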

References

Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.

Examples

data(wine)

require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)


nnodes <- 2 # parallel::detectCores()
B <- 150 #  Number of iterations
m <- 5 # Number of imputed data sets

# variable selection
res.varsel <- varselbest(data.na = wine.na,
                         nb.clust = nb.clust,
                         listvar = c("alco","malic"),
                         B = B,
                         nnodes = nnodes)
predictmat <- res.varsel$predictormatrix

# imputation
res.imp.select <- imputedata(data.na = wine.na, method = "FCS-homo",
                     nb.clust = nb.clust, predictmat = predictmat, m = m)

[Package clusterMI version 1.2.1 Index]