R: Variable selection method with multiple block-wise imputation...

MBI {BlockMissingData}

R Documentation

Variable selection method with multiple block-wise imputation (MBI)

Description

Fit a variable selection method with multiple block-wise imputation (MBI).

Usage

MBI(
  X,
  y,
  cov_index,
  sub_index,
  miss_source,
  complete,
  lambda = NULL,
  eps1 = 0.001,
  eps2 = 1e-07,
  eps3 = 1e-08,
  max.iter = 1000,
  lambda.min = ifelse(n > p, 0.001, 0.05),
  nlam = 100,
  beta0 = NULL,
  a = 3.7,
  gamma.ebic = 0.5,
  alpha1 = 0.5^(0:12),
  h1 = 2^(-(8:30)),
  ratio = 1
)

Arguments

`X`	Design matrix for block-wise missing covariates.
`y`	Response vector.
`cov_index`	Starting indexes of covariates in data sources.
`sub_index`	Starting indexes of subjects in missing groups.
`miss_source`	Indexes of missing data sources in missing groups, respectively ('NULL' represents no missing).
`complete`	Logical indicator of whether there is a group of complete cases. If there is a group of complete cases, it should be the first group. 'TRUE' represents that there is a group of complete cases.
`lambda`	A user supplied sequence of tuning parameter in penalty. If NULL, a sequence is automatically generated.
`eps1`	Convergence threshold at certain stage of the algorithm. Default is 1e-3.
`eps2`	Convergence threshold at certain stage of the algorithm. Default is 1e-7.
`eps3`	Convergence threshold at certain stage of the algorithm. Default is 1e-8.
`max.iter`	The maximum number of iterations allowed. Default is 1000.
`lambda.min`	Smallest value for `lambda`, as a fraction of the maximum value in `lambda`. Default depends on the size of input.
`nlam`	The number of `lambda` values. Default is 100.
`beta0`	Initial value for regression coefficients. If NULL, they are initialized automatically.
`a`	Tuning parameter in the SCAD penalty. Default is 3.7.
`gamma.ebic`	Parameter in the EBIC criterion. Default is 0.5.
`alpha1`	A sequence of candidate values for the step size in the conjugate gradient algorithm. Default is 0.5^(0:12).
`h1`	A sequence of candidate values for the parameter in the numerical calculation of the first derivative of the objective function. Default is 2^(-(8:30)).
`ratio`	Parameter in the numerical calculation of the first derivative. Default is 1.

Details

The function uses the penalized generalized method of moments with multiple block-wise imputation to handle block-wise missing data, commonly found in multi-source datasets.

Value

`beta`	Estimated coefficients matrix with `length(lambda)` rows and `dim(X)[2]` columns.
`lambda`	The actual sequence of `lambda` values used.
`bic1`	BIC criterion values. '0' should be ignored.
`notcon`	Value indicating whether the algorithm is converged or not. '0' represents convergence; otherwise non-convergence.
`intercept`	Intercept sequence of length `length(lambda)`.
`beta0`	Estimated coefficients matrix for standardized `X`

Author(s)

Fei Xue and Annie Qu

References

Xue, F., and Qu, A. (2021) Integrating Multisource Block-Wise Missing Data in Model Selection (2021), Journal of the American Statistical Association, Vol. 116(536), 1914-1927.

Examples


library(MASS)

# Number of subjects
n <- 30

# Number of total covariates
p <- 4

# Number of missing groups of subjects
ngroup <- 2

# Number of data sources
nsource <- 2

# Starting indexes of covariates in data sources
cov_index=c(1, 3)

# Starting indexes of subjects in missing groups
sub_index=c(1, 16)

# Indexes of missing data sources in missing groups, respectively ('NULL' represents no missing)
miss_source=list(NULL, 1)

# Indicator of whether there is a group of complete cases. If there is a group of complete cases,
# it should be the first group.
complete=TRUE

# Create a block-wise missing design matrix X and response vector y
set.seed(1)
sigma=diag(1-0.4,p,p)+matrix(0.4,p,p)
X <- mvrnorm(n,rep(0,p),sigma)
beta_true <- c(2.5, 0, 3, 0)
y <- rnorm(n) + X%*%beta_true

for (i in 1:ngroup) {
  if (!is.null(miss_source[[i]])) {
    if (i==ngroup) {
      if (miss_source[[i]]==nsource) {
        X[sub_index[i]:n, cov_index[miss_source[[i]]]:p] = NA
      } else {
        X[sub_index[i]:n, cov_index[miss_source[[i]]]:(cov_index[miss_source[[i]]+1]-1)] = NA
      }
    } else {
      if (miss_source[[i]]==nsource) {
        X[sub_index[i]:(sub_index[i+1]-1), cov_index[miss_source[[i]]]:p] = NA
      } else {
        X[sub_index[i]:(sub_index[i+1]-1), cov_index[miss_source[[i]]]:
        (cov_index[miss_source[[i]]+1]-1)] = NA
      }
    }
  }
}

# Now we can use the function with this simulated data
#start.time = proc.time()
result <- MBI(X=X, y=y, cov_index=cov_index, sub_index=sub_index, miss_source=miss_source,
complete=complete, nlam = 15, eps2 = 1e-3, h1=2^(-(8:20)))
#time = proc.time() - start.time

theta=result$beta
bic1=result$bic1
best=which.min(bic1[bic1!=0])
beta_est=theta[best,]

[Package BlockMissingData version 0.1.0 Index]