bvs {BVSNLP}  R Documentation 
High dimensional Bayesian variable selection using nonlocal priors
Description
This function performs Bayesian variable selection for high dimensional design matrix using iMOM prior for nonzero coefficients. It also performs adaptive hyperparameter selection for iMOM prior. Cleaning the data in a preprocessing step and before any data analysis is done by user preference. This function is for binary and survival time response datasets. In the former, MCMC is used to search in the model space while for the latter a stochastic search does that job. This function has the option to do all the mentioned tasks in a parallel fashion, exploiting hundreds of CPUs. It is highly recommended to use a cluster for this purpose. This function also supports fixing covariates in variable selection process, thus making them included in the final selected model with probability 1. Categorical variable are also supported by this function as input covariates to the selection process. They need to be well defined factor variables as part of the input data frame. For the output, this function reports necessary measurements that is common in Bayesian variable selection algorithms. They include Highest Posterior Probability model, median probability model and posterior inclusion probability for each of the covariates in the design matrix.
Usage
bvs(
X,
resp,
prep = TRUE,
logT = FALSE,
fixed_cols = NULL,
eff_size = 0.5,
family = c("logistic", "survival"),
hselect = TRUE,
nlptype = "piMOM",
r = 1,
tau = 0.25,
niter = 30,
mod_prior = c("unif", "beta"),
inseed = NULL,
cplng = FALSE,
ncpu = 4,
parallel.MPI = FALSE
)
Arguments
X 
The 
resp 
For logistic regression models it is the binary response vector which could be either numeric or factor variable in R. For the Cox proportional hazard models this is a two column matrix where the first column contains survival time vector and the second column is the censoring status for each observation. 
prep 
A boolean variable determining if the preprocessing step should
be performed on the design matrix or not. That step contains removing
columns that have 
logT 
A boolean variable determining if log transform should be done on continuous columns before scaling them in the preprocessing step. Note that those columns should not contain any zeros or negative values. 
fixed_cols 
A vector of indices showing the columns in the input
data frame that are not subject to the the selection procedure. These
columns are always in the final selected model. Note that if any of these
columns contain 
eff_size 
This is the expected effect size in the model for a standardized design matrix, which is basically the coefficient value that is expected to occur the most based on some prior knowledge. 
family 
Determines the type of data analysis. 
hselect 
A boolean variable indicating whether the automatic procedure
for hyperparameter selection should be run or not. The default value is

nlptype 
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. 
r 
The paramter 
tau 
The paramter 
niter 
Number of iterations. For binary response data, this determines the number of MCMC iterations per CPU. For survival response data this is the number of iterations per temperature schedule in the stochastic search algorithm. 
mod_prior 
Type of prior used for the model space. 
inseed 
The input seed for making the parallel processing
reproducible. This parameter is ignored in logistic regression models when

cplng 
This parameter is only used in logistic regression models, and indicating if coupling algorithm for MCMC output should be performed or not. 
ncpu 
This is the number of cpus used in parallel processing. For logistic regression models this is the number of parallel coupled chains run at the same time. For survival outcome data this is the number of cpus doing stochastic search at the same time to increase th enumber of visited models. 
parallel.MPI 
A boolean variable determining if MPI is used for
parallel processing or not. Note that in order to use this feature, your
system should support MPI and 
Value
It returns a list containing different objects that depend on the
family of the model and the coupling flag for logistic regression models.
The following describes the objects in the output list based on different
combinations of those two input arguments.
1) family = logistic && cplng = FALSE
num_vis_models 
Number of unique models visited throughout the search of the model space. 
max_prob 
Maximum unnormalized probability among all visited models 
HPM 
The indices of the model with highest posterior
probability among all visited models, with respect to the columns in
the output 
beta_hat 
The coefficient vector for the selected model. The first component is always for the intercept. 
MPM 
The indices of median probability model. According to the paper
Barbieri et. al., this is defined to be the model consisting of those
variables whose posterior inclusion probability is at least 1/2. The order
of columns is similar to that is explained for 
max_prob_vec 
A 
max_models 
A list containing models corresponding to

inc_probs 
A vector of length 
nlptype 
The type of nonlocal prior used in the analyses. 
des_mat 
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if 
gene_names 
Names of the genes extracted from the design matrix. 
r 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
tau 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
2) family = logistic && cplng = TRUE
cpl_percent 
Shows what percentage of pairs of chains are coupled. 
margin_probs 
A 
chains 
A 
nlptype 
The type of nonlocal prior used in the analyses. 
cpl_flags 
A 
beta_hat 
A 
uniq_models 
A list showing unique models with the indices of the included covariates at each model. 
freq 
Frequency of each of the unique models. It is used to find the highest frquency model. 
probs 
Unnormalized probability of each of the unique models. 
des_mat 
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if 
gene_names 
Names of the genes extracted from the design matrix. 
r 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
tau 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
3) family = survival
num_vis_models 
Number of visited models during the whole process. 
max_prob 
The unnormalized probability of the maximum model among all visited models. 
HPM 
The indices of the model with highest posterior
probability among all visited models, with respect to the columns in

MPM 
The indices of median probability model. According to the paper
Barbieri et. al., this is defined to be the model consisting of those
variables whose posterior inclusion probability is at least 1/2. The order
of columns is similar to that is explained for 
beta_hat 
The coefficient vector for the selected model reported in

max_prob_vec 
A 
max_models 
A list containing models corresponding to

inc_probs 
A 
nlptype 
The type of nonlocal prior used in the analyses. 
des_mat 
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if 
start_models 
A 
gene_names 
Names of the genes extracted from the design matrix. 
r 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
tau 
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. 
Author(s)
Amir Nikooienejad
References
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using nonlocal priors. Bioinformatics, 32(9), 13381345.
Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
selection for survival data using inverse moment priors. Annals of Applied
Statistics, 14(2), 809828.
Johnson, V. E. (1998). A couplingregeneration scheme for
diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of
the American Statistical Association, 93(441), 238248.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2017). Scalable Bayesian
variable selection using nonlocal prior densities in ultrahigh dimensional
settings. Statistica Sinica.
Johnson, V. E., and Rossell, D. (2010). On the use of nonlocal prior
densities in Bayesian hypothesis tests. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(2), 143170.
Barbieri, M. M., and Berger, J. O. (2004). Optimal predictive model
selection. The annals of statistics, 32(3), 870897.
See Also
Examples
### Simulating Logistic Regression Data
n < 200
p < 40
set.seed(123)
Sigma < diag(p)
full < matrix(c(rep(0.5, p*p)), ncol=p)
Sigma < full + 0.5*Sigma
cholS < chol(Sigma)
Beta < c(1.9,1.3,2.2)
X < matrix(rnorm(n*p), ncol=p)
X < X%*%cholS
beta < numeric(p)
beta[c(1:length(Beta))] < Beta
XB < X%*%beta
probs < as.vector(exp(XB)/(1+exp(XB)))
y < rbinom(n,1,probs)
colnames(X) < paste("gene_",c(1:p),sep="")
X < as.data.frame(X)
### Running 'bvs' function without coupling and with hyperparamter selection
### procedure
bout < bvs(X, y, family = "logistic", nlptype = "piMOM",
mod_prior = "beta", niter = 50)
### Highest Posterior Model
bout$HPM
### Estimated Coefficients:
bout$beta_hat
### Number of Visited Models:
bout$num_vis_models