var_selection_by_permute {bartMachine}    R Documentation

Perform Variable Selection using Three Threshold-based Procedures

Description

Performs variable selection using the three thresholding methods introduced in Bleich et al. (2013).

Usage

var_selection_by_permute(bart_machine, 
num_reps_for_avg = 10, num_permute_samples = 100, 
num_trees_for_permute = 20, alpha = 0.05, 
plot = TRUE, num_var_plot = Inf, bottom_margin = 10)

Arguments

bart_machine

An object of class “bartMachine”.

num_reps_for_avg

Number of replicates to average over for the BART model's variable inclusion proportions.

num_permute_samples

Number of permutations of the response to be made to generate the “null” permutation distribution.

num_trees_for_permute

Number of trees to use in the variable selection procedure. As with investigate_var_importance, a small number of trees should be used to force variables to compete for entry into the model. Note that this number is used to estimate both the “true” and “null” variable inclusion proportions.

alpha

Cut-off level for the thresholds.

plot

If TRUE, a plot showing which variables are selected by each of the procedures is generated.

num_var_plot

Number of variables (in order of decreasing variable inclusion proportion) to be plotted.

bottom_margin

A display parameter that adjusts the bottom margin of the graph if labels are clipped. The scale of this parameter is the same as that set with par(mar = c(...)) in R. Higher values allow more space if the covariate names are long. Note that making this parameter too large will prevent plotting, and the plot function in R will throw an error.
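
How num_permute_samples shapes the “null” distribution can be sketched in base R. This is an illustration only, under loud assumptions: toy_inclusion_props is a hypothetical stand-in (normalized absolute correlations) for the BART variable inclusion proportions and is not what bartMachine computes. Because the stand-in is deterministic, the averaging controlled by num_reps_for_avg is not shown.

```r
# Toy illustration only: `toy_inclusion_props` is a hypothetical stand-in
# for BART variable inclusion proportions (NOT what bartMachine computes).
toy_inclusion_props = function(X, y) {
  scores = abs(cor(X, y))[, 1]   # one non-negative score per predictor
  scores / sum(scores)           # normalize so the proportions sum to 1
}

set.seed(11)
n = 300
p = 20
X = data.frame(matrix(runif(n * p), ncol = p))
y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + rnorm(n)

num_permute_samples = 100

# "True" proportions, computed from the observed response
var_true_props = toy_inclusion_props(X, y)

# "Null" proportions: permuting y breaks any link between X and y
permute_mat = t(replicate(num_permute_samples,
                          toy_inclusion_props(X, sample(y))))

dim(permute_mat)  # one row per permutation, one column per predictor
```

In the real procedure each row would come from a BART model with num_trees_for_permute trees fit to a permuted response, so the cost grows with num_permute_samples.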

Details

See Bleich et al. (2013) for a complete description of the procedures outlined above as well as the corresponding vignette for a brief summary with examples.
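
As a rough guide to how the three thresholds differ, the sketch below applies them to toy inputs shaped like the returned permute_mat and var_true_props_avg. It is not the package's implementation: select_by_thresholds is a hypothetical helper, the toy data are simulated, the Global SE multiplier is taken directly as a quantile rather than found by search, and every column of permute_mat is assumed to have a positive standard deviation.

```r
# Hypothetical helper (not part of bartMachine): applies the Local,
# Global Max and Global SE rules to a permutations-by-variables matrix.
select_by_thresholds = function(var_true_props, permute_mat, alpha = 0.05) {
  # Local: compare each variable with the 1 - alpha quantile of its own null column
  local_cutoffs = apply(permute_mat, 2, quantile, probs = 1 - alpha)

  # Global Max: one cutoff, the 1 - alpha quantile of the per-permutation maximum
  global_max_cutoff = unname(quantile(apply(permute_mat, 1, max), probs = 1 - alpha))

  # Global SE: cutoff m_k + C * s_k, with C chosen so that a fraction 1 - alpha
  # of the permutations fall below the cutoff for all variables at once
  m = colMeans(permute_mat)
  s = apply(permute_mat, 2, sd)
  c_needed = apply(permute_mat, 1, function(row) max((row - m) / s))
  c_star = unname(quantile(c_needed, probs = 1 - alpha))
  global_se_cutoffs = m + c_star * s

  list(
    local = which(var_true_props > local_cutoffs),
    global_max = which(var_true_props > global_max_cutoff),
    global_se = which(var_true_props > global_se_cutoffs)
  )
}

# Simulated toy inputs (assumption: not produced by BART)
set.seed(1)
p = 6
num_permute_samples = 200
raw = matrix(rexp(num_permute_samples * p), ncol = p)
permute_mat = raw / rowSums(raw)            # each row sums to 1
colnames(permute_mat) = paste0("X", 1:p)
var_true_props = c(X1 = 0.90, X2 = 0.02, X3 = 0.02,
                   X4 = 0.02, X5 = 0.02, X6 = 0.02)

sel = select_by_thresholds(var_true_props, permute_mat)
print(sel)
```

The Local rule judges each variable against its own null distribution, while the two global rules control all variables simultaneously, so they tend to select fewer variables.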

Value

Invisibly, returns a list with the following components:

important_vars_local_names

Names of the variables chosen by the Local procedure.

important_vars_global_max_names

Names of the variables chosen by the Global Max procedure.

important_vars_global_se_names

Names of the variables chosen by the Global SE procedure.

important_vars_local_col_nums

Column numbers of the variables chosen by the Local procedure.

important_vars_global_max_col_nums

Column numbers of the variables chosen by the Global Max procedure.

important_vars_global_se_col_nums

Column numbers of the variables chosen by the Global SE procedure.

var_true_props_avg

The variable inclusion proportions for the actual data.

permute_mat

The permutation distribution generated by permuting the response vector.
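
Because the list is returned invisibly, it has to be assigned to be used. The sketch below shows one way the components might be combined afterwards. The var_sel object here is a hand-made stand-in that only mimics the component names listed above; its contents are invented for illustration and are not output from bartMachine.

```r
# Hand-made stand-in mimicking the returned list (contents are invented);
# in practice: var_sel = var_selection_by_permute(bart_machine)
var_sel = list(
  important_vars_local_names = c("X1", "X2", "X4", "X5", "X9"),
  important_vars_global_max_names = c("X1", "X4"),
  important_vars_global_se_names = c("X1", "X2", "X4"),
  important_vars_global_se_col_nums = c(1, 2, 4)
)

# Variables on which all three procedures agree
chosen_by_all = Reduce(intersect, list(
  var_sel$important_vars_local_names,
  var_sel$important_vars_global_max_names,
  var_sel$important_vars_global_se_names
))

# Variables picked up by the Local procedure only
local_only = setdiff(
  var_sel$important_vars_local_names,
  union(var_sel$important_vars_global_max_names,
        var_sel$important_vars_global_se_names)
)

# Column numbers can be used to build a reduced design matrix
set.seed(11)
X = data.frame(matrix(runif(300 * 20), ncol = 20))
X_selected = X[, var_sel$important_vars_global_se_col_nums, drop = FALSE]
# A model could then be refit on X_selected, e.g. bartMachine(X_selected, y)
```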

Note

Although the reference only explores regression settings, this procedure is applicable to both regression and classification problems. This function is parallelized by the number of cores set in set_bart_machine_num_cores.

Author(s)

Adam Kapelner and Justin Bleich

References

J Bleich, A Kapelner, ST Jensen, and EI George. Variable Selection Inference for Bayesian Additive Regression Trees. ArXiv e-prints, 2013.

Adam Kapelner, Justin Bleich (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi:10.18637/jss.v070.i04

See Also

var_selection_by_permute_cv, investigate_var_importance

Examples

## Not run: 
# generate Friedman data
set.seed(11)
n = 300
p = 20 # 5 informative predictors plus 15 useless ones
X = data.frame(matrix(runif(n * p), ncol = p))
y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 + 10 * X[, 4] + 5 * X[, 5] + rnorm(n)

# build BART regression model (not actually used in variable selection)
bart_machine = bartMachine(X, y)

# variable selection; the result is returned invisibly, so assign it
var_sel = var_selection_by_permute(bart_machine)
print(var_sel$important_vars_local_names)
print(var_sel$important_vars_global_max_names)
print(var_sel$important_vars_global_se_names)

## End(Not run)
  

[Package bartMachine version 1.3.4.1 Index]