bartMachine {bartMachine} | R Documentation |

Builds a BART model for regression or classification.

    bartMachine(X = NULL, y = NULL, Xy = NULL,
      num_trees = 50,
      num_burn_in = 250,
      num_iterations_after_burn_in = 1000,
      alpha = 0.95, beta = 2, k = 2, q = 0.9, nu = 3,
      prob_rule_class = 0.5,
      mh_prob_steps = c(2.5, 2.5, 4)/9,
      debug_log = FALSE,
      run_in_sample = TRUE,
      s_sq_y = "mse",
      sig_sq_est = NULL,
      cov_prior_vec = NULL,
      use_missing_data = FALSE,
      covariates_to_permute = NULL,
      num_rand_samps_in_library = 10000,
      use_missing_data_dummies_as_covars = FALSE,
      replace_missing_data_with_x_j_bar = FALSE,
      impute_missingness_with_rf_impute = FALSE,
      impute_missingness_with_x_j_bar_for_lm = TRUE,
      mem_cache_for_speed = TRUE,
      flush_indices_to_save_RAM = TRUE,
      serialize = FALSE,
      seed = NULL,
      verbose = TRUE)

    build_bart_machine(X = NULL, y = NULL, Xy = NULL,
      num_trees = 50,
      num_burn_in = 250,
      num_iterations_after_burn_in = 1000,
      alpha = 0.95, beta = 2, k = 2, q = 0.9, nu = 3,
      prob_rule_class = 0.5,
      mh_prob_steps = c(2.5, 2.5, 4)/9,
      debug_log = FALSE,
      run_in_sample = TRUE,
      s_sq_y = "mse",
      sig_sq_est = NULL,
      cov_prior_vec = NULL,
      use_missing_data = FALSE,
      covariates_to_permute = NULL,
      num_rand_samps_in_library = 10000,
      use_missing_data_dummies_as_covars = FALSE,
      replace_missing_data_with_x_j_bar = FALSE,
      impute_missingness_with_rf_impute = FALSE,
      impute_missingness_with_x_j_bar_for_lm = TRUE,
      mem_cache_for_speed = TRUE,
      flush_indices_to_save_RAM = TRUE,
      serialize = FALSE,
      seed = NULL,
      verbose = TRUE)

`X` |
Data frame of predictors. Factors are automatically converted to dummies internally. |

`y` |
Vector of response variable. If |

`Xy` |
A data frame of predictors and the response. The response column must be named “y”. |

`num_trees` |
The number of trees to be grown in the sum-of-trees model. |

`num_burn_in` |
Number of MCMC samples to be discarded as “burn-in”. |

`num_iterations_after_burn_in` |
Number of MCMC samples to draw from the posterior distribution of |

`alpha` |
Base hyperparameter in tree prior for whether a node is nonterminal or not. |

`beta` |
Power hyperparameter in tree prior for whether a node is nonterminal or not. |

`k` |
For regression, |

`q` |
Quantile of the prior on the error variance at which the data-based estimate is placed. Note that the larger the value of |

`nu` |
Degrees of freedom for the inverse |

`prob_rule_class` |
Threshold for classification. Any observation with a conditional probability greater than |

`mh_prob_steps` |
Vector of prior probabilities for proposing changes to the tree structures: (GROW, PRUNE, CHANGE) |

`debug_log` |
If TRUE, additional information about the model construction is printed to a file in the working directory. |

`run_in_sample` |
If TRUE, in-sample statistics such as |

`s_sq_y` |
If “mse”, a data-based estimate of the error variance is computed as the MSE from ordinary least squares regression. If “var”, the data-based estimate is computed as the variance of the response. Not used in classification. |

`sig_sq_est` |
Pass in an estimate of the maximum sig_sq of the model. This is useful to cache somewhere and then pass in during cross-validation since the default method of estimation is a linear model. In large dimensions, linear model estimation is slow. |

`cov_prior_vec` |
Vector assigning relative weights to how often a particular variable should be proposed as a candidate for a split. The vector is internally normalized so that the weights sum to 1. Note that the length of this vector must equal the length of the design matrix after dummification and augmentation of indicators of missingness (if used). To see what the dummified matrix looks like, use |

`use_missing_data` |
If TRUE, the missing data feature is used to automatically handle missing data without imputation. See Kapelner and Bleich (2013) for details. |

`covariates_to_permute` |
Private argument for |

`num_rand_samps_in_library` |
Before building a BART model, samples from the Standard Normal and |

`use_missing_data_dummies_as_covars` |
If TRUE, additional indicator variables for whether or not an observation in a particular column is missing are included. See Kapelner and Bleich (2013) for details. |

`replace_missing_data_with_x_j_bar` |
If TRUE, missing entries in |

`impute_missingness_with_rf_impute` |
If TRUE, missing entries are filled in using the rf.impute() function from the |

`impute_missingness_with_x_j_bar_for_lm` |
If TRUE, when computing the data-based estimate of |

`mem_cache_for_speed` |
Speed enhancement that caches the predictors and the split values that are available at each node for selecting new rules. If the number of predictors is large, the memory requirements become large. We recommend keeping this on (default) and turning it off if you experience out-of-memory errors. |

`flush_indices_to_save_RAM` |
Setting this flag to |

`serialize` |
Setting this option to |

`seed` |
Optional: sets the seed in both R and Java. Default is |

`verbose` |
Prints information about progress of the algorithm to the screen. |
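
As a rough illustration of how a few of these arguments translate into numbers, the sketch below uses only base R. It assumes the tree prior of Chipman et al. (2010), in which a node at depth d is nonterminal with probability alpha * (1 + d)^(-beta). The helper name `prob_nonterminal` and the example weights are hypothetical and are not part of the package.

```r
# Prior probability that a node at depth d is nonterminal,
# assuming the Chipman et al. (2010) form alpha * (1 + d)^(-beta).
prob_nonterminal = function(d, alpha = 0.95, beta = 2) {
  alpha * (1 + d)^(-beta)
}
depths = 0:3
split_probs = prob_nonterminal(depths)
print(round(split_probs, 4))

# Default proposal probabilities for (GROW, PRUNE, CHANGE); they sum to 1.
mh_prob_steps = c(2.5, 2.5, 4) / 9
names(mh_prob_steps) = c("GROW", "PRUNE", "CHANGE")
print(mh_prob_steps)

# Hypothetical relative weights for five predictors. The documentation says
# cov_prior_vec is normalized internally; this rescaling is what that
# normalization would amount to.
cov_prior_vec = c(2, 1, 1, 1, 5)
cov_prior_normalized = cov_prior_vec / sum(cov_prior_vec)
print(cov_prior_normalized)
```

With the defaults, the split probability falls quickly with depth, which is why the prior favors shallow trees.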

Returns an object of class “bartMachine”. The “bartMachine” object contains a list of the following components:

`java_bart_machine` |
A pointer to the BART Java object. |

`train_data_features` |
The names of the variables used in the training data. |

`training_data_features_with_missing_features` |
The names of the variables used in the training data. If |

`y` |
The values of the response for the training data. |

`y_levels` |
The levels of the response (for classification only). |

`pred_type` |
Whether the model was built for regression or classification. |

`model_matrix_training_data` |
The training data with factors converted to dummies. |

`num_cores` |
The number of cores used to build the BART model. |

`sig_sq_est` |
The data-based estimate of |

`time_to_build` |
Total time to build the BART model. |

`y_hat_train` |
The posterior means of |

`residuals` |
The model residuals given by |

`L1_err_train` |
L1 error on the training set. Only returned if |

`L2_err_train` |
L2 error on the training set. Only returned if |

`PseudoRsq` |
Calculated as 1 - SSE / SST where SSE is the sum of squared errors in the training data and SST is the sample variance of the response times |

`rmse_train` |
Root mean square error on the training set. Only returned if |

Additionally, the parameters passed to the function `bartMachine` are also components of the list.
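
The in-sample error components can be reproduced by hand from `y` and `y_hat_train`. The base R sketch below uses made-up numbers, not model output, and makes several assumptions: residuals are taken as `y - y_hat_train`, the L1 and L2 errors are taken as sums (not means) of absolute and squared residuals, and SST is taken as the total sum of squares about the mean of the response.

```r
# Toy response and fitted values (hypothetical numbers, not model output).
y = c(3.0, 5.0, 7.0, 9.0)
y_hat_train = c(2.5, 5.5, 7.0, 8.0)

residuals = y - y_hat_train               # assumed definition of the residuals
L1_err_train = sum(abs(residuals))        # assumed: sum of absolute errors
L2_err_train = sum(residuals^2)           # assumed: sum of squared errors (SSE)
rmse_train = sqrt(L2_err_train / length(y))

# SST taken here as the total sum of squares about the mean of y (assumption).
SST = sum((y - mean(y))^2)
PseudoRsq = 1 - L2_err_train / SST

print(c(L1 = L1_err_train, L2 = L2_err_train,
        rmse = rmse_train, PseudoRsq = PseudoRsq))
```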

This function is parallelized by the number of cores set by `set_bart_machine_num_cores`. Each core will create an independent MCMC chain of size `num_burn_in` + `num_iterations_after_burn_in / bart_machine_num_cores`.
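
The chain-size formula can be checked with a little arithmetic. The base R sketch below uses the default sampler settings and a hypothetical core count of 4; it only evaluates the formula stated above and does not call the package.

```r
num_burn_in = 250
num_iterations_after_burn_in = 1000
bart_machine_num_cores = 4   # hypothetical core count

# Each core runs its own burn-in, then draws its share of the posterior samples.
chain_size_per_core = num_burn_in +
  num_iterations_after_burn_in / bart_machine_num_cores
total_samples_drawn = bart_machine_num_cores * chain_size_per_core

print(chain_size_per_core)
print(total_samples_drawn)
```

Because the burn-in is repeated on every core, four cores draw 2000 samples in total compared with 1250 on a single core, but each individual chain is shorter.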

Adam Kapelner and Justin Bleich

Adam Kapelner, Justin Bleich (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi:10.18637/jss.v070.i04

HA Chipman, EI George, and RE McCulloch. BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1): 266–298, 2010.

A Kapelner and J Bleich. Prediction with Missing Data via Bayesian Additive Regression Trees. Canadian Journal of Statistics, 43(2): 224–239, 2015.

J Bleich, A Kapelner, ST Jensen, and EI George. Variable Selection Inference for Bayesian Additive Regression Trees. ArXiv e-prints, 2013.

    ##regression example

    ##generate Friedman data
    set.seed(11)
    n = 200
    p = 5
    X = data.frame(matrix(runif(n * p), ncol = p))
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - .5)^2 + 10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ##build BART regression model
    bart_machine = bartMachine(X, y)
    summary(bart_machine)

    ## Not run:
    ##Build another BART regression model
    bart_machine = bartMachine(X, y, num_trees = 200, num_burn_in = 500,
      num_iterations_after_burn_in = 1000)

    ##Classification example

    #get data and only use 2 factors
    data(iris)
    iris2 = iris[51:150, ]
    iris2$Species = factor(iris2$Species)

    #build BART classification model
    bart_machine = build_bart_machine(iris2[, 1:4], iris2$Species)

    ##get estimated probabilities
    phat = bart_machine$p_hat_train

    ##look at in-sample confusion matrix
    bart_machine$confusion_matrix

    ## End(Not run)
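
The examples above pass `X` and `y` separately. As a complementary sketch, the same Friedman data can be supplied through the `Xy` argument, whose response column must be named “y”. The model call is guarded so that the data preparation still runs when the package or Java is unavailable. The choice of one core, the seed value, and `verbose = FALSE` are illustrative assumptions, not recommendations.

```r
# Regenerate the Friedman data used in the regression example.
set.seed(11)
n = 200
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + rnorm(n)

# Combine predictors and response; the response column must be named "y".
Xy = cbind(X, y = y)

# Build the model only if bartMachine (and its Java backend) can be loaded.
if (requireNamespace("bartMachine", quietly = TRUE)) {
  library(bartMachine)
  set_bart_machine_num_cores(1)
  bart_machine = bartMachine(Xy = Xy, seed = 11, verbose = FALSE)
  # In-sample RMSE is available because run_in_sample defaults to TRUE.
  print(bart_machine$rmse_train)
}
```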

[Package *bartMachine* version 1.2.6 Index]