eforest {etree} | R Documentation |
Energy Forests
Description
Fits an Energy Forest, in the form of either a bagging of Energy Trees or a
Random Energy Forest, depending on the value of the random_covs
parameter.
Usage
eforest(
response,
covariates,
weights = NULL,
ntrees = 100,
ncores = 1L,
minbucket = 1,
alpha = 1,
R = 500,
split_type = "cluster",
coeff_split_type = "test",
p_adjust_method = "fdr",
perf_metric = NULL,
random_covs = "auto",
verbose = FALSE
)
Arguments
response |
Response variable, an object of class either
|
covariates |
Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
weights |
Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1. |
ntrees |
Number of Energy Trees to grow, i.e., the number of bootstrap samples to be generated and used for fitting. |
ncores |
Number of cores to use, i.e., at most how many child processes
will be run simultaneously. Must be exactly 1 on Windows (which uses the
master process). |
minbucket |
Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 1. |
alpha |
Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 1. |
R |
Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 500. |
split_type |
Splitting method used when the selected covariate is
structured. It has two possible values: |
coeff_split_type |
Method to select the split point for the chosen
component when the selected covariate is structured and |
p_adjust_method |
Multiple-testing adjustment method for P-values,
which can be set to any of the values provided by
|
perf_metric |
Performance metric that is used to compute the Out-Of-Bag
score. If |
random_covs |
Size of the random subset of covariates to choose from at
each split. If set to |
verbose |
Logical indicating whether to print a one-line notification for the conclusion of each tree's fitting process. |
Details
eforest() generates ntrees bootstrap samples and then calls etree() on each
of them. Then, it computes the Out-Of-Bag (OOB) score using the performance
metric defined through perf_metric.
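The bootstrap-and-OOB idea described above can be sketched in base R. This is a generic illustration, not the internals of eforest(): the names boot_idx, oob_idx, tree_pred and oob_pred are hypothetical, and each "tree" is replaced by a stand-in that simply predicts the in-bag mean of the response. How eforest() actually aggregates OOB predictions is not stated here and is assumed to follow the usual convention.

```r
set.seed(1)
nobs   <- 30   # hypothetical number of observations
ntrees <- 25   # hypothetical number of trees
y      <- rnorm(nobs)

# One bootstrap sample per tree: indices drawn with replacement
boot_idx <- lapply(seq_len(ntrees), function(b) {
  sample.int(nobs, size = nobs, replace = TRUE)
})

# OOB observations of a tree: those never drawn in its bootstrap sample
oob_idx <- lapply(boot_idx, function(idx) setdiff(seq_len(nobs), idx))

# Stand-in "tree" (assumption): predicts the mean of its in-bag responses
tree_pred <- vapply(boot_idx, function(idx) mean(y[idx]), numeric(1))

# OOB prediction of an observation: average over the trees that did not
# see it during fitting (NA if it was in-bag for every tree)
oob_pred <- vapply(seq_len(nobs), function(i) {
  is_oob <- vapply(oob_idx, function(o) i %in% o, logical(1))
  if (any(is_oob)) mean(tree_pred[is_oob]) else NA_real_
}, numeric(1))
```

The OOB score is then obtained by comparing oob_pred with y through the chosen performance metric.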
For classification, possible values of perf_metric are "BAcc" and "WBAcc".
Both are general enough to be used in multiclass classification problems,
while still producing sensible results in binary classification. Both options
build on a base performance metric, the Balanced Accuracy, defined as the
arithmetic mean of Sensitivity and Specificity. In this framework, Balanced
Accuracy is computed using a "One vs. All" approach, i.e., considering one
class at a time: positive instances are those belonging to that class, and
negatives are those belonging to any other class. The "One vs. All" Balanced
Accuracies obtained for each class are then averaged. When
perf_metric = "BAcc" (default for classification tasks), the average is
arithmetic. When perf_metric = "WBAcc", the average is weighted by class
sizes, hence giving more importance to the "One vs. All" Balanced Accuracy of
larger classes.
For regression, the default value of perf_metric is "RMSPE", namely, Root
Mean Square Percentage Error. Other available options are
c("MAE", "MAPE", "MedianAE", "MedianAPE", "MSE", "NRMSE", "RAE", "RMSE",
"RMLSE"). Each of these names refers to the function of the same name in the
package MLmetrics, whose documentation provides more information about their
definitions.
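The metrics described above can be sketched in base R as follows. This is an illustration written from the definitions in this section, not the code used by eforest(): ova_bacc() and rmspe_sketch() are hypothetical helpers, and the RMSPE formula is assumed to be the usual square root of the mean squared relative error.

```r
# "One vs. All" Balanced Accuracy for each class (hypothetical helper)
ova_bacc <- function(y_true, y_pred) {
  classes <- levels(y_true)
  vapply(classes, function(cl) {
    pos  <- y_true == cl
    sens <- mean(y_pred[pos] == cl)    # Sensitivity: positives recovered
    spec <- mean(y_pred[!pos] != cl)   # Specificity: negatives recovered
    (sens + spec) / 2
  }, numeric(1))
}

# Toy three-class labels and predictions (made up for illustration)
y_true <- factor(c("a", "a", "a", "a", "b", "b", "c", "c"))
y_pred <- factor(c("a", "a", "a", "b", "b", "b", "c", "a"),
                 levels = levels(y_true))

per_class <- ova_bacc(y_true, y_pred)

# "BAcc": arithmetic average of the per-class values
bacc <- mean(per_class)

# "WBAcc": average weighted by class sizes
wbacc <- weighted.mean(per_class, w = as.numeric(table(y_true)))

# Root Mean Square Percentage Error (hypothetical helper, assumed formula)
rmspe_sketch <- function(y_true, y_pred) {
  sqrt(mean(((y_true - y_pred) / y_true) ^ 2))
}

rmspe <- rmspe_sketch(c(1, 2, 4), c(1.1, 1.8, 4.4))
```

Because RMSPE divides by the observed response, it becomes unstable when some responses are close to zero; in that case one of the other listed metrics may be preferable.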
Value
Object of class "eforest" with three elements: 1) ensemble, which is a list
gathering all the fitted trees; 2) oob_score, an object of class "numeric"
representing the OOB score computed using the performance metric defined
through perf_metric; 3) perf_metric, an object of class "character" returning
the performance metric used for computations.
Examples
## Covariates
set.seed(123)
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, length.out = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)
## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)
## Regression ##
eforest_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[1]])
plot(eforest_fit$ensemble[[1]])
mean((resp_reg - predict(eforest_fit)) ^ 2)
## Classification ##
eforest_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[12]])
plot(eforest_fit$ensemble[[12]])
table(resp_cls, predict(eforest_fit))