epx {EPX}    R Documentation

Fitting an Ensemble of Phalanxes

Description

epx forms phalanxes of variables from training data for binary classification with a rare class. The phalanxes are disjoint subsets of variables, each of which is fit with a base classifier. Together they form an ensemble.

Usage

epx(
  x,
  y,
  phalanxes.initial = c(1:ncol(x)),
  alpha = 0.95,
  nsim = 1000,
  rmin.target = 1,
  classifier = "random forest",
  classifier.args = list(),
  performance = "AHR",
  performance.args = list(),
  computing = "sequential",
  ...
)

Arguments

x

Explanatory variables (predictors, features) contained in a data frame.

y

Binary response variable vector (numeric or integer): 1 for the rare class, 0 for the majority class.

phalanxes.initial

Initial variable group indices; the default is one group per variable. Example: the vector c(1, 1, 2, 2, 3, ...) puts variables 1 and 2 in group 1, variables 3 and 4 in group 2, etc. Indices cannot be skipped, e.g., c(1, 3, 3, 4, 4, 3, 1) skips group 2 and is invalid.

alpha

Lower-tail probability for the critical quantile of the reference distribution of the performance measure for a classifier that ranks at random (i.e., the predictors have no explanatory power); default is 0.95.

nsim

Number of simulations for the reference empirical distribution of the performance measure; default is 1000.

rmin.target

Threshold for merging groups. At each step, the pair of groups with the minimum ratio of performance measures (ensemble of models to single model) is merged into a single group only if that ratio is less than rmin.target; otherwise merging stops. Default is 1.

classifier

Base classifier, one of c("random forest", "logistic regression", "neural network"); default is "random forest", which uses randomForest.

classifier.args

Arguments for the base classifier, specified in a list as follows: list(argName1 = value1, argName2 = value2, ...). If the list is empty, the classifier uses its defaults. For "random forest", the user may specify replace, cutoff, nodesize, maxnodes. For "logistic regression" there are no options. For "neural network", the user may specify size, trace.

performance

Performance assessment metric, one of c("AHR", "IE", "TOP1", "RKL"); default is "AHR".

performance.args

Arguments for the performance measure specified in a list as follows: list(argName1 = value1, argName2 = value2, ...). If the list is empty, the performance measure will use its defaults. Currently, only IE takes an argument list, and its only argument is cutoff.

computing

Whether to compute sequentially or in parallel. Input is one of c("sequential", "parallel"); default is "sequential".

...

Further arguments passed to or from other methods.
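The rule stated for phalanxes.initial (group indices must run 1, 2, ... with none skipped) can be checked before calling epx. The following is a minimal sketch in base R; the helper valid_initial_groups is hypothetical and not part of EPX, and epx may perform its own, possibly different, validation.

```r
# Hypothetical helper, not part of EPX: TRUE when the group indices are
# positive whole numbers that use every value from 1 to max(groups).
valid_initial_groups <- function(groups) {
  is.numeric(groups) &&
    length(groups) > 0 &&
    !anyNA(groups) &&
    all(groups == round(groups)) &&
    all(groups >= 1) &&
    setequal(groups, seq_len(max(groups)))
}

valid_initial_groups(c(1, 1, 2, 2, 3))        # groups 1, 2, 3 are all used
valid_initial_groups(c(1, 3, 3, 4, 4, 3, 1))  # group 2 is skipped
```

The first call corresponds to the valid example given for phalanxes.initial, the second to the invalid one.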

Details

See Tomal et al. (2015) for a full description of phalanx formation.
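The roles of alpha and nsim can be illustrated by simulating a reference distribution for a classifier that ranks at random. This is only a sketch under stated assumptions: ahr_simple is a simplified average hit rate that ignores ties, and both function names are hypothetical. It is not the code used inside epx.

```r
# Simplified average hit rate (assumption: ties in the scores are ignored).
# y is 0/1 with 1 = rare class; cases with higher scores are ranked first.
ahr_simple <- function(y, score) {
  hits <- y[order(score, decreasing = TRUE)]
  # Hit rate at each position, averaged over the positions of the hits.
  mean((cumsum(hits) / seq_along(hits))[hits == 1])
}

# Reference distribution for random ranking: draw nsim random score
# vectors, record the performance of each, and return the alpha quantile.
random_ranking_quantile <- function(y, alpha = 0.95, nsim = 1000) {
  sims <- replicate(nsim, ahr_simple(y, score = runif(length(y))))
  unname(quantile(sims, probs = alpha))
}

set.seed(1)
y <- rep(c(1, 0), times = c(5, 95))  # 5 rare-class cases out of 100
crit <- random_ranking_quantile(y, alpha = 0.95, nsim = 200)
crit  # alpha quantile of the simulated random-ranking distribution
```

A larger nsim gives a more stable estimate of the quantile at the cost of more computation; how epx applies the quantile during phalanx formation is described in Tomal et al. (2015).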

Value

Returns an object of class epx, which is a list containing the following components:

PHALANXES

List of four vectors, each the same length as the number of explanatory variables (columns in x): phalanxes.initial, phalanxes.filtered, phalanxes.merged, phalanxes.final. Each vector contains the phalanx membership indices of all explanatory variables at one of the four stages of phalanx-formation. Element i of a vector is the index of the phalanx to which variable i belongs. Phalanx 0 does not exist and so membership in phalanx 0 indicates that the variable does not belong to any phalanx; it has been screened out.

PHALANXES.FINAL.PERFORMANCE

Vector of performance measures of the final phalanxes: the first element is for phalanx 1, etc.

PHALANXES.FINAL.FITS

A matrix with number of rows equal to the number of observations in the training data and number of columns equal to the number of final phalanxes. Column i contains the predicted probabilities of class 1 from fitting the base classifier to the variables in phalanx i.

ENSEMBLED.FITS

The predicted probabilities of class 1 from the ensemble of phalanxes based on phalanxes.final.

BASE.CLASSIFIER.ARGS

(Parsed) record of user-specified arguments for classifier.

PERFORMANCE.ARGS

(Parsed) record of user-specified arguments for performance.

X

User-provided data frame of explanatory variables.

Y

User-provided binary response vector.
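A membership vector in the format described for PHALANXES can be turned into a list of variable names per phalanx. The sketch below uses made-up variable names and membership values for illustration; it does not use output from a real fit.

```r
# Hypothetical membership vector for six variables, in the format described
# for phalanxes.final; the names and values are made up for illustration.
var_names <- c("x1", "x2", "x3", "x4", "x5", "x6")
phalanxes_final <- c(1, 0, 2, 1, 0, 2)

# Variables in each surviving phalanx (index 0 = screened out, so dropped).
kept <- phalanxes_final > 0
members <- split(var_names[kept], phalanxes_final[kept])
members

# Variables that were screened out.
screened_out <- var_names[!kept]
screened_out
```

With a fitted model, the same steps would presumably apply to names(model$X) and model$PHALANXES$phalanxes.final, assuming the list elements are named as in the component description above.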

References

Tomal, J. H., Welch, W. J., & Zamar, R. H. (2015). Ensembling classification models based on phalanxes of variables with applications in drug discovery. The Annals of Applied Statistics, 9(1), 69-93. doi: 10.1214/14-AOAS778

See Also

summary.epx prints a summary of the results, and cv.epx assesses performance via cross-validation.

Examples

# Example with data(harvest)

## Phalanx-formation using a base classifier with 50 trees (default = 500)

set.seed(761)
model <- epx(x = harvest[, -4], y = harvest[, 4],
             classifier.args = list(ntree = 50))

## Phalanx-membership of explanatory variables at the four stages
## of phalanx formation (0 means not in a phalanx)
model$PHALANXES

## Summary of the final phalanxes (matches above)
summary(model)

## Not run:
## Parallel computing
clusters <- parallel::detectCores()
cl <- parallel::makeCluster(clusters)
doParallel::registerDoParallel(cl)
set.seed(761)
model.par <- epx(x = harvest[, -4], y = harvest[, 4],
                 computing = "parallel")
parallel::stopCluster(cl)

## End(Not run)


[Package EPX version 1.0.4 Index]