ModelTrain {chemmodlab}    R Documentation

Fit predictive models to sets of descriptors.

Description

ModelTrain is a generic S3 function that fits a series of classification or regression models to sets of descriptors and computes cross-validated measures of model performance.

Usage

ModelTrain(...)

## Default S3 method:
ModelTrain(
  x,
  y,
  nfolds = 10,
  nsplits = 3,
  seed.in = NA,
  des.names = NA,
  models = c("NNet", "PLS", "LAR", "Lasso", "PLSLDA", "Tree", "SVM", "KNN", "RF"),
  user.params = NULL,
  verbose = FALSE,
  ...
)

## S3 method for class 'data.frame'
ModelTrain(
  d,
  ids = FALSE,
  xcol.lengths = ifelse(ids, length(d) - 2, length(d) - 1),
  xcols = NA,
  nfolds = 10,
  nsplits = 3,
  seed.in = NA,
  des.names = NA,
  models = c("NNet", "PLS", "LAR", "Lasso", "PLSLDA", "Tree", "SVM", "KNN", "RF"),
  user.params = NULL,
  verbose = FALSE,
  ...
)

Arguments

...

Additional parameters.

x

a list of numeric descriptor set matrices. At the moment, only binary and continuous descriptors are supported. Binary descriptors should be numeric (0 or 1).

y

a numeric vector containing the binary or continuous response.

nfolds

the number of folds to use for each cross validation split.

nsplits

the number of splits to use for repeated cross validation.

seed.in

a numeric vector with length equal to nsplits. The seeds are used to randomly assign folds to observations for each repeated cross-validation split. If NA, the first seed will be 11111, the second will be 22222, and so on.

des.names

a character vector specifying the names for each descriptor set. The length of the vector must match the number of descriptor sets. If NA, each descriptor set will be named "Descriptor Set i", where i is the number of the descriptor set.

models

a character vector specifying the regression or classification models to use. The strings must match models implemented in 'chemmodlab' (see Details).

user.params

a list of data frames where each data frame contains the parameter values for a model. The list should have the format of the list constructed by MakeModelDefaults. One can construct a list of parameters using MakeModelDefaults and then modify the parameters.

verbose

a logical. Should ModelTrain run in verbose mode?

d

a data frame containing an (optional) ID column, a response column, and descriptor columns. The columns should be provided in this order.

ids

a logical. Is an ID column provided?

xcol.lengths

a vector of integers. It is assumed that the columns in d are grouped by descriptor set. The integers specify the number of descriptors in each descriptor set. They should be ordered as the descriptor sets are ordered in d. Users can specify multiple descriptor sets. By default there is one descriptor set, namely all columns in d except the response column and the optional ID column. Specify xcol.lengths or xcols, but not both.

xcols

a list of integer vectors. Each vector contains the column indices of d where a set of descriptor variables is located. Users can specify multiple descriptor sets. Specify xcol.lengths or xcols, but not both.
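
As an illustration of how xcol.lengths and xcols relate, the following base R sketch converts descriptor set sizes into column index vectors. It assumes the column order described for d (optional ID column, then the response column, then the descriptor columns). The helper lengths_to_cols is hypothetical and is not part of chemmodlab; the sizes 24 and 147 are taken from the aid364 example.

```r
# Hypothetical helper (not part of chemmodlab): translate descriptor set
# sizes into the column indices each set would occupy in d.
lengths_to_cols <- function(xcol.lengths, ids = FALSE) {
  # Descriptors start after the response column, which is column 2
  # when an ID column is present and column 1 otherwise.
  first <- if (ids) 3 else 2
  ends <- first - 1 + cumsum(xcol.lengths)
  starts <- ends - xcol.lengths + 1
  Map(seq.int, starts, ends)
}

# Two descriptor sets of 24 and 147 columns, with an ID column present.
cols <- lengths_to_cols(c(24, 147), ids = TRUE)
sapply(cols, range)
```

Under these assumptions, supplying the resulting list as xcols should describe the same descriptor sets as xcol.lengths = c(24, 147).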

Details

Multiple descriptor sets can be specified by the user. For each descriptor set, repeated k-fold cross validation is performed for the specified regression and/or classification models.

Not all modeling strategies will be appropriate for all response types. For example, partial least squares linear discriminant analysis ("PLSLDA") is not directly appropriate for continuous response assays such as percent inhibition, but it can be applied once a threshold value for percent inhibition is used to create a binary (active/inactive) response.
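
A minimal sketch of the thresholding step described above, in base R. The percent inhibition values and the cutoff of 50 are made up for illustration; a suitable threshold depends on the assay.

```r
# Hypothetical percent inhibition values for five compounds.
pct_inhibition <- c(12.5, 48.0, 50.0, 73.2, 91.0)

# Arbitrary cutoff chosen for illustration only.
threshold <- 50

# Code compounds at or above the cutoff as active (1), others as inactive (0).
# The result is numeric, matching the 0/1 coding expected for a binary response.
y_binary <- as.numeric(pct_inhibition >= threshold)
y_binary
```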

See https://jrash.github.io/chemmodlab/ for more information about the models available (including model default parameters). The default value for argument models includes only some of the possible values.

Sensible default values are selected for each tunable model parameter; however, users may set any parameter manually using MakeModelDefaults and user.params.

ModelTrain predictions are based on k-fold cross-validation, where the dataset is randomly divided into k parts, each containing approximately equal numbers of compounds. Treating one of these parts as a "test set", the remaining k-1 parts are combined together as a "training set" and used to build a model from the desired modeling technique and descriptor set. This model is then applied to the "test set" to obtain predictions. The process is repeated, holding out each of the k parts in turn. One advantage of k-fold cross-validation is a reduction in the bias that arises from using the same data to both build and assess a model. Another advantage is the increased precision of error estimation offered by k-fold cross-validation over a one-time split.
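
The procedure can be sketched in base R as follows. This is an illustration of k-fold cross-validation in general, not the code ModelTrain runs: lm is used as a stand-in model, and the choice of k = 5, the seed, and the Murder response from USArrests are assumptions made for the example.

```r
set.seed(11111)
k <- 5
n <- nrow(USArrests)

# Randomly assign each observation to one of k folds of equal size.
folds <- sample(rep(seq_len(k), length.out = n))

# Hold out each fold in turn, fit on the remaining k - 1 folds,
# and predict the held-out observations.
preds <- numeric(n)
for (i in seq_len(k)) {
  test <- folds == i
  fit <- lm(Murder ~ ., data = USArrests[!test, ])
  preds[test] <- predict(fit, newdata = USArrests[test, ])
}

# Every observation now has exactly one out-of-fold prediction.
rmse <- sqrt(mean((USArrests$Murder - preds)^2))
rmse
```

Because each prediction comes from a model that never saw that observation, the resulting RMSE is not inflated by reuse of the training data.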

Recognizing that the definition of folds in k-fold cross validation may have an impact on the observed performance measures, all models are built using the same definition of folds. This process is repeated to obtain multiple separate k-fold cross validation runs, resulting in multiple separate definitions of folds. The number of these "splits" is specified by nsplits.
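
One way to picture the splits is as a list of fold assignments, one per seed, that every model then reuses. The base R sketch below follows the default seeds documented for seed.in (11111, 22222, and so on); the helper make_folds is hypothetical, and whether ModelTrain samples folds in exactly this way internally is an assumption.

```r
nsplits <- 3
nfolds <- 10
n <- nrow(USArrests)

# Default seeds as described for seed.in: 11111, 22222, 33333, ...
seeds <- seq_len(nsplits) * 11111

# Hypothetical helper: one reproducible fold assignment per seed.
make_folds <- function(seed, n, nfolds) {
  set.seed(seed)
  sample(rep(seq_len(nfolds), length.out = n))
}

# One fold definition per split; all models within a split would share it.
fold_defs <- lapply(seeds, make_folds, n = n, nfolds = nfolds)
sapply(fold_defs, function(f) table(f))
```

Fixing the seed per split makes each fold definition reproducible, so differences between models within a split cannot be attributed to different fold assignments.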

Observed performance measures are assessed across all splits using CombineSplits. This function assesses how sensitive performance measures are to fold assignments, or changes to the training and test sets. Statistical tests are used to determine the best performing model and descriptor set combination.

Value

A list of class chemmodlab is returned, containing:

all.preds

a list of lists of data frames. The elements of the outer list correspond to each CV split performed by ModelTrain. The elements of the inner list correspond to each descriptor set. For each descriptor set and CV split combination, the output is a data frame containing all model predictions. The first column of each data frame contains the true value of the response. The remaining columns contain the predictions for each model.

all.probs

a list of lists of data frames. Constructed only if there is a binary response. The structure is the same as all.preds, except that predictions are replaced by "predicted probabilities" (i.e. estimated probabilities of a response value of one). Predicted probabilities are only reported for classification models.

model.acc

a list of lists of model accuracy measures. The elements of the outer list correspond to each CV split performed by ModelTrain. The elements of the inner list correspond to each descriptor set. For each descriptor set and CV split combination, a limited collection of performance measures is given for each model fit to the data. Regression models are assessed with Pearson's r and RMSE. Classification models are assessed with contingency tables. For additional model performance measures, see Performance.

classify

a logical. Were classification models used for a binary response?

responses

a numeric vector. The observed value of the response.

data

a list of numeric matrices. Each matrix is a descriptor set used as model input.

params

a list of data frames as made by MakeModelDefaults. Each data frame contains the parameters to be set for a particular model.

des.names

a character vector specifying the descriptor set names. NA if unspecified.

models

a character vector specifying the models fit to the data.

nsplits

number of CV splits performed.
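
To make the layout of all.preds and the regression measures in model.acc more concrete, the base R sketch below computes RMSE and Pearson's r from a small data frame shaped like one element of all.preds (true response in the first column, one column of predictions per model). The data frame is a hypothetical stand-in with made-up values; it is not output from ModelTrain, and the way chemmodlab computes these measures internally may differ.

```r
# Hypothetical stand-in for one element such as cml$all.preds[[1]][[1]]:
# split 1, descriptor set 1.
preds <- data.frame(
  y    = c(1.0, 2.0, 3.0, 4.0),
  KNN  = c(1.5, 2.0, 2.5, 4.0),
  Tree = c(1.0, 2.0, 3.0, 5.0)
)

obs <- preds[[1]]       # first column: observed response
model_cols <- preds[-1] # remaining columns: one per model

# Root mean squared error and Pearson correlation for each model.
rmse <- sapply(model_cols, function(p) sqrt(mean((obs - p)^2)))
r <- sapply(model_cols, function(p) cor(obs, p))

rbind(RMSE = rmse, r = r)
```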

Author(s)

Jacqueline Hughes-Oliver, Jeremy Ash, Atina Brooks

See Also

chemmodlab, plot.chemmodlab, CombineSplits

Examples


## Not run: 
# A data set with a binary response and multiple descriptor sets
data(aid364)

cml <- ModelTrain(aid364, ids = TRUE, xcol.lengths = c(24, 147),
                  des.names = c("BurdenNumbers", "Pharmacophores"))
cml

## End(Not run)

# A continuous response
cml <- ModelTrain(USArrests, nsplits = 2, nfolds = 2,
                  models = c("KNN", "Lasso", "Tree"))
cml


[Package chemmodlab version 2.0.0 Index]