model.build {ModelMap} | R Documentation |
Model Building
Description
Create sophisticated models using Random Forest, Quantile Regression Forests, Conditional Forests, or Stochastic Gradient Boosting from training data
Usage
model.build(model.type = NULL, qdata.trainfn = NULL, folder = NULL,
MODELfn = NULL, predList = NULL, predFactor = FALSE, response.name = NULL,
response.type = NULL, unique.rowname = NULL, seed = NULL, na.action = NULL,
keep.data = TRUE, ntree = switch(model.type,RF=500,QRF=1000,CF=500,500),
mtry = switch(model.type,RF=NULL,QRF=ceiling(length(predList)/3),
CF = min(5,length(predList)-1),NULL), replace = TRUE, strata = NULL,
sampsize = NULL, proximity = FALSE, importance=FALSE,
quantiles=c(0.1,0.5,0.9), subset = NULL, weights = NULL,
controls = NULL, xtrafo = NULL, ytrafo = NULL, scores = NULL)
Arguments
model.type |
String. Model type. |
qdata.trainfn |
String. The name (full path or base name with path specified by |
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
MODELfn |
String. The file name to use to save files related to the model object. If |
predList |
String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the If both |
predFactor |
String. A character vector of predictor short names of the predictors from |
response.name |
String. The name of the response variable used to build the model. If |
response.type |
String. Response type: |
unique.rowname |
String. The name of the unique identifier used to identify each row in the training data. If |
seed |
Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If |
na.action |
String. Model validation. Specifies the action to take if there are |
keep.data |
Logical. RF and SGB models. Should a copy of the predictor data be included in the model object. Useful for if |
ntree |
Integer. RF, QRF, and CF models. The number of trees in the forest. The default is 500 trees for RF and CF models and 1000 trees for QRF models. |
mtry |
Integer. RF, QRF, and CF models. Number of variables to try at each node of the forest's trees. By default, RF models will use the |
replace |
Logical. RF models. Should sampling of cases be done with or without replacement? |
strata |
Factor or String. RF models. A (factor) variable that is used for stratified sampling. Can be in the form of either the name of the column in |
sampsize |
Vector. RF models. Size(s) of sample to draw. For classification, if |
proximity |
Logical. RF models. Should proximity measure among the rows be calculated for unsupervised models? |
importance |
Logical. QRF models. For QRF models only, importance must be specified at the time of model building. If TRUE importance of predictors is assessed at the given |
quantiles |
Numeric. Used for QRF models if |
subset |
CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: |
weights |
CF models. An optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities |
controls |
CF models. An object of class |
xtrafo |
CF models. A function to be applied to all input variables. By default, the |
ytrafo |
CF models. A function to be applied to all response variables. By default, the |
scores |
CF models. An optional named list of scores to be attached to ordered factors. Note: |
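The default expressions for ntree and mtry given under Usage depend on model.type and on the length of predList. As a rough illustration, they can be evaluated on their own in base R. The helper names default.ntree() and default.mtry() are hypothetical and are not part of ModelMap; the switch() expressions themselves are copied from the Usage signature.

```r
# Hypothetical helpers wrapping the default expressions from the Usage signature.
default.ntree <- function(model.type) {
  switch(model.type, RF = 500, QRF = 1000, CF = 500, 500)
}

default.mtry <- function(model.type, predList) {
  switch(model.type,
         RF  = NULL,                                # NULL: left to the RF default behaviour
         QRF = ceiling(length(predList) / 3),
         CF  = min(5, length(predList) - 1),
         NULL)
}

predList <- c("TCB", "TCG", "TCW", "NLCD")

default.ntree("QRF")                    # 1000
default.mtry("QRF", predList)           # ceiling(4 / 3) = 2
default.mtry("CF", predList)            # min(5, 3) = 3
is.null(default.mtry("RF", predList))   # TRUE
```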
Details
This package provides a push button approach to complex model building and production mapping. It contains three main functions: model.build, model.diagnostics, and model.mapmake.
In addition, it contains a simple function, get.test, that can be used to randomly divide a training dataset into training and test/validation sets; build.rastLUT, which uses GUI prompts to walk a user through the process of setting up a raster look up table to link predictors from the training data with the rasters used for map construction; model.explore, for preliminary data exploration; and model.importance.plot and model.interaction.plot, for interpreting the effects of individual model predictors.
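The idea of randomly dividing a training dataset into training and test sets can be sketched in base R as follows. This is not the get.test implementation and does not use its arguments; the toy data frame, the proportion p.test, and the object names are assumptions made for illustration only.

```r
set.seed(38)

# Toy training table; the ID column plays the role of unique.rowname.
qdata <- data.frame(ID  = 1:20,
                    BIO = runif(20),
                    TCB = rnorm(20))

p.test <- 0.2                                  # assumed proportion held out for testing
n.test <- round(nrow(qdata) * p.test)

# Draw the test rows at random without replacement; the rest form the training set.
test.rows   <- sample(nrow(qdata), n.test)
qdata.test  <- qdata[test.rows, ]
qdata.train <- qdata[-test.rows, ]

nrow(qdata.test)    # 4
nrow(qdata.train)   # 16
```

Setting the seed before sampling makes the split reproducible, in the same spirit as the seed argument of model.build.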
These functions can be run in a traditional R command mode, where all arguments are specified in the function call. However, they can also be used in a full push button mode, where you type in, for example, the simple command model.build, and GUI pop up windows will ask questions about the type of model, the file locations of the data, etc.
When running the ModelMap package on non-Windows platforms, file names and folders need to be specified in the argument list, but other pushbutton selections are handled by the select.list() function, which is platform independent.
Binary, categorical, and continuous response models are supported for Random Forest and Conditional Forest. Quantile Regression Forests is appropriate only for continuous response models.
Random Forest is implemented through the randomForest package within R. Random Forest is more user friendly than Stochastic Gradient Boosting, as it has fewer parameters to be set by the user, and is less sensitive to tuning of these parameters. A Random Forest model consists of multiple trees that vote on predictions. For each tree a random subset of the training data is used to construct the tree, with the remaining data points used to construct out-of-bag (OOB) error estimates. At each node of the tree a random selection of predictors is chosen to determine the split. The number of predictors used to select the splits (argument mtry) is the primary user specified parameter that can affect model performance.
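The two sources of randomness described above, the random subset of rows used for each tree and the random selection of predictors at each node, can be sketched in base R. This is a conceptual illustration only, not the randomForest algorithm; the sample size n, the mtry value, and the object names are assumptions.

```r
set.seed(38)

n        <- 100                                # assumed number of training rows
predList <- c("TCB", "TCG", "TCW", "NLCD")
mtry     <- 2                                  # assumed number of predictors tried per node

# Rows drawn (with replacement) to grow one tree.
inbag <- sample(n, n, replace = TRUE)

# Rows never drawn are "out-of-bag" for this tree and can be used
# to estimate its prediction error.
oob <- setdiff(seq_len(n), inbag)
length(oob) / n    # about 0.37 on average when sampling n rows with replacement

# At a given node, only a random subset of mtry predictors is considered for the split.
split.candidates <- sample(predList, mtry)
split.candidates
```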
By default, mtry will be automatically optimized using the tuneRF() function from the randomForest package. Note that this is a stochastic process. If there is a chance that models may be combined later with the combine function from the randomForest package, then for consistency it is important to provide the mtry argument rather than using the default optimization process.
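One way to choose a fixed mtry ahead of time is to run tuneRF() once and then pass the selected value explicitly. The sketch below is an assumption-laden illustration: it uses the built-in mtcars data rather than ModelMap training data, the tuning settings are arbitrary, and it does not reproduce how model.build calls tuneRF() internally. It is skipped if the randomForest package is not installed.

```r
best.mtry <- NA

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(38)   # tuneRF() is stochastic, so fix the seed for repeatability

  # Stand-in predictors and continuous response from a built-in dataset.
  x <- mtcars[, c("cyl", "disp", "hp", "wt", "qsec")]
  y <- mtcars$mpg

  # Search over mtry; the result is a matrix with columns "mtry" and "OOBError".
  tuned <- randomForest::tuneRF(x, y,
                                ntreeTry   = 100,
                                stepFactor = 2,
                                improve    = 0.05,
                                trace      = FALSE,
                                plot       = FALSE)

  # Keep the mtry value with the lowest out-of-bag error.
  best.mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
}

best.mtry
# The chosen value could then be supplied through the mtry argument of
# model.build so that every model built later uses the same setting.
```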
Random Forest will not overfit data, therefore the only penalty of increasing the number of trees is computation time. Random Forest can compute variable importance, an advantage over some "black box" modeling techniques if it is important to understand the ecological relationships underlying a model (Breiman, 2001).
Quantile Regression Forests is implemented through the quantregForest package.
Conditional Forests is implemented with the cforest() function in the party package. As stated in the party package, ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.
For CF models, ModelMap currently only supports binary, categorical and continuous response models. Also, for some CF model parameters (subset, weights, and scores) ModelMap only provides OOB and independent test set diagnostics, and does not support cross validation diagnostics.
Stochastic gradient boosting is not currently supported by ModelMap.
Value
The function will return the model object. Additionally, it will write a text file to disk, in the folder specified by folder. This file lists the values of each argument as chosen from GUI prompts used for the function call.
Author(s)
Elizabeth Freeman and Tracey Frescino
References
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983-999. http://jmlr.csail.mit.edu/papers/v7/
Ridgeway, G., (1999). The state of boosting. Comp. Sci. Stat. 31:172-181
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25
Strobl, C., Malley, J. and Tutz, G. (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323-348.
Hothorn, T., Lausen, B., Benner, A. and Radespiel-Troeger, M. (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77-91.
Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A. and van der Laan, M. J. (2006a). Survival Ensembles. Biostatistics, 7(3), 355-373.
Hothorn, T., Hornik, K. and Zeileis, A. (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
See Also
get.test, model.diagnostics, model.mapmake
Examples
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################
# set seed:
seed=38
# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
# Define folder for all output:
folder=getwd()
#identifier for individual training and test data points
unique.rowname="ID"
###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################
########## Continuous Response, Continuous Predictors ############
#file name:
MODELfn="RF_Bio_TC"
#predictors:
predList=c("TCB","TCG","TCW")
#define which predictors are categorical:
predFactor=FALSE
# Response name and type:
response.name="BIO"
response.type="continuous"
########## binary Response, Continuous Predictors ############
#file name to store model:
MODELfn="RF_CONIFTYP_TC"
#predictors:
predList=c("TCB","TCG","TCW")
#define which predictors are categorical:
predFactor=FALSE
# Response name and type:
response.name="CONIFTYP"
# This variable is 1 if a conifer or mixed conifer type is present,
# otherwise 0.
response.type="binary"
########## Continuous Response, Categorical Predictors ############
# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
# na.action = "na.omit"
# Any validation datapoint or image pixel with a value for any
# categorical predictor not found in the training data will be
# returned as NA.
# na.action = "na.roughfix"
# Any validation datapoint or image pixel with a value for any
# categorical predictor not found in the training data will have
# the most common category for that predictor substituted,
# and then a prediction will be made.
# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList
#file name to store model:
MODELfn="RF_BIO_TCandNLCD"
#predictors:
predList=c("TCB","TCG","TCW","NLCD")
#define which predictors are categorical:
predFactor=c("NLCD")
# Response name and type:
response.name="BIO"
response.type="continuous"
###########################################################################
########################### build model: ##################################
###########################################################################
### create model before batching (only run this code once ever!) ###
model.obj = model.build( model.type="RF",
qdata.trainfn=qdata.trainfn,
folder=folder,
unique.rowname=unique.rowname,
MODELfn=MODELfn,
predList=predList,
predFactor=predFactor,
response.name=response.name,
response.type=response.type,
seed=seed,
na.action="na.roughfix"
)
## End(Not run) # end dontrun