ModTools-package {ModTools}    R Documentation
Regression and Classification Tools
Description
There is a rich selection of R packages implementing algorithms for classification and regression tasks. The authors legitimately take the liberty to tailor the function interfaces according to their own taste and needs. For us other users, however, this often results in struggling with user interfaces, some of which are rather weird - to put it mildly - and almost always different in terms of arguments and result structures.
ModTools pursues the goal of offering uniform handling for the most important regression and classification models in applied data analyses.
The function FitMod() is designed as a simple and consistent interface to these original functions, while maintaining the flexibility to pass on all their possible arguments. print, plot, summary and predict operations can thus be carried out following the same logic. The results will again be reshaped to a reasonable standard.
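A minimal sketch of this idea (reusing the swiss data and fitfn values that also appear in the examples below; the exact shape of the output is simplified here):

# one calling convention, regardless of the underlying model
r.lm <- FitMod(Fertility ~ ., data=swiss, fitfn="lm")
r.rp <- FitMod(Fertility ~ ., data=swiss, fitfn="rpart")

summary(r.lm)                  # summary in the familiar form
predict(r.rp, newdata=swiss)   # predictions with consistent defaults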
For all the functions in this package, the Google style guide is used as naming convention (in the absence of convincing alternatives). The 'BigCamelCase' style has been applied consistently to functions borrowed from contributed R packages as well.
As always: feedback, feature requests, bug reports and other suggestions are welcome!
Details
The ModTools::FitMod() function comprises interfaces to the following models:
Regression:
lm() | Linear model OLS (base)
lmrob() | Robust linear model (robustbase)
poisson() | GLM with family poisson (base)
negbin() | GLM with family negative.binomial (MASS)
gamma() | GLM with family Gamma (base)
tobit() | Tobit model for censored responses (AER)
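The tobit interface, for instance, could be used along these lines (a sketch only; the Affairs data ship with AER, and the formula follows that package's tobit examples):

# Tobit model for the left-censored number of affairs
data("Affairs", package="AER")
r.tobit <- FitMod(affairs ~ age + yearsmarried + religiousness + occupation + rating,
                  data=Affairs, fitfn="tobit")
summary(r.tobit)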
Classification:
lda() | Linear discriminant analysis (MASS)
qda() | Quadratic discriminant analysis (MASS)
logit() | Logistic regression model: glm with family binomial(logit) (base)
multinom() | Multinomial regression model (nnet)
polr() | Proportional odds model (MASS)
rpart() | Regression and classification trees (rpart)
nnet() | Neural networks (nnet)
randomForest() | Random forests (randomForest)
C5.0() | C5.0 tree (C50)
svm() | Support vector machines (e1071)
naive_bayes() | Naive Bayes classifier (naivebayes)
LogitBoost() | Logit boost, using decision stumps as weak learners (ModTools)
Preprocess:
SplitTrainTest() | Split a data frame or index vector into a training and a test sample
OverSample() | Get balanced datasets by sampling with replacement
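A short sketch of this preprocessing step (d.pima as in the examples below; the OverSample() argument naming the response variable is an assumption here):

# split off a test sample, then balance the training classes
d.pim <- SplitTrainTest(d.pima, p = 0.2)
d.bal <- OverSample(d.pim$train, vname="diabetes")   # argument name assumed
table(d.bal$diabetes)   # classes should now be roughly balanced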
Manipulating rpart objects:
CP() | Extract and plot the complexity table of an rpart tree
Node() | Accessor to the most important properties of a node, be it a split or a leaf
Rules() | Extract the decision rules from the top to the end node of an rpart tree
LeafRates() | Return the misclassification rates in all end nodes
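These accessors might be used roughly as follows (a sketch; rp.pima is fitted as in the examples below, and the Node() argument name is an assumption):

rp.pima <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
CP(rp.pima)            # complexity table of the tree
Rules(rp.pima)         # decision rules down to each end node
LeafRates(rp.pima)     # misclassification rates per leaf
Node(rp.pima, node=1)  # properties of a single node; argument name assumed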
Prediction and Validation:
Response() | Extract the response variable of any model
predict() | Consistent predict for FitMod models
VarImp() | Variable importance for most FitMod models
ROC() | ROC curves for all dichotomous classification FitMod models
BestCut() | Find the optimal cut for a classification based on the ROC curve
PlotLift() | Produce a lift chart for a binary classification model
TModC() | Aggregated results for multiple FitMod classification models
Tune() | Tuning approaches to find optimal parameters for FitMod classification models
RobSummary() | Robust summary for GLM models (poisson)
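For the dichotomous case, a typical validation sequence could look like this (a sketch, reusing the pima logit model from the examples below and assuming BestCut() accepts the ROC object):

r.pima <- FitMod(diabetes ~ ., d.pima, fitfn="logit")
rr <- ROC(r.pima)   # ROC curve of the logit model
plot(rr)
BestCut(rr)         # optimal cutpoint based on the ROC curve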
Tests:
BreuschPaganTest() | Breusch-Pagan test against heteroskedasticity
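Applied to the linear model from the examples below, this might look like (a sketch, assuming the test accepts a fitted model object, as lmtest::bptest() does):

r.swiss <- FitMod(Fertility ~ ., swiss, fitfn="lm")
BreuschPaganTest(r.swiss)   # H0: homoskedastic errors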
Warning
This package is still under development. You should be aware that everything in the package might be subject to change: backward compatibility is not yet guaranteed, functions may be deleted or renamed, and new syntax may be inconsistent with earlier versions. With the release of version 1.0, the usual "deprecated-defunct" process will be installed.
Author(s)
Andri Signorell
Helsana Versicherungen AG, Health Sciences, Zurich
HWZ University of Applied Sciences in Business Administration Zurich.
Includes R source code and/or documentation previously published by (in alphabetical order):
Bernhard Compton, Marcel Dettling, Max Kuhn, Michal Majka, Dan Putler, Jarek Tuszynski, Robin Xavier, Achim Zeileis
The good things come from all these guys; any problems are likely due to my tweaking.
Thank you all!
Maintainer: Andri Signorell <andri@signorell.net>
Examples
r.swiss <- FitMod(Fertility ~ ., swiss, fitfn="lm")
r.swiss
# PlotTA(r.swiss)
# PlotQQNorm(r.swiss)
## Count models
data(housing, package="MASS")
# poisson count
r.pois <- FitMod(Freq ~ Infl*Type*Cont + Sat, family=poisson, data=housing, fitfn="poisson")
# negative binomial count
r.nb <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="negbin")
summary(r.nb)
r.log <- FitMod(log(Freq) ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.log)
r.ols <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.ols)
r.gam <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="gamma")
summary(r.gam)
r.gami <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="gamma", link="identity")
summary(r.gami)
old <- options(digits=3)
TMod(r.pois, r.nb, r.log, r.ols, r.gam, r.gami)
options(old)
## Ordered Regression
r.polr <- FitMod(Sat ~ Infl + Type + Cont, data=housing, fitfn="polr", weights = Freq)
# multinomial Regression
# r.mult <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont, data=housing,
# weights = housing$Freq, fitfn="multinom")
# Regression tree
r.rp <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont, data=housing,
weights = housing$Freq, fitfn="rpart")
# compare predictions
d.p <- expand.grid(Infl=levels(housing$Infl), Type=levels(housing$Type),
                   Cont=levels(housing$Cont))
d.p$polr <- predict(r.polr, newdata=d.p)
# ??
# d.p$ols <- factor(round(predict(r.ols, newdata=d.p)^2), labels=levels(housing$Sat))
# d.p$mult <- predict(r.mult, newdata=d.p)
d.p$rp <- predict(r.rp, newdata=d.p, type="class")
d.p
# Classification with 2 classes ***************
r.pima <- FitMod(diabetes ~ ., d.pima, fitfn="logit")
r.pima
Conf(r.pima)
plot(ROC(r.pima))
OddsRatio(r.pima)
# rpart tree
rp.pima <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
rp.pima
Conf(rp.pima)
lines(ROC(rp.pima), col=hblue)
# to be improved
plot(rp.pima, col=SetAlpha(c("blue","red"), 0.4), cex=0.7)
# Random Forest
rf.pima <- FitMod(diabetes ~ ., d.pima, method="class", fitfn="randomForest")
rf.pima
Conf(rf.pima)
lines(ROC(rf.pima), col=hred)
# more models to compare
d.pim <- SplitTrainTest(d.pima, p = 0.2)
mdiab <- formula(diabetes ~ pregnant + glucose + pressure + triceps
+ insulin + mass + pedigree + age)
r.glm <- FitMod(mdiab, data=d.pim$train, fitfn="logit")
r.rp <- FitMod(mdiab, data=d.pim$train, fitfn="rpart")
r.rf <- FitMod(mdiab, data=d.pim$train, fitfn="randomForest")
r.svm <- FitMod(mdiab, data=d.pim$train, fitfn="svm")
r.c5 <- FitMod(mdiab, data=d.pim$train, fitfn="C5.0")
r.nn <- FitMod(mdiab, data=d.pim$train, fitfn="nnet")
r.nb <- FitMod(mdiab, data=d.pim$train, fitfn="naive_bayes")
r.lda <- FitMod(mdiab, data=d.pim$train, fitfn="lda")
r.qda <- FitMod(mdiab, data=d.pim$train, fitfn="qda")
r.lb <- FitMod(mdiab, data=d.pim$train, fitfn="lb")
mods <- list(glm=r.glm, rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5
, nn=r.nn, nb=r.nb, lda=r.lda, qda=r.qda, lb=r.lb)
# insight in the Regression tree
plot(r.rp, box.palette = as.list(Pal("Helsana", alpha = 0.5)))
# Insample accuracy ...
TModC(mods, ord="auc")
# ... is substantially different from the out-of-sample accuracy on the test set:
TModC(mods, newdata=d.pim$test, reference=d.pim$test$diabetes, ord="bs")
# C5 and SVM turn out to be show-offs! They overfit considerably,
# whereas random forest and logit keep their promises. ...
sapply(mods, function(z) VarImp(z))
# Multinomial classification problem with n classes ***************
d.gl <- SplitTrainTest(d.glass, p = 0.2)
mglass <- formula(Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe)
# *** raises an unclear error in CRAN-Debian tests *** ??
# r.mult <- FitMod(mglass, data=d.gl$train, maxit=600, fitfn="multinom")
r.rp <- FitMod(mglass, data=d.gl$train, fitfn="rpart")
r.rf <- FitMod(mglass, data=d.gl$train, fitfn="randomForest")
r.svm <- FitMod(mglass, data=d.gl$train, fitfn="svm")
r.c5 <- FitMod(mglass, data=d.gl$train, fitfn="C5.0")
r.nn <- FitMod(mglass, data=d.gl$train, fitfn="nnet")
r.nbay <- FitMod(mglass, data=d.gl$train, fitfn="naive_bayes")
r.lda <- FitMod(mglass, data=d.gl$train, fitfn="lda")
# r.qda <- FitMod(mglass, data=d.glass, fitfn="qda")
r.lb <- FitMod(mglass, data=d.gl$train, fitfn="lb")
mods <- list(rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5,
nn=r.nn, nbay=r.nbay, lda=r.lda, lb=r.lb)
# confusion matrix and other quality measures can be calculated with Conf()
Conf(r.rf)
# we only extract the general accuracy
sapply(lapply(mods, function(z) Conf(z)), "[[", "acc")
# let's compare r.mult with a model without RI as predictor
# Conf(r.mult)
# Conf(update(r.mult, . ~ . -RI))