
generalization_error {regclass}

R Documentation

Calculating the generalization error of a model on a set of data

Description

This function takes a linear regression fit with lm, a logistic regression fit with glm, a partition model fit with rpart, or a random forest fit with randomForest, and calculates its generalization error on a data frame.

Usage

generalization_error(MODEL,HOLDOUT,Kfold=FALSE,K=5,R=10,seed=NA)

Arguments

`MODEL`	A linear regression model created using `lm`, a logistic regression model created using `glm`, a partition model created using `rpart`, or a random forest created using `randomForest`.
`HOLDOUT`	A dataset for which the generalization error will be calculated. If not given, the error on the data used to build the model (`MODEL`) is used.
`Kfold`	If `TRUE`, the function estimates the generalization error of `MODEL` using repeated K-fold cross-validation (regression models only).
`K`	The number of folds used in repeated K-fold cross-validation when estimating the generalization error of `MODEL`. It is useful to compare the resulting estimate to the actual generalization error on `HOLDOUT`.
`R`	The number of repeats used in repeated K-fold cross-validation.
`seed`	An optional value used to set the random number seed before estimating the generalization error, so that the cross-validation results are reproducible.
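
To make the roles of `K`, `R`, and `seed` concrete, the following is a minimal base-R sketch of repeated K-fold cross-validation for a linear regression. The helper name `repeated_kfold_rmse`, the use of the built-in `mtcars` data, and the choice to average the RMSE over repeats are all assumptions made for illustration; this is not the code regclass runs internally, and its estimate may be aggregated differently.

```r
# Hypothetical helper (not part of regclass): estimates the RMSE of an lm
# formula by repeated K-fold cross-validation.
repeated_kfold_rmse <- function(formula, data, K = 5, R = 10, seed = NA) {
  # Prime the random number generator only when a seed is supplied
  if (!is.na(seed)) set.seed(seed)
  response <- all.vars(formula)[1]   # name of the response variable
  rmse_per_repeat <- numeric(R)
  for (r in seq_len(R)) {
    # Randomly assign every row to one of K folds of near-equal size
    fold <- sample(rep(seq_len(K), length.out = nrow(data)))
    squared_errors <- numeric(nrow(data))
    for (k in seq_len(K)) {
      held_out <- fold == k
      # Fit on the other K-1 folds, predict the held-out fold
      fit <- lm(formula, data = data[!held_out, , drop = FALSE])
      pred <- predict(fit, newdata = data[held_out, , drop = FALSE])
      squared_errors[held_out] <- (data[[response]][held_out] - pred)^2
    }
    rmse_per_repeat[r] <- sqrt(mean(squared_errors))
  }
  # Assumption: summarize the R repeats by their mean
  mean(rmse_per_repeat)
}

cv_rmse <- repeated_kfold_rmse(mpg ~ wt + hp, mtcars, K = 5, R = 10, seed = 5020)
cv_rmse
```

Larger `R` reduces the noise introduced by the random fold assignment, at the cost of refitting the model `K * R` times.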

Details

This function calculates the error of MODEL on the data used to build it, its estimated generalization error from repeated K-fold cross-validation (for regression models only), and the actual generalization error on HOLDOUT. If the response is quantitative, the RMSE is reported. If the response is categorical, the confusion matrices and misclassification rates are returned.
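
The two kinds of error mentioned above can be computed by hand in base R, which may help when interpreting the output. The sketch below uses the built-in `mtcars` data rather than a regclass dataset, and the variable names, the 22-row training split, and the 0.5 probability cutoff for classification are assumptions made for illustration, not a description of how the function works internally.

```r
# Split a built-in dataset into training and holdout samples
set.seed(1010)
train_rows <- sample(seq_len(nrow(mtcars)), size = 22)
train <- mtcars[train_rows, ]
holdout <- mtcars[-train_rows, ]

# Quantitative response: root mean squared error (RMSE)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
fit <- lm(mpg ~ wt + hp, data = train)
rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_holdout <- rmse(holdout$mpg, predict(fit, newdata = holdout))

# Categorical response: confusion matrix and misclassification rate
logit <- glm(am ~ mpg, data = train, family = binomial)
prob <- predict(logit, newdata = holdout, type = "response")
# Assumption: classify as 1 when the predicted probability exceeds 0.5
predicted <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(holdout$am, levels = c(0, 1))
confusion <- table(Actual = actual, Predicted = predicted)
misclass_rate <- 1 - sum(diag(confusion)) / sum(confusion)

rmse_train
rmse_holdout
confusion
misclass_rate
```

A holdout RMSE or misclassification rate that is much larger than its training counterpart suggests the model is overfitting the training data.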

Author(s)

Adam Petrie

References

Introduction to Regression and Modeling

Examples


  #Education analytics
  data(STUDENT)
  set.seed(1010)
  train.rows <- sample(1:nrow(STUDENT),0.7*nrow(STUDENT))
  TRAIN <- STUDENT[train.rows,]
  HOLDOUT <- STUDENT[-train.rows,]
  M <- lm(CollegeGPA~.,data=TRAIN)
  #Also estimate the generalization error of the model
  generalization_error(M,HOLDOUT,Kfold=TRUE,seed=5020)
  #Try a partition model and a random forest, though they do not perform as well as regression here
  TREE <- rpart(CollegeGPA~.,data=TRAIN)
  FOREST <- randomForest(CollegeGPA~.,data=TRAIN)
  generalization_error(TREE,HOLDOUT)
  generalization_error(FOREST,HOLDOUT) 

  #Wine
  data(WINE)
  set.seed(2020)
  train.rows <- sample(1:nrow(WINE),0.7*nrow(WINE))
  TRAIN <- WINE[train.rows,]
  HOLDOUT <- WINE[-train.rows,]
  M <- glm(Quality~.^2,data=TRAIN,family=binomial)
  generalization_error(M,HOLDOUT)
  #Random forest predicts best on the holdout sample
  TREE <- rpart(Quality~.,data=TRAIN)
  FOREST <- randomForest(Quality~.,data=TRAIN)
  generalization_error(TREE,HOLDOUT)
  generalization_error(FOREST,HOLDOUT)

[Package regclass version 1.6 Index]