R: Random generalized linear model predictor thinning

thinRandomGLM {randomGLM}

R Documentation

Random generalized linear model predictor thinning

Description

This function allows the user to define a "thinned" version of a random generalized linear model predictor by focusing on those features that occur relatively frequently.

Usage

thinRandomGLM(rGLM, threshold)

Arguments

`rGLM`	a `randomGLM` object such as one returned by `randomGLM`.
`threshold`	integer threshold on the number of times a feature was selected across the bags in `rGLM`. Only features selected at least `threshold + 1` times are retained; features selected `threshold` times or fewer are removed from the underlying regression models when the models are re-fit. Appearances in interactions do not count toward this number.

Details

The function "thins out" (reduces) a previously constructed random generalized linear model predictor by removing rarely selected features and refitting the (generalized) linear model (GLM) in each bag. Each per-bag GLM is refit using only those features that were selected more than `threshold` times across all `nBags` bags. The selection count excludes interactions; in other words, the threshold is applied to the first row of `timesSelectedByForwardRegression`.
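The retention rule can be illustrated with a minimal base R sketch. The count matrix, feature names (`f1` to `f4`) and row names below are invented for illustration and only mimic the layout of `timesSelectedByForwardRegression`; the code is not the package's implementation, just the strict "more than `threshold`" comparison described above.

```r
# Hypothetical selection counts for 4 features and maxInteractionOrder = 2
# (all values invented for illustration).
timesSelected = matrix(
   c(0, 7, 23, 0,    # row 1: selections as a main effect
     5, 0,  2, 9),   # row 2: selections within pairwise interactions
   nrow = 2, byrow = TRUE,
   dimnames = list(c("order1", "order2"), c("f1", "f2", "f3", "f4")))

threshold = 7
# Only the first row is compared with the threshold, and the comparison is strict
keep = timesSelected[1, ] > threshold
retained = names(keep)[keep]
retained
```

In this sketch `f2`, selected exactly `threshold` times, is dropped, and `f4` is dropped despite its 9 appearances in interactions, because only first-row counts are considered.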

Value

The function returns a valid `randomGLM` object (see `randomGLM` for details) that can be used as input to the `predict()` method (see `predict.randomGLM`). The returned object is a copy of the input `rGLM` in which the following components are updated:

`predictedOOB`	the updated continuous prediction (if `classify` is `FALSE`) or predicted classification (if `classify` is `TRUE`) of the input data based on out-of-bag samples.
`predictedOOB.response`	In case of a binary outcome, the updated predicted probability of each outcome specified by `y` based on out-of-bag samples. In case of a continuous outcome, this is the predicted value based on out-of-bag samples (i.e., a copy of `predictedOOB`).
`featuresInForwardRegression`	features selected by forward selection in each bag. A list with one component per bag. Each component is a matrix with `maxInteractionOrder` rows. Each column represents one interaction obtained by multiplying the features indicated by the entries in each column (0 means no feature, i.e. a lower order interaction).
`coefOfForwardRegression`	coefficients of forward regression. A list with one component per bag. Each component is a vector giving the coefficients of the model determined by forward selection in the corresponding bag. The order of the coefficients is the same as the order of the terms in the corresponding component of `featuresInForwardRegression`.
`interceptOfForwardRegression`	a vector with one component per bag giving the intercept of the regression model in each bag.
`timesSelectedByForwardRegression`	a matrix with `maxInteractionOrder` rows and one column per feature. Each entry gives the number of times the corresponding feature appeared in a predictor model at the corresponding order of interactions. Interactions where a single feature enters more than once (e.g., a quadratic interaction of the feature with itself) are counted once.
`models`	the "thinned" regression models for each bag.
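
To make the layout of `featuresInForwardRegression` and the counting rule of `timesSelectedByForwardRegression` more concrete, the following base R sketch tallies counts from a hand-made list. The two bags, the three features and the names `counts` and `termOrder` are hypothetical, and treating the number of nonzero entries in a column as the interaction order is one reading of the description above, not the package's own code.

```r
# Hypothetical forward-selection results for 2 bags, 3 features and
# maxInteractionOrder = 2. Each column is one model term; 0 means "no feature".
featuresInForwardRegression = list(
   bag1 = matrix(c(1, 0,   2, 3), nrow = 2),   # terms: x1 and x2*x3
   bag2 = matrix(c(1, 0,   3, 3), nrow = 2))   # terms: x1 and x3*x3
nFeatures = 3
maxInteractionOrder = 2

counts = matrix(0, nrow = maxInteractionOrder, ncol = nFeatures)
for (bag in featuresInForwardRegression) {
   for (term in seq_len(ncol(bag))) {
      features = bag[, term]
      features = features[features > 0]
      termOrder = length(features)   # assumed: order = number of nonzero entries
      # A feature entering the same term more than once is counted once
      for (f in unique(features))
         counts[termOrder, f] = counts[termOrder, f] + 1
   }
}
counts
```

Here the first row of `counts` records that feature 1 was selected as a main effect in both bags, and the second row records the interaction appearances of features 2 and 3, with `x3*x3` contributing a single count for feature 3.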

Author(s)

Lin Song, Steve Horvath, Peter Langfelder

References

Lin Song, Peter Langfelder, Steve Horvath: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics (2013)

Examples


## binary outcome prediction
# data generation
data(iris)
# Restrict data to first 100 observations
iris=iris[1:100,]
# Turn Species into a factor
iris$Species = as.factor(as.character(iris$Species))
# Select a training and a test subset of the 100 observations
set.seed(1)
indx = sample(100, 67, replace=FALSE)
xyTrain = iris[indx,]
xyTest = iris[-indx,]
xTrain = xyTrain[, -5]
yTrain = xyTrain[, 5]

xTest = xyTest[, -5]
yTest = xyTest[, 5]

# predict with a small number of bags - normally nBags should be at least 100.
RGLM = randomGLM(
   xTrain, yTrain, 
   nCandidateCovariates=ncol(xTrain), 
   nBags=30, 
   keepModels = TRUE, nThreads = 1)
table(RGLM$timesSelectedByForwardRegression[1, ])
# 0  7 23 
# 2  1  1 

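# Keep only features selected more than 7 times (here, the feature selected 23 times)
thinnedRGLM = thinRandomGLM(RGLM, threshold=7)
# Predict on the test set with the thinned and with the original predictor
predictedThinned = predict(thinnedRGLM, newdata = xTest, type="class")
predictedFull = predict(RGLM, newdata = xTest, type="class")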

[Package randomGLM version 1.10-1 Index]