hisee {sgee} | R Documentation |
Hierarchical Stagewise Estimating Equations Implementation.
Description
Function to perform HiSEE, a bi-level boosting / functional gradient descent / forward stagewise regression method for grouped covariates, using Generalized Estimating Equations.
Usage
hisee(y, ...)
## S3 method for class 'formula'
hisee(formula, data = list(), clusterID, waves = NULL,
contrasts = NULL, subset, ...)
## Default S3 method:
hisee(y, x, waves = NULL, ...)
## S3 method for class 'fit'
hisee(y, x, family, clusterID, waves = NULL,
groupID = 1:ncol(x), corstr = "independence", alpha = NULL,
intercept = TRUE, offset = 0, control = sgee.control(maxIt = 200,
epsilon = 0.05, stoppingThreshold = min(length(y), ncol(x)) - intercept,
undoThreshold = 0), standardize = TRUE, verbose = FALSE, ...)
Arguments
y |
Vector of response measures that corresponds with modeling family given in 'family' parameter. 'y' is assumed to be the same length as 'clusterID' and is assumed to be organized into clusters as dictated by 'clusterID'. |
... |
Not currently used |
formula |
Object of class 'formula'; a symbolic description of the model to be fitted |
data |
Optional data frame containing the variables in the model. |
clusterID |
Vector of integers that identifies the clusters of response measures in 'y'. Data and 'clusterID' are assumed to be 1) of equal lengths, 2) sorted so that observations of a cluster are in contiguous rows, and 3) organized so that 'clusterID' is a vector of consecutive integers. |
waves |
An integer vector which identifies components in clusters.
The length of |
contrasts |
An optional list provided when using a formula.
similar to |
subset |
An optional vector specifying a subset of observations to be used in the fitting process. |
x |
Design matrix of dimension length(y) x nvars where each row represents an observation of the predictor variables. Assumed to be scaled. |
family |
Modeling family that describes the marginal distribution of the response. Assumed to be an object such as 'gaussian()' or 'poisson()' |
groupID |
Vector of integers that identifies the groups of the covariates/coefficients (i.e. the columns of 'x'). 'x' and 'groupID' are assumed to be 1) of corresponding dimension (i.e. ncol(x) == length(groupID)), 2) sorted so that groups of covariates are in contiguous columns, and 3) organized so that 'groupID' is a vector of consecutive integers. |
corstr |
A character string indicating the desired working correlation structure. The following are implemented: "independence" (default value), "exchangeable", and "ar1". |
alpha |
An initial guess for the correlation parameter value, between -1 and 1. If left NULL (the default), the initial estimate is 0. |
intercept |
Binary value indicating whether an intercept term is to be included in the model for estimation. Default is to include an intercept. |
offset |
Vector of offset value(s) for the linear predictor. 'offset' is assumed to be either of length one, or of the same length as 'y'. Default is to have no offset. |
control |
A list of parameters used to control the path generation
process; see |
standardize |
A logical parameter that indicates whether or not
the covariates need to be standardized before fitting.
If standardized before fitting, the unstandardized
path is returned as the default, with a |
verbose |
Logical parameter indicating whether output should be produced while hisee is running. Default value is FALSE. |
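The layout assumptions on 'clusterID', 'waves', and 'groupID' described above can be illustrated with a small base R sketch. The sizes used here (50 clusters of 4 observations, 10 groups of 5 covariates) and the helper isContiguousConsecutive are hypothetical and chosen only for illustration; the helper is not part of the sgee package.

```r
## Hypothetical sizes, for illustration only
numClusters <- 50
clusterSize <- 4
numGroups <- 10
groupSize <- 5

## Observations of a cluster sit in contiguous rows,
## and clusters are labelled by consecutive integers
clusterID <- rep(seq_len(numClusters), each = clusterSize)

## Position of each observation within its cluster
waves <- rep(seq_len(clusterSize), times = numClusters)

## Covariates of a group sit in contiguous columns,
## and groups are labelled by consecutive integers
groupID <- rep(seq_len(numGroups), each = groupSize)

## Illustrative check (not an sgee function): labels never decrease
## and never skip a value from one position to the next
isContiguousConsecutive <- function(id) {
    all(diff(id) %in% c(0, 1))
}

stopifnot(isContiguousConsecutive(clusterID),
          isContiguousConsecutive(groupID))
```

Vectors built this way can be passed as the 'clusterID', 'waves', and 'groupID' arguments, provided the rows of the data and the columns of 'x' are ordered to match.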
Details
Function to implement HiSEE, a stagewise regression approach designed to perform hierarchical selection in the context of Generalized Estimating Equations. Given a response Y and a design matrix X (excluding the intercept), HiSEE generates a path of stagewise regression estimates for each covariate based on the provided step size epsilon. In each iterative step, an optimal group of covariates is identified first; an optimal covariate within that group is then selected and updated.
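The two-level selection described above can be sketched in a few lines of base R. This is a toy illustration, not the sgee implementation: it assumes a Gaussian family, an independence working correlation, and no intercept, so the estimating function reduces to t(x) %*% (y - x %*% beta). The function name stagewiseSketch and both selection criteria (root-mean-square of the estimating function per group, then largest absolute value within the chosen group) are assumptions made for this example.

```r
## Toy bi-level stagewise sketch (illustrative only; not sgee internals)
stagewiseSketch <- function(y, x, groupID = 1:ncol(x),
                            epsilon = 0.05, maxIt = 200) {
    beta <- rep(0, ncol(x))
    path <- matrix(0, nrow = maxIt, ncol = ncol(x))
    for (it in seq_len(maxIt)) {
        ## Estimating function at the current estimate
        ## (Gaussian family, independence working correlation)
        u <- drop(crossprod(x, y - x %*% beta))
        ## Level 1: pick the group with the largest root-mean-square
        ## of the estimating function (assumed criterion)
        groupScore <- tapply(u, groupID, function(g) sqrt(mean(g^2)))
        bestGroup <- names(groupScore)[which.max(groupScore)]
        ## Level 2: pick the covariate with the largest |u| in that group
        inGroup <- which(as.character(groupID) == bestGroup)
        j <- inGroup[which.max(abs(u[inGroup]))]
        ## Move the selected coefficient by one step of size epsilon
        beta[j] <- beta[j] + epsilon * sign(u[j])
        path[it, ] <- beta
    }
    path
}

## Simulated data: only the first group carries signal
set.seed(1)
x <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
y <- drop(x %*% c(2, 2, 0, 0, 0, 0)) + rnorm(200)
path <- stagewiseSketch(y, x, groupID = rep(1:3, each = 2))
```

Each row of the returned matrix holds the coefficient estimates after one iteration, so consecutive rows differ in at most one coefficient. With the default groupID = 1:ncol(x) every covariate forms its own group, and the group-level step no longer restricts the choice.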
The resulting path can then be analyzed to determine an optimal model along the path of coefficient estimates. The summary.sgee function provides such functionality based on various possible metrics, primarily focused on the Mean Squared Error. Furthermore, the plot.sgee function can be used to examine the path of coefficient estimates versus the iteration number, or some desired penalty.
Value
Object of class 'sgee' containing the path of coefficient estimates, the path of scale estimates, the path of correlation parameter estimates, the iteration at which HiSEE terminated, and initial regression values including x, y, family, clusterID, groupID, offset, epsilon, and numIt.
Note
Function to execute HiSEE Technique. Functionally equivalent to SEE when all elements in groupID are unique.
Author(s)
Gregory Vaughan
References
Vaughan, G., Aseltine, R., Chen, K., Yan, J. (2017). Stagewise Generalized Estimating Equations with Grouped Variables. Biometrics 73, 1332–1342. URL: http://dx.doi.org/10.1111/biom.12669, doi:10.1111/biom.12669.
Wolfson, J. (2011). EEBoost: A general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association 106, 296–305.
Tibshirani, R. J. (2015). A general framework for fast stagewise algorithms. Journal of Machine Learning Research 16, 2543–2588.
Examples
#####################
## Generate test data
#####################
## Initialize covariate values
p <- 50
beta <- c(rep(2, 5),
          c(1, 0, 1.5, 0, .5),
          rep(0.5, 5),
          rep(0, p - 15))
groupSize <- 5
numGroups <- length(beta)/groupSize
generatedData <- genData(numClusters = 50,
                         clusterSize = 4,
                         clusterRho = 0.6,
                         clusterCorstr = "exchangeable",
                         yVariance = 1,
                         xVariance = 1,
                         numGroups = numGroups,
                         groupSize = groupSize,
                         groupRho = 0.3,
                         beta = beta,
                         family = gaussian(),
                         intercept = 1)
## Perform Fitting by providing y and x values
coefMat1 <- hisee(y = generatedData$y, x = generatedData$x,
                  family = gaussian(),
                  clusterID = generatedData$clusterID,
                  groupID = generatedData$groupID,
                  corstr = "exchangeable",
                  control = sgee.control(maxIt = 50, epsilon = 0.5))
## Perform Fitting by providing formula and data
genDF <- data.frame(generatedData$y, generatedData$x)
names(genDF) <- c("Y", paste0("Cov", 1:p))
coefMat2 <- hisee(formula(genDF), data = genDF,
                  family = gaussian(),
                  subset = Y < 1,
                  waves = rep(1:4, 50),
                  clusterID = generatedData$clusterID,
                  groupID = generatedData$groupID,
                  corstr = "exchangeable",
                  control = sgee.control(maxIt = 50, epsilon = 0.5))
par(mfrow = c(2,1))
plot(coefMat1)
plot(coefMat2)