cv_gspcr {gspcr} | R Documentation
Cross-validation of Generalized Principal Component Regression
Description
Use K-fold cross-validation to decide on the number of principal components and the threshold value for GSPCR.
Usage
cv_gspcr(
  dv,
  ivs,
  fam = c("gaussian", "binomial", "poisson", "baseline", "cumulative")[1],
  thrs = c("LLS", "PR2", "normalized")[1],
  nthrs = 10L,
  npcs_range = 1L:3L,
  K = 5,
  fit_measure = c("F", "LRT", "AIC", "BIC", "PR2", "MSE")[1],
  max_features = ncol(ivs),
  min_features = 1,
  oneSE = TRUE,
  save_call = TRUE
)
Arguments
dv |
numeric vector or factor of dependent variable values |
ivs |
data set (e.g., a data.frame) of independent variables, with one column per predictor |
fam |
character vector of length 1 storing the description of the error distribution and link function to be used in the model |
thrs |
character vector of length 1 storing the type of threshold to be used (see below for available options) |
nthrs |
numeric vector of length 1 storing the number of threshold values to be used |
npcs_range |
numeric vector defining the numbers of principal components to be used |
K |
numeric vector of length 1 storing the number of folds for the K-fold cross-validation procedure |
fit_measure |
character vector of length 1 indicating the type of fit measure to be used in the cross-validation procedure |
max_features |
numeric vector of length 1 indicating the maximum number of features that can be selected |
min_features |
numeric vector of length 1 indicating the minimum number of features that should be selected |
oneSE |
logical value indicating whether the results with the 1se rule should be saved |
save_call |
logical value indicating whether the call should be saved and returned in the results |
Details
The variables in ivs do not need to be standardized beforehand, as the function handles scaling appropriately based on the measurement levels of the data.
The fam argument defines which model is used when regressing the dependent variable on the principal components:
- gaussian: fits a linear regression model (continuous dv)
- binomial: fits a logistic regression model (binary dv)
- poisson: fits a Poisson regression model (count dv)
- baseline: fits a baseline-category logit model (nominal dv, using nnet::multinom())
- cumulative: fits a proportional odds logistic regression (ordinal dv, using MASS::polr())
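The mapping from a family label to a regression on principal components can be sketched in base R. This is only an illustration under stated assumptions, not the gspcr implementation: fit_on_pcs() is a hypothetical helper, it standardizes all predictors with prcomp(), and it covers only the three families that base R's glm() handles directly.

```r
# Hypothetical helper (not part of gspcr): regress dv on the first `npcs`
# principal components of ivs, choosing the GLM family by its label.
fit_on_pcs <- function(dv, ivs, fam = "gaussian", npcs = 2) {
  # Principal component scores of the standardized predictors
  pcs <- prcomp(ivs, center = TRUE, scale. = TRUE)$x[, seq_len(npcs), drop = FALSE]
  dat <- data.frame(dv = dv, pcs)
  # Map the family label to a base-R family object
  fam_obj <- switch(fam,
    gaussian = gaussian(),
    binomial = binomial(),
    poisson  = poisson(),
    stop("this sketch only covers 'gaussian', 'binomial', and 'poisson'")
  )
  glm(dv ~ ., data = dat, family = fam_obj)
}

# Continuous dv: linear regression on two PCs
fit_lin <- fit_on_pcs(mtcars$mpg, mtcars[, -1], fam = "gaussian", npcs = 2)

# Count dv: Poisson regression on two PCs
fit_cnt <- fit_on_pcs(
  mtcars$carb,
  mtcars[, names(mtcars) != "carb"],
  fam = "poisson",
  npcs = 2
)
```

The nominal and ordinal cases would call nnet::multinom() and MASS::polr() instead of glm(); they are left out here to keep the sketch free of extra dependencies.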
The thrs argument defines the bivariate association-threshold measure used to determine the active set of predictors for an SPCR analysis.
The following association measures are supported (allowed measurement levels are reported in brackets):
- LLS: simple GLM regression likelihoods (any dv with any iv)
- PR2: the Cox and Snell generalized R-squared is computed for the GLMs between dv and every column in ivs. Then, the square root of these values is used to obtain the threshold values. For more information about the computation of the Cox and Snell R2, see the help file for cp_gR2(). When used for simple linear regressions (with continuous dv and ivs), this measure is equivalent to the regular R-squared. Therefore, its square root can be thought of as equivalent to the absolute bivariate correlations between dv and ivs. (any dv with any iv)
- normalized: normalized correlation based on superpc::superpc.cv() (continuous dv with continuous ivs)
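The equivalence described for PR2 can be checked numerically with a minimal base-R sketch. The helper cox_snell_r2() below is hypothetical and is not cp_gR2() from gspcr; it assumes a gaussian GLM with a single predictor. The cutoff of 0.7 and the rule of keeping predictors at or above it are also assumptions made only for illustration.

```r
# Hypothetical helper (not cp_gR2()): Cox and Snell generalized R-squared
# for a simple gaussian GLM of dv on a single predictor x.
cox_snell_r2 <- function(dv, x) {
  n <- length(dv)
  ll_null <- as.numeric(logLik(glm(dv ~ 1, family = gaussian())))
  ll_full <- as.numeric(logLik(glm(dv ~ x, family = gaussian())))
  1 - exp(-2 / n * (ll_full - ll_null))
}

dv  <- mtcars$mpg
ivs <- mtcars[, c("wt", "hp", "qsec", "drat")]

# Bivariate association of dv with every column of ivs
pr2   <- sapply(ivs, function(x) cox_snell_r2(dv, x))
assoc <- sqrt(pr2)

# With continuous dv and ivs, the square root matches the absolute correlations
same_as_cor <- isTRUE(all.equal(unname(assoc), unname(abs(cor(dv, ivs)[1, ]))))

# Assumed rule for illustration: keep predictors at or above the cutoff
active <- names(assoc)[assoc >= 0.7]
```

Raising the cutoff shrinks the active set, which is why a grid of threshold values yields a sequence of progressively smaller predictor sets to cross-validate.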
The fit_measure argument defines which fit measure is used within the cross-validation procedure.
The supported measures are:
- F: F-statistic computed with cp_F() (continuous dv)
- LRT: likelihood-ratio test statistic computed with cp_LRT() (any dv)
- AIC: Akaike's information criterion computed with cp_AIC() (any dv)
- BIC: Bayesian information criterion computed with cp_BIC() (any dv)
- PR2: Cox and Snell generalized R-squared computed with cp_gR2() (any dv)
- MSE: mean squared error computed with MLmetrics::MSE() (continuous dv)
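To make the role of a fit measure inside K-fold cross-validation concrete, here is a minimal base-R sketch that scores different numbers of principal components by validation mean squared error. It is an illustration under assumptions, not the internals of cv_gspcr(): cv_mse_by_npcs() is a hypothetical helper, no thresholding of predictors is performed, and the MSE is computed by hand rather than with MLmetrics::MSE().

```r
# Hypothetical helper (not part of gspcr): K-fold cross-validated MSE
# for each candidate number of principal components.
cv_mse_by_npcs <- function(dv, ivs, npcs_range = 1:3, K = 5) {
  n <- length(dv)
  # Randomly assign every observation to one of K folds
  fold <- sample(rep(seq_len(K), length.out = n))
  scores <- matrix(NA_real_, nrow = length(npcs_range), ncol = K)
  for (k in seq_len(K)) {
    train <- fold != k
    # Estimate the PCA on the training folds only, then project the validation fold
    pca <- prcomp(ivs[train, , drop = FALSE], center = TRUE, scale. = TRUE)
    pcs_train <- pca$x
    pcs_valid <- predict(pca, newdata = ivs[!train, , drop = FALSE])
    for (i in seq_along(npcs_range)) {
      q <- seq_len(npcs_range[i])
      fit <- lm(y ~ ., data = data.frame(y = dv[train], pcs_train[, q, drop = FALSE]))
      pred <- predict(fit, newdata = data.frame(pcs_valid[, q, drop = FALSE]))
      # Validation mean squared error for this fold and number of PCs
      scores[i, k] <- mean((dv[!train] - pred)^2)
    }
  }
  # Average the fold-specific scores for every number of PCs
  setNames(rowMeans(scores), paste0("npcs_", npcs_range))
}

set.seed(1)
cv_mse <- cv_mse_by_npcs(mtcars$mpg, mtcars[, -1], npcs_range = 1:3, K = 3)
cv_mse
```

For MSE, smaller averaged values indicate better out-of-sample fit, whereas a measure such as the F-statistic would be maximized instead.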
Details regarding the 1-standard-error rule implemented here can be found in the documentation for the function cv_choose().
Value
Object of class gspcr, which is a list containing:
- solution: a list containing the number of PCs that was selected (Q), the threshold value used, and the resulting active set for both the standard and oneSE solutions
- sol_table: data.frame reporting the threshold number, value, and the number of PCs identified by the procedure
- thr: vector of threshold values of the requested type used for the K-fold cross-validation procedure
- thr_cv: numeric vector of length 1 indicating the threshold number that was selected by the K-fold cross-validation procedure using the default decision rule
- thr_cv_1se: numeric vector of length 1 indicating the threshold number that was selected by the K-fold cross-validation procedure using the 1-standard-error rule
- Q_cv: numeric vector of length 1 indicating the number of PCs that was selected by the K-fold cross-validation procedure using the default decision rule
- Q_cv_1se: numeric vector of length 1 indicating the number of PCs that was selected by the K-fold cross-validation procedure using the 1-standard-error rule
- scor: npcs × nthrs matrix of fit-measure scores averaged across the K folds
- scor_lwr: npcs × nthrs matrix of fit-measure score lower bounds averaged across the K folds
- scor_upr: npcs × nthrs matrix of fit-measure score upper bounds averaged across the K folds
- pred_map: p × nthrs matrix of logical values indicating which predictors were active for every threshold value used
- gspcr_call: the function call
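As a rough illustration of how an npcs × nthrs score matrix relates to a selected number of PCs and threshold number, the sketch below locates the best cell of a toy matrix. The numbers are made up and are not output from cv_gspcr(), and the rule of picking the largest score is an assumption that suits measures such as the F-statistic; the decision rules actually applied are documented in cv_choose().

```r
# Toy score matrix for illustration only: rows are numbers of PCs (1, 2),
# columns are threshold numbers (1, 2, 3).
scor <- matrix(
  c(10, 14, 12,
    11, 15, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("1", "2", "3"))
)

# Assumption: larger scores indicate better fit
best <- which(scor == max(scor), arr.ind = TRUE)

# Translate the matrix position into a number of PCs and a threshold number
Q_best   <- as.integer(rownames(scor)[best[1, 1]])
thr_best <- as.integer(colnames(scor)[best[1, 2]])
```

For a measure where smaller is better, such as MSE, the same lookup would use min(scor) instead of max(scor).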
Author(s)
Edoardo Costantini, 2023
References
Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 119-137.
Examples
# Example input values
dv <- mtcars[, 1]
ivs <- mtcars[, -1]
thrs <- "PR2"
nthrs <- 5
fam <- "gaussian"
npcs_range <- 1:3
K <- 3
fit_measure <- "F"
max_features <- ncol(ivs)
min_features <- 1
oneSE <- TRUE
save_call <- TRUE
# Example usage
out_cont <- cv_gspcr(
  dv = GSPCRexdata$y$cont,
  ivs = GSPCRexdata$X$cont,
  fam = "gaussian",
  nthrs = 5,
  npcs_range = 1:3,
  K = 3,
  fit_measure = "F",
  thrs = "normalized",
  min_features = 1,
  max_features = ncol(GSPCRexdata$X$cont),
  oneSE = TRUE,
  save_call = TRUE
)