tunecpfa {cpfa} | R Documentation |
Tuning for Classification with Parallel Factor Analysis
Description
Fits Richard A. Harshman's Parallel Factor Analysis-1 (Parafac) model or Parallel Factor Analysis-2 (Parafac2) model to a three-way or four-way data array. Allows for multiple constraint options on tensor modes. Uses component weights from a single mode of the model as predictors to tune parameters for one or more classification methods via a k-fold cross-validation procedure. Supports binary and multiclass classification.
Usage
tunecpfa(x, y, model = c("parafac", "parafac2"), nfac = 1, nfolds = 10,
method = c("PLR", "SVM", "RF", "NN", "RDA", "GBM"),
family = c("binomial", "multinomial"), parameters = list(),
foldid = NULL, prior = NULL, cmode = NULL, parallel = FALSE,
cl = NULL, verbose = TRUE, ...)
Arguments
x |
For Parafac or Parafac2, a three-way or four-way data array. For Parafac2, can be a list of length |
y |
A vector containing at least two unique class labels. Should be a factor that contains two or more levels . For binary case, ensure the order of factor levels (left to right) is such that negative class is first and positive class is second. |
model |
Character designating the Parafac model to use, either |
nfac |
Number of components for each Parafac or Parafac2 model to fit. Default is |
nfolds |
Numeric setting number of folds for k-fold cross-validation. Must be 2 or greater. Default is |
method |
Character vector indicating classification methods to use. Possible methods include penalized logistic regression (PLR); support vector machine (SVM); random forest (RF); feed-forward neural network (NN); regularized discriminant analysis (RDA); and gradient boosting machine (GBM). If none selected, default is to use all methods. |
family |
Character value specifying binary classification ( |
parameters |
List containing arguments related to classification methods. When specified, must contain one or more of the following:
|
foldid |
Vector containing fold IDs for k-fold cross-validation. Can be of class integer, numeric, or data frame. Should contain integers from 1 through the number of folds. If not provided, fold IDs are generated randomly for observations using 1 through the number of folds |
prior |
Prior probabilities of class membership. If unspecified, the class proportions for input |
cmode |
Integer value of 1, 2, or 3 (or 4 if |
parallel |
Logical indicating if parallel computing should be implemented. If TRUE, the package parallel is used for parallel computing. For all classification methods except penalized logistic regression, the doParallel package is used as a wrapper. Defaults to FALSE, which implements sequential computing. |
cl |
Cluster for parallel computing, which is used when |
verbose |
If TRUE, progress is printed. |
... |
Additional arguments to be passed to function |
Details
After fitting a Parafac or Parafac2 model with package multiway (see parafac
or parafac2
in multiway for details), the estimated classification mode weight matrix is passed to one or several of six classification methods–including penalized logistic regression (PLR); support vector machine (SVM); random forest (RF); feed-forward neural network (NN); regularized discriminant analysis (RDA); and gradient boosting machine (GBM).
Package glmnet fits models for PLR. PLR tunes penalty parameter lambda while the elastic net parameter alpha is set by the user (see the help file for function cv.glmnet
in package glmnet). For SVM, package e1071 is used with a radial basis kernel. Penalty parameter cost and radial basis parameter gamma are used (see svm
in package e1071). For RF, package randomForest is used and implements Breiman's random forest algorithm. The number of predictors sampled at each node split is set at the default of sqrt(R), where R is the number of Parafac or Parafac2 components. Two tuning parameters allowed are ntree, the number of trees to be grown, and nodesize, the minimum size of terminal nodes (see randomForest
in package randomForest). For NN, package nnet fits a single-hidden-layer, feed-forward neural network model. Penalty parameters size (i.e., number of hidden layer units) and decay (i.e., weight decay) are used (see nnet). For RDA, package rda fits a shrunken centroids regularized discriminant analysis model. Tuning parameters include rda.alpha, the shrinkage penalty for the within-class covariance matrix, and delta, the shrinkage penalty of class centroids towards the overall dataset centroid. For GBM, package xgboost fits a gradient boosting machine model. Four tuning parameters are allowed: (1) eta, the learning rate; (2) max.depth, the maximum tree depth; (3) subsample, the fraction of samples per tree; and (4) nrounds, the number of boosting trees to build.
For all six methods, k-fold cross-validation is implemented to tune classification parameters where the number of folds is set by argument nfolds
.
Value
Returns an object of class tunecpfa
with the following elements:
opt.model |
List containing optimal model for tuned classification methods for each Parafac or Parafac2 model that was fit. |
opt.param |
Data frame containing optimal parameters for tuned classification methods. |
kcv.error |
Data frame containing KCV misclassification error for optimal parameters for tuned classification methods. |
est.time |
Data frame containing times for fitting Parafac or Parafac2 model and for tuning classification methods. |
method |
Numeric indicating classification methods used. Value of '1' indicates 'PLR'; value of '2' indicates 'SVM'; value of '3' indicates 'RF'; value of '4' indicates 'NN'; value of '5' indicates 'RDA'; and value of '6' indicates 'GBM'. |
x |
Three-way or four-way array used. If a list was used with |
y |
Factor containing class labels used. Note that output |
Aweights |
List containing estimated A weights for each Parafac or Parafac2 model that was fit. |
Bweights |
List containing estimated B weights for each Parafac or Parafac2 model that was fit. |
Cweights |
List containing estimated C weights for each Parafac or Parafac2 model that was fit. Null if inputted argument |
Phi |
If |
const |
Constraints used in fitting Parafac or Parafac2 models. If argument |
cmode |
Integer value of 1, 2, or 3 (or 4 if |
family |
Character value specifying whether classification was binary ( |
xdim |
Numeric value specifying number of levels for each mode of input |
lxdim |
Numeric value specifying number of modes of input |
train.weights |
List containing classification component weights for each fit Parafac or Parafac2 model, for possibly different numbers of components. The weights used to train classifiers. |
Note
For fitting the Parafac model, if argument cmode
is not null, input array x
is reshaped with function aperm
such that the cmode
dimension of x
is ordered last. Estimated mode A and B (and mode C for a four-way array) weights that are outputted as Aweights
and Bweights
(and Cweights
) reflect this permutation. For example, if x
is a four-way array and cmode = 2
, the original input modes 1, 2, 3, and 4 will correspond to output modes 1, 3, 4, 2. Here, output A = input 1; B = 3, and C = 4 (i.e., the second mode specified by cmode
has been moved to the D mode/last mode). For model = "parafac2"
, classification mode is assumed to be the last mode (i.e., mode C for three-way array and mode D for four-way array).
In addition, note that the following combination of arguments will give an error: nfac = 1, family = "multinomial", method = "PLR"
. The issue arises from providing glmnet::cv.glmnet
input x
a matrix with a single column. The issue is resolved for family = "binomial"
because a column of 0s is appended to the single column, but this solution does not appear to work for the multiclass case. As such, this combination of arguments is not currently allowed. This issue will be resolved in a future update.
Author(s)
Matthew A. Snodgress <snodg031@umn.edu>
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y., Yuan, J. (2024). xgboost: Extreme gradient boosting. R Package Version 1.7.7.1.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405), 165-175.
Friedman, J. Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Guo, Y., Hastie, T., and Tibshirani, R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1), 86-100.
Guo Y., Hastie T., and Tibshirani, R. (2023). rda: Shrunken centroids regularized discriminant analysis. R Package Version 1.2-1.
Harshman, R. (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16, 1-84.
Harshman, R. (1972). PARAFAC2: Mathematical and technical notes. UCLA Working Papers in Phonetics, 22, 30-44.
Harshman, R. and Lundy, M. (1994). PARAFAC: Parallel factor analysis. Computational Statistics and Data Analysis, 18, 39-72.
Helwig, N. (2017). Estimating latent trends in multivariate longitudinal data via Parafac2 with functional and structural constraints. Biometrical Journal, 59(4), 783-803.
Helwig, N. (2019). multiway: Component models for multi-way data. R Package Version 1.0-6.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News 2(3), 18–22.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2023). e1071: Misc functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R Package Version 1.7-13.
Ripley, B. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society: Series B (Methodological), 56(3), 409-437.
Venables, W. and Ripley, B. (2002). Modern applied statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Examples
########## Parafac example with 3-way array and binary response ##########
# set seed and specify dimensions of a three-way tensor
set.seed(3)
mydim <- c(10, 11, 80)
nf <- 3
# create correlation matrix between response and third mode's weights
rho.cc <- .35
rho.cy <- .75
cormat.values <- c(1, rho.cc, rho.cc, rho.cy, rho.cc, 1, rho.cc, rho.cy,
rho.cc, rho.cc, 1, rho.cy, rho.cy, rho.cy, rho.cy, 1)
cormat <- matrix(cormat.values, nrow = (nf + 1), ncol = (nf + 1))
# sample from a multivariate normal with specified correlation structure
ymean <- Cmean <- 2
mu <- as.matrix(c(Cmean, Cmean, Cmean, ymean))
eidecomp <- eigen(cormat, symmetric = TRUE)
L.sqrt <- diag(eidecomp$values^0.5)
cormat.sqrt <- eidecomp$vectors %*% L.sqrt %*% t(eidecomp$vectors)
Z <- matrix(rnorm(mydim[3] * (nf + 1)), nrow = mydim[3], ncol = (nf + 1))
Xw <- rep(1, mydim[3]) %*% t(mu) + Z %*% cormat.sqrt
Cmat <- Xw[, 1:nf]
# create a random three-way data tensor with C weights related to a response
Amat <- matrix(rnorm(mydim[1] * nf), nrow = mydim[1], ncol = nf)
Bmat <- matrix(runif(mydim[2] * nf), nrow = mydim[2], ncol = nf)
Xmat <- tcrossprod(Amat, krprod(Cmat, Bmat))
Xmat <- array(Xmat, dim = mydim)
Emat <- array(rnorm(prod(mydim)), dim = mydim)
Emat <- nscale(Emat, 0, ssnew = sumsq(Xmat))
X <- Xmat + Emat
# create a binary response by dichotomizing at the specified response mean
y <- factor(as.numeric(Xw[ , (nf + 1)] > ymean))
# initialize
alpha <- seq(0, 1, length = 2)
gamma <- c(0, 0.01)
cost <- c(1, 2)
ntree <- c(100, 200)
nodesize <- c(1, 2)
size <- c(1, 2)
decay <- c(0, 1)
rda.alpha <- c(0.1, 0.6)
delta <- c(0.1, 2)
eta <- c(0.3, 0.7)
max.depth <- c(1, 2)
subsample <- c(0.75)
nrounds <- c(100)
method <- c("PLR", "SVM", "RF", "NN", "RDA", "GBM")
family <- "binomial"
parameters <- list(alpha = alpha, gamma = gamma, cost = cost, ntree = ntree,
nodesize = nodesize, size = size, decay = decay,
rda.alpha = rda.alpha, delta = delta, eta = eta,
max.depth = max.depth, subsample = subsample,
nrounds = nrounds)
model <- "parafac"
nfolds <- 3
nstart <- 3
# constrain first mode weights to be orthogonal
const <- c("orthog", "uncons", "uncons")
# fit Parafac models and use third mode to tune classification methods
tune.object <- tunecpfa(x = X, y = y, model = model, nfac = nf,
nfolds = nfolds, method = method, family = family,
parameters = parameters, parallel = FALSE,
const = const, nstart = nstart)
# print tuning object
tune.object