R: Random Forest with Canonical Correlation Analysis

rfcca {RFCCA}

R Documentation

Random Forest with Canonical Correlation Analysis

Description

Estimates the canonical correlations between two sets of variables depending on the subject-related covariates.

Usage

rfcca(
  X,
  Y,
  Z,
  ntree = 200,
  mtry = NULL,
  nodesize = NULL,
  nodedepth = NULL,
  nsplit = 10,
  importance = FALSE,
  finalcca = c("cca", "scca", "rcca"),
  bootstrap = TRUE,
  samptype = c("swor", "swr"),
  sampsize = if (samptype == "swor") function(x) {
      x * 0.632
  } else function(x) {
      x
  },
  forest = TRUE,
  membership = FALSE,
  bop = TRUE,
  Xcenter = TRUE,
  Ycenter = TRUE,
  ...
)

Arguments

`X`	The first multivariate data set which has `n` observations and `px` variables. A data.frame of numeric values.
`Y`	The second multivariate data set which has `n` observations and `py` variables. A data.frame of numeric values.
`Z`	The set of subject-related covariates which has `n` observations and `pz` variables. Used in random forest growing. A data.frame with numeric values and factors.
`ntree`	Number of trees.
`mtry`	Number of z-variables randomly selected as candidates for splitting a node. The default is `pz/3` where `pz` is the number of z variables. Values are always rounded up.
`nodesize`	Forest average number of unique data points in a terminal node. The default is `3 * (px+py)`, where `px` and `py` are the number of x and y variables, respectively.
`nodedepth`	Maximum depth to which a tree should be grown. In the default, this parameter is ignored.
`nsplit`	Non-negative integer value for the number of random splits to consider for each candidate splitting variable. When zero or `NULL`, all possible splits are considered.
`importance`	Should variable importance of z-variables be assessed? The default is `FALSE`.
`finalcca`	Which CCA should be used for final canonical correlation estimation? Choices are `cca`, `scca` and `rcca`, see below for details. The default is `cca`.
`bootstrap`	Should the data be bootstrapped? The default is `TRUE`, which draws a sample for each tree according to `samptype` (sampling without replacement by default). If `FALSE`, the data are not bootstrapped, and OOB predictions and variable importance measures cannot be returned.
`samptype`	Type of bootstrap. Choices are `swor` (sampling without replacement/sub-sampling) and `swr` (sampling with replacement/bootstrapping). The default action here (as in `randomForestSRC`) is sampling without replacement.
`sampsize`	Size of sample to draw. For sampling without replacement, by default it is .632 times the sample size. For sampling with replacement, it is the sample size.
`forest`	Should the forest object be returned? It is used for prediction on new data. The default is `TRUE`.
`membership`	Should terminal node membership and inbag information be returned? The default is `FALSE`.
`bop`	Should the Bag of Observations for Prediction (BOP) for training observations be returned? The default is `TRUE`.
`Xcenter`	Should the columns of X be centered? The default is `TRUE`.
`Ycenter`	Should the columns of Y be centered? The default is `TRUE`.
`...`	Optional arguments to be passed to other methods.
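
The documented defaults for `mtry`, `nodesize` and `sampsize` can be checked by hand. The sketch below only restates the rules given above as base R arithmetic; it is not the package's internal code, and the dimensions `n`, `px`, `py` and `pz` are hypothetical values chosen for illustration.

```r
## Hypothetical dimensions, chosen only for illustration.
n  <- 300   # observations
px <- 2     # x-variables
py <- 3     # y-variables
pz <- 7     # z-variables

## Documented defaults, restated as arithmetic.
mtry.default     <- ceiling(pz / 3)   # pz/3, always rounded up
nodesize.default <- 3 * (px + py)     # 3 * (px+py)
sampsize.swor    <- n * 0.632         # samptype = "swor"
sampsize.swr     <- n                 # samptype = "swr"
```

With these values, 3 of the 7 z-variables are candidates at each split and the target terminal node size is 15.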

Value

An object of class `(rfcca,grow)`, which is a list with the following components:

`call`	The original call to `rfcca`.
`n`	Sample size of the data (`NA`'s are omitted).
`ntree`	Number of trees grown.
`mtry`	Number of variables randomly selected for splitting at each node.
`nodesize`	Minimum forest average number of unique data points in a terminal node.
`nodedepth`	Maximum depth to which a tree is allowed to be grown.
`nsplit`	Number of randomly selected split points.
`xvar`	Data frame of x-variables.
`xvar.names`	A character vector of the x-variable names.
`yvar`	Data frame of y-variables.
`yvar.names`	A character vector of the y-variable names.
`zvar`	Data frame of z-variables.
`zvar.names`	A character vector of the z-variable names.
`leaf.count`	Number of terminal nodes for each tree in the forest. Vector of length `ntree`.
`bootstrap`	Was the data bootstrapped?
`forest`	If `forest=TRUE`, the `rfcca` forest object is returned. This object is used for prediction with new data.
`membership`	A matrix recording terminal node membership where each cell represents the node number that an observation falls in for that tree.
`importance`	Variable importance measures (VIMP) for each z-variable.
`inbag`	A matrix recording inbag membership where each cell represents whether the observation is in the bootstrap sample in the corresponding tree.
`predicted.oob`	OOB predicted canonical correlations for training observations based on the selected final canonical correlation estimation method.
`predicted.coef`	Predicted canonical weight vectors for x- and y- variables.
`bop`	If `bop=TRUE`, a list containing BOP for each training observation is returned.
`finalcca`	The selected CCA used for final canonical correlation estimations.
`rfsrc.grow`	An object of class `(rfsrc,grow)` is returned. This object is used for prediction with training or new data.

Details

Final canonical correlation estimation: The final canonical correlation can be computed with CCA (Hotelling, 1936), Sparse CCA (Witten et al., 2009) or Regularized CCA (Vinod, 1976; Leurgans et al., 1993). If Regularized CCA is used, the regularization parameters \lambda_1 and \lambda_2 should be specified.
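
To make the role of \lambda_1 and \lambda_2 concrete, the following base R sketch compares classical CCA with a ridge-type regularized version on simulated data. It is an illustration under stated assumptions, not the estimator code used inside `rfcca`: the data are simulated, `first_rcc` is a hypothetical helper, and the penalties are assumed to be added to the diagonals of the X and Y covariance matrices.

```r
## Simulated data for illustration only (not from the RFCCA package).
set.seed(1)
n <- 200
latent <- rnorm(n)
X <- cbind(x1 = latent + rnorm(n), x2 = rnorm(n))
Y <- cbind(y1 = latent + rnorm(n), y2 = rnorm(n), y3 = rnorm(n))

## Classical CCA with base R; columns are centered first.
cca.fit <- cancor(X, Y, xcenter = TRUE, ycenter = TRUE)
rho.cca <- cca.fit$cor[1]

## Hypothetical helper: first canonical correlation when lambda1 and lambda2
## are added to the diagonals of the X and Y covariance matrices.
first_rcc <- function(X, Y, lambda1 = 0, lambda2 = 0) {
  Sxx <- cov(X) + lambda1 * diag(ncol(X))
  Syy <- cov(Y) + lambda2 * diag(ncol(Y))
  Sxy <- cov(X, Y)
  ## Squared canonical correlations are the eigenvalues of
  ## Sxx^{-1} Sxy Syy^{-1} Syx.
  M <- solve(Sxx, Sxy) %*% solve(Syy, t(Sxy))
  sqrt(max(Re(eigen(M, only.values = TRUE)$values)))
}

rho.rcca0 <- first_rcc(X, Y)             # no penalty: agrees with cancor
rho.rcca  <- first_rcc(X, Y, 0.5, 0.5)   # penalized: no larger than rho.rcca0
```

With both penalties at zero the helper reproduces the first canonical correlation from `cancor`; positive penalties shrink the estimate, which is why their values have to be chosen when `finalcca = "rcca"`.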

References

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Leurgans, S. E., Moyeed, R. A., & Silverman, B. W. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 55(3), 725–740.

Vinod, H. D. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.

Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.

Examples


## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## define train/test split
smp <- sample(seq_len(nrow(data$X)), size = round(nrow(data$X) * 0.7),
  replace = FALSE)
train.data <- lapply(data, function(x) {x[smp, ]})
test.Z <- data$Z[-smp, ]

## train rfcca
rfcca.obj <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, importance = TRUE)

## print the grow object
print(rfcca.obj)

## get the OOB predictions
pred.oob <- rfcca.obj$predicted.oob

## predict with new test data
pred.obj <- predict(rfcca.obj, newdata = test.Z)
pred <- pred.obj$predicted

## get the variable importance measures
z.vimp <- rfcca.obj$importance

## train rfcca and estimate the final canonical correlations with "scca"
rfcca.obj2 <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, finalcca = "scca")

[Package RFCCA version 2.0.0 Index]