R: Reconstruction Set Test (RESET)

reset {RESET}

R Documentation

Reconstruction Set Test (RESET)

Description

Implementation of the Reconstruction Set Test (RESET) method, which transforms an n-by-p input matrix X into an n-by-m matrix of sample-level variable set scores and a length m vector of overall variable set scores. Execution of RESET involves the following sequence of steps:

If center.X=TRUE, mean center the columns of X. If X.test is specified, the centering is instead performed on just the columns of X corresponding to each variable set. See documentation for the X and center.X parameters for more details.
If scale.X=TRUE, scale the columns of X to have variance 1. If X.test is specified, the scaling is instead performed on just the columns of X corresponding to each variable set. See documentation for the X and scale.X parameters for more details.
If center.X.test=TRUE, mean center the columns of X.test. See documentation for the X.test and center.X.test parameters for more details.
If scale.X.test=TRUE, scale the columns of X.test. See documentation for the X.test and scale.X.test parameters for more details.
Set the reconstruction target matrix T to X or, if X.test is specified, to X.test.
Compute the norm of T and norm of each row of T. By default, these are the Frobenius and Euclidean norms respectively.
For each set in var.sets, sample-level and matrix level scores are generated as follows:
- Create a subset of X called X.var.set that only includes the columns of X corresponding to the variables in the set.
- Compute a rank k orthonormal basis Q for the column space of X.var.set. If the size of the set is less than or equal to random.threshold, this is computed as the top k columns of the Q matrix from a column-pivoted QR decomposition of X.var.set; otherwise, it is approximated using a randomized algorithm implemented by randomColumnSpace.
- The reduced rank reconstruction of T is then created as Q Q^T T.
- The original T is subtracted from the reconstruction to represent the reconstruction error and the appropriate norm is computed on each row and the entire error matrix.
- The overall score is the log2 ratio of the norm of the original T to the norm of the reconstruction error matrix.
- The score for each sample is the log2 ratio of the norm of the corresponding row of the original T to the norm of the same row of the reconstruction error matrix.
- If per.var=TRUE, then the overall and sample-level scores are divided by the scaled variable set size. See documentation for the per.var parameter for more details.
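
The steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package implementation: `reset.sketch` is a hypothetical helper, it covers only the case with no `X.test`, mean centering without scaling, the non-randomized basis and the default norms, and its per-variable scaling follows the description of the `per.var` parameter.

```r
# Hypothetical helper: a base R sketch of the documented steps,
# not the code used by the RESET package.
reset.sketch <- function(X, var.sets, k = 2, per.var = FALSE) {
  # Mean center X and use it as the reconstruction target T (no X.test)
  T.mat <- scale(X, center = TRUE, scale = FALSE)
  # Frobenius norm of T and Euclidean norm of each row of T
  T.norm <- norm(T.mat, type = "F")
  T.row.norms <- sqrt(rowSums(T.mat^2))
  S <- matrix(NA_real_, nrow = nrow(X), ncol = length(var.sets),
              dimnames = list(rownames(X), names(var.sets)))
  v <- setNames(numeric(length(var.sets)), names(var.sets))
  for (j in seq_along(var.sets)) {
    X.var.set <- T.mat[, var.sets[[j]], drop = FALSE]
    # Column-pivoted QR (LAPACK = TRUE); keep the first k columns of Q
    Q <- qr.Q(qr(X.var.set, LAPACK = TRUE))[, seq_len(k), drop = FALSE]
    # Reconstruction error: Q Q^T T minus the original T
    err <- Q %*% crossprod(Q, T.mat) - T.mat
    # log2 ratios of original norms to reconstruction error norms
    v[j] <- log2(T.norm / norm(err, type = "F"))
    S[, j] <- log2(T.row.norms / sqrt(rowSums(err^2)))
  }
  if (per.var) {
    # Divide by set sizes scaled by the mean set size
    scaled.sizes <- lengths(var.sets) / mean(lengths(var.sets))
    v <- v / scaled.sizes
    S <- sweep(S, 2, scaled.sizes, "/")
  }
  list(S = S, v = v)
}

set.seed(1)
X <- matrix(rpois(10000, lambda = 1), nrow = 100)
X[1:10, 1:10] <- rpois(100, lambda = 5)
out <- reset.sketch(X, list(set1 = 1:10, set2 = 11:20), k = 2)
dim(out$S)   # 100 2
```

Because Q Q^T is an orthogonal projection, the Frobenius norm of the error matrix cannot exceed that of T, so the overall scores in this sketch are non-negative. The same bound does not hold row by row, so sample-level scores can be negative.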

Usage

reset(X, X.test, center.X=TRUE, scale.X=FALSE, center.X.test=TRUE, scale.X.test=FALSE, 
      var.sets, k=2, random.threshold, k.buff=0, q=0, test.dist="normal", norm.type="2",
      per.var=FALSE)

Arguments

`X`	The n-by-p target matrix; columns represent variables and rows represent samples.
`X.test`	Matrix that will be combined with the `var.sets` variables to compute the reduced rank reconstruction. This is typically a subset or transformation of `X`, e.g., the projection onto the top PCs. Reconstruction error will be measured on the variables in `X.test`. If not specified, the entire `X` matrix will be used for calculating reconstruction error.
`center.X`	Flag which controls whether the values in `X` are mean centered during execution of the algorithm. If only `X` is specified and `center.X=TRUE`, then all columns in `X` will be centered. If both `X` and `X.test` are specified, then centering is performed on just the columns of `X` contained in the specified variable sets. Mean centering is especially important for accurate performance when `X.test` is specified as a reduced rank representation of `X`, e.g., as the projection of `X` onto the top principal components. However, mean centering the entire matrix `X` can dramatically increase memory requirements if `X` is a large sparse matrix. In this case, a non-centered `X` and an appropriate `X.test` (e.g., the projection onto the top PCs of `X`) can be provided, and mean centering is then performed on just the needed variables during execution of RESET. This "just-in-time" centering is enabled by setting `center.X=TRUE` and providing both `X` and `X.test`. If `X` has already been mean centered (and `X.test` is a subset of this mean-centered matrix or computed using it), then `center.X` should be specified as FALSE.
`scale.X`	Flag which controls whether the values in `X` are scaled to have variance 1 during execution of the algorithm. Defaults to FALSE. If only `X` is specified and `scale.X=TRUE`, then all columns in `X` will be scaled. If both `X` and `X.test` are specified, then scaling is performed on just the columns of `X` contained in the specified variable sets.
`center.X.test`	Flag which controls whether the values in `X.test`, if specified, are mean centered during execution of the algorithm. Centering should be performed consistently for `X` and `X.test`, i.e., if `center.X` is true or `X` was previously centered, then `center.X.test` should be true unless `X.test` was previously centered or generated from a centered `X`.
`scale.X.test`	Flag which controls whether the values in `X.test`, if specified, are scaled to have variance 1 during execution of the algorithm. Similar to centering, scaling should be performed consistently for `X` and `X.test`, i.e., if `scale.X` is true or `X` was previously scaled, then `scale.X.test` should be true unless `X.test` was previously scaled or generated from a scaled `X`.
`var.sets`	List of m variable sets, each element is a vector of indices of variables in the set that correspond to columns in `X`. If variable set information is instead available in terms of variable names, the appropriate format can be generated using `createVarSetCollection`.
`k`	Rank of reconstruction. Defaults to 2. Cannot be larger than the minimum variable set size.
`random.threshold`	If specified, indicates the variable set size above which a randomized reduced-rank reconstruction is used. If the variable set size is less than or equal to `random.threshold`, then a non-random reconstruction is computed. Defaults to `k` and cannot be less than `k`.
`k.buff`	Additional dimensions used in the randomized reduced-rank reconstruction algorithm. Defaults to 0. Values above 0 can improve the accuracy of the randomized reconstruction at the expense of additional computational complexity. If `k.buff=0`, then the reduced rank reconstruction can be generated directly from the output of `randomColumnSpace`; otherwise, a reduced rank SVD must also be computed, with the reconstruction based on the top `k` components.
`q`	Number of power iterations for randomized SVD (see `randomSVD`). Defaults to 0. Although power iterations can improve randomized SVD performance in general, they can decrease the sensitivity of the RESET method to detect mean or covariance differences.
`test.dist`	Distribution for non-zero elements of random test matrix used in randomized SVD algorithm. See description for `test.dist` parameter of `randomSVD` method.
`norm.type`	The type of norm to use for computing reconstruction error. Defaults to "2" for the Euclidean/Frobenius norm. The other supported option is "1" for the L1 norm.
`per.var`	If true, the computed scores for each variable set are divided by the scaled variable set size to generate per-variable scores. Variable set size scaling is performed by dividing all sizes by the mean size (this will generate per-variable scores of approximately the same magnitude as the non-per-variable scores).
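
The roles of `k`, `k.buff` and `q` in a randomized basis computation can be illustrated with a generic randomized range finder. This is a minimal sketch of the general technique and is not the `randomColumnSpace` or `randomSVD` implementation: `random.basis.sketch` is a hypothetical helper, and the Gaussian test matrix is an assumption corresponding to the "normal" choice of `test.dist`.

```r
# Hypothetical helper: generic randomized range finder, shown only to
# illustrate k, k.buff and q; not the randomColumnSpace implementation.
random.basis.sketch <- function(A, k = 2, k.buff = 0, q = 0) {
  l <- k + k.buff
  # Gaussian random test matrix with k + k.buff columns
  Omega <- matrix(rnorm(ncol(A) * l), nrow = ncol(A), ncol = l)
  # Sample the column space of A
  Y <- A %*% Omega
  # Optional power iterations (no re-orthonormalization, for brevity)
  for (i in seq_len(q)) {
    Y <- A %*% crossprod(A, Y)
  }
  # Orthonormal basis for the sampled column space
  Q <- qr.Q(qr(Y))
  if (k.buff > 0) {
    # With extra dimensions, reduce back to rank k using the top k
    # left singular vectors of the small projected matrix Q^T A
    U <- svd(crossprod(Q, A), nu = k, nv = 0)$u
    Q <- Q %*% U
  }
  Q
}

set.seed(1)
A <- matrix(rnorm(50 * 20), nrow = 50)
Q <- random.basis.sketch(A, k = 2, k.buff = 2)
dim(Q)   # 50 2
```

In this sketch a non-zero `k.buff` forces the additional SVD step, which matches the trade-off described for the `k.buff` parameter: better accuracy at a higher computational cost.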

Value

A list with the following elements:

`S`	An n-by-m matrix of sample-level variable set scores.
`v`	A length m vector of overall variable set scores.
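
As a short illustration of how the returned list might be inspected, the sketch below ranks samples by their score for the first variable set. It assumes the RESET package is installed (the code is skipped otherwise) and uses only the `reset` arguments and the `S` and `v` elements documented on this page; the simulated data mirrors the Examples section.

```r
# Sketch only: runs when the RESET package is available.
if (requireNamespace("RESET", quietly = TRUE)) {
  set.seed(1)
  var.sets <- list(set1 = 1:10, set2 = 11:20)
  X <- matrix(rpois(10000, lambda = 1), nrow = 100)
  # Inflate the first 10 samples for the variables in set1
  X[1:10, 1:10] <- rpois(100, lambda = 5)
  res <- RESET::reset(X, var.sets = var.sets, k = 2, random.threshold = 10)
  # One row per sample, one column per variable set
  print(dim(res$S))
  # One overall score per variable set
  print(res$v)
  # Indices of the 10 samples with the highest scores for set1
  print(head(order(res$S[, 1], decreasing = TRUE), 10))
}
```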

Examples

  # Create a collection of 5 variable sets each of size 10
  var.sets = list(set1=1:10, 
                  set2=11:20,
                  set3=21:30,
                  set4=31:40,
                  set5=41:50)                  

  # Simulate a 100-by-100 matrix of random Poisson data
  X = matrix(rpois(10000, lambda=1), nrow=100)

  # Inflate first 10 rows for first 10 variables, i.e., the first
  # 10 samples should have elevated scores for the first variable set
  X[1:10,1:10] = rpois(100, lambda=5)

  # Execute RESET using non-randomized basis computation
  reset(X, var.sets=var.sets, k=2, random.threshold=10)

  # Execute RESET with randomized basis computation
  # (random.threshold will default to the k value, which is less
  # than the size of all variable sets)
  reset(X, var.sets=var.sets, k=2)

  # Execute RESET with non-zero k.buff
  reset(X, var.sets=var.sets, k=2, k.buff=2)
  
  # Execute RESET with non-zero q
  reset(X, var.sets=var.sets, k=2, q=1)

  # Execute RESET with the L1 norm instead of the default L2 norm
  reset(X, var.sets=var.sets, k=2, norm.type="1")

  # Project the X matrix onto the first 5 PCs and use that as X.test.
  # Center and scale X before calling prcomp() so that no centering
  # or scaling is needed within reset()
  X = scale(X)
  X.test = prcomp(X,center=FALSE,scale=FALSE,retx=TRUE)$x[,1:5]
  reset(X, X.test=X.test, center.X=FALSE, scale.X=FALSE, 
    center.X.test=FALSE, scale.X.test=FALSE, var.sets=var.sets, k=2)