R: Sparse principal component analysis using ReFACTor

refactor {TCA}

R Documentation

Sparse principal component analysis using ReFACTor

Description

Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor can capture the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.

Usage

refactor(
  X,
  k,
  sparsity = 500,
  C = NULL,
  C.remove = FALSE,
  sd_threshold = 0.02,
  num_comp = NULL,
  rand_svd = FALSE,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)

Arguments

`X`	An `m` by `n` matrix of measurements of `m` features for `n` observations. Each column in `X` is assumed to be a mixture of `k` sources. Note that `X` must include row names and column names and that NA values are currently not supported. `X` should not include features that are constant across all observations.
`k`	A numeric value indicating the dimension of the signal in `X` (i.e. the number of sources).
`sparsity`	A numeric value indicating the sparsity of the signal in `X` (the number of signal rows).
`C`	An `n` by `p` design matrix of covariates that will be accounted for in the feature selection step. An intercept term will be included automatically. Note that `C` must include row names and column names and that NA values are currently not supported; set `C` to be `NULL` if there are no such covariates.
`C.remove`	A logical value indicating whether the covariates in `C` should be accounted for not only in the feature selection step, but also in the final calculation of the principal components (i.e. if `C.remove == TRUE` then the selected features will be adjusted for the covariates in `C` prior to calculating principal components). Set `C.remove` to `TRUE` when the ReFACTor components are intended to be used for correction in a downstream analysis; set it to `FALSE` when ReFACTor is used only for capturing the sparse signals in `X` (i.e. regardless of correction).
`sd_threshold`	A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in `X` (i.e. features with standard deviation lower than `sd_threshold` will be excluded). Set `sd_threshold` to `NULL` to turn off this filter. Note that removing features with very low variability tends to improve speed and performance.
`num_comp`	A numeric value indicating the number of ReFACTor components to return.
`rand_svd`	A logical value indicating whether to use randomized SVD for estimating the low-rank structure of the data in the first step of the algorithm; randomized SVD can result in a substantial speedup for large data.
`log_file`	A path to an output log file. Note that if the file `log_file` already exists then logs will be appended to the end of the file. Set `log_file` to `NULL` to prevent output from being saved into a file; note that if `verbose == FALSE` then no output file will be generated regardless of the value of `log_file`.
`debug`	A logical value indicating whether to set the logger to a more detailed debug level; set `debug` to `TRUE` before reporting issues.
`verbose`	A logical value indicating whether to print logs.
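
The requirements on `X` listed above (row and column names, no NA values, no constant features) can be checked before calling refactor. The following is a minimal base R sketch; the matrix `X_raw`, the feature and sample names, and the choice to drop rather than impute features with missing values are all illustrative assumptions, not part of the package.

```r
# Hypothetical raw matrix: 6 features (rows) by 5 observations (columns).
set.seed(1)
X_raw <- matrix(runif(30), nrow = 6, ncol = 5)
X_raw[2, ] <- 0.5   # a feature that is constant across all observations
X_raw[4, 3] <- NA   # a missing value

# Name rows (features) and columns (observations); the names are made up.
rownames(X_raw) <- paste0("feature", seq_len(nrow(X_raw)))
colnames(X_raw) <- paste0("sample", seq_len(ncol(X_raw)))

# Drop features with missing values (imputation would be an alternative).
X_clean <- X_raw[stats::complete.cases(X_raw), , drop = FALSE]

# Drop features that are constant across all observations.
row_sd <- apply(X_clean, 1, stats::sd)
X_clean <- X_clean[row_sd > 0, , drop = FALSE]
```

The same per-feature standard deviation is the quantity that the `sd_threshold` argument is described as filtering on.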

Details

ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm performs an unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, followed by a calculation of principal components using the top ranking features (ReFACTor components).
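
The two steps described above can be sketched in base R. This is an illustration written from the description in this section, not the package's implementation: the simulated data, the variable names, the row-centering, and the use of `svd` and `prcomp` are assumptions made for the sketch.

```r
# Simulated input: m = 300 features by n = 40 observations, where only the
# first 30 features carry a rank-2 signal (all sizes and names are made up).
set.seed(42)
m <- 300; n <- 40; k <- 2; n_signal <- 30
W <- matrix(rnorm(n_signal * k), n_signal, k)   # feature loadings
H <- matrix(rnorm(k * n), k, n)                 # per-observation sources
X <- matrix(rnorm(m * n, sd = 0.5), m, n)       # noise everywhere
X[seq_len(n_signal), ] <- X[seq_len(n_signal), ] + W %*% H
rownames(X) <- paste0("f", seq_len(m))

# Step 1a: rank-k representation of the row-centered data via truncated SVD.
Xc <- X - rowMeans(X)
s <- svd(Xc, nu = k, nv = k)
X_low <- s$u %*% diag(s$d[seq_len(k)], k, k) %*% t(s$v)

# Step 1b: score each feature by the correlation between its observed values
# and its values under the low-rank representation, then rank the features.
scores <- vapply(seq_len(m), function(j) cor(Xc[j, ], X_low[j, ]), numeric(1))
ranked <- rownames(X)[order(scores, decreasing = TRUE)]

# Step 2: PCA restricted to the top-ranked features (observations as rows).
sparsity <- 30
pca <- prcomp(t(X[ranked[seq_len(sparsity)], ]), center = TRUE)
components <- pca$x[, seq_len(k)]
```

In this simulation most of the top-ranked features should be among the 30 signal features, because noise-only features correlate weakly with their low-rank values.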

Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument `C`). In cases where the ReFACTor components are intended to be used as covariates in a downstream analysis alongside the covariates in `C` (e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument `C.remove` to `TRUE`. This adjusts the selected features for the information in `C` prior to the calculation of the ReFACTor components, which will therefore capture only signals that are not present in `C` (and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in `C`).
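
The effect of adjusting the selected features for `C` before computing components can be illustrated with ordinary least squares in base R. This is a sketch under assumptions: the data and the covariate names `age` and `batch` are made up, and regressing each feature on `C` with `lm` is one standard way to do such an adjustment, not necessarily what the package does internally.

```r
# Made-up data: 20 selected features (rows) by 50 observations (columns),
# plus an n by p covariate matrix C with p = 2.
set.seed(7)
n <- 50
C <- cbind(age = rnorm(n), batch = rbinom(n, 1, 0.5))
X_sel <- matrix(rnorm(20 * n), 20, n) + outer(rnorm(20), C[, "age"])

# Regress every feature on C (lm adds the intercept) and keep the residuals.
fit <- lm(t(X_sel) ~ C)
X_adj <- t(residuals(fit))

# Principal components of the adjusted features are linear combinations of
# residuals, so they are uncorrelated with every column of C.
pcs <- prcomp(t(X_adj))$x[, 1:2]
max_abs_cor <- max(abs(cor(pcs, C)))   # zero up to floating-point error
```

Components computed this way carry no information that is linearly explained by `C`, which is why they complement `C` rather than duplicate it when both enter a downstream regression.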

Value

A list with the estimated components of the ReFACTor model.

`scores`	An `n` by `num_comp` matrix of the ReFACTor components (the projection scores).
`coeffs`	A `sparsity` by `num_comp` matrix of the coefficients of the ReFACTor components (the projection loadings).
`ranked_list`	A vector with the features in `X`, ranked by their scores in the feature selection step of the algorithm; the top scoring features (set according to the argument `sparsity`) are used for calculating the ReFACTor components. Note that features that were excluded according to `sd_threshold` will not appear in this `ranked_list`.

Note

For very large input matrices it is advised to use randomized SVD to speed up the feature selection step (see argument `rand_svd`).
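
For readers unfamiliar with the idea, the following base R sketch shows a generic randomized SVD (random range finding followed by a small exact SVD). The function `rand_svd_sketch` and its parameters `oversample` and `n_iter` are hypothetical and written for illustration only; the procedure used by the package when `rand_svd = TRUE` may differ.

```r
# Generic randomized SVD: approximate the top-k singular triplets of A.
rand_svd_sketch <- function(A, k, oversample = 10, n_iter = 2) {
  l <- min(k + oversample, dim(A))                  # size of the random sketch
  Omega <- matrix(rnorm(ncol(A) * l), ncol(A), l)   # random test matrix
  Q <- qr.Q(qr(A %*% Omega))                        # basis for the sampled range
  for (i in seq_len(n_iter)) {                      # power iterations sharpen it
    Q <- qr.Q(qr(crossprod(A, Q)))
    Q <- qr.Q(qr(A %*% Q))
  }
  s <- svd(crossprod(Q, A))                         # exact SVD of a small l by n matrix
  list(u = (Q %*% s$u)[, seq_len(k), drop = FALSE],
       d = s$d[seq_len(k)],
       v = s$v[, seq_len(k), drop = FALSE])
}

# Usage on a made-up 200 by 60 matrix that is rank 3 plus small noise.
set.seed(3)
A <- matrix(rnorm(200 * 3), 200, 3) %*% matrix(rnorm(3 * 60), 3, 60) +
  matrix(rnorm(200 * 60, sd = 0.01), 200, 60)
approx <- rand_svd_sketch(A, k = 3)
exact <- svd(A)$d[1:3]
rel_err <- max(abs(approx$d - exact) / exact)
```

The saving comes from replacing a full decomposition of the m by n matrix with QR factorizations of thin matrices and an SVD of a small one, which matters most when only a few leading components are needed.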

References

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.

Examples

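# Simulate a small dataset with TCA's test_data(), then run ReFACTor
# assuming k = 3 sources and keeping the top 50 ranked features.
data <- test_data(100, 200, 3, 0, 0, 0.01)
ref <- refactor(data$X, k = 3, sparsity = 50)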

[Package TCA version 1.2.1 Index]