refactor {TCA} | R Documentation |
Sparse principal component analysis using ReFACTor
Description
Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, for tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor captures the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.
Usage
refactor(
  X,
  k,
  sparsity = 500,
  C = NULL,
  C.remove = FALSE,
  sd_threshold = 0.02,
  num_comp = NULL,
  rand_svd = FALSE,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)
Arguments
X |
An |
k |
A numeric value indicating the dimension of the signal in |
sparsity |
A numeric value indicating the sparsity of the signal in |
C |
An |
C.remove |
A logical value indicating whether the covariates in C should be accounted for not only in the feature selection step, but also in the final calculation of the principal components (i.e. if |
sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
num_comp |
A numeric value indicating the number of ReFACTor components to return. |
rand_svd |
A logical value indicating whether to use randomized SVD for estimating the low-rank structure of the data in the first step of the algorithm; randomized SVD can result in a substantial speedup for large data. |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
Details
ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm first performs unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, and then calculates principal components using the top-ranking features (the ReFACTor components).
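The two steps can be pictured with a short base-R sketch. This is a simplified illustration of the description above, not the implementation used by the TCA package: the function name refactor_sketch, the use of a plain truncated SVD for the low-rank representation, and the simulated data are all assumptions made for the example.

```r
# Simplified sketch of the two steps described above; NOT the TCA implementation.
# refactor_sketch is a hypothetical helper written for illustration only.
refactor_sketch <- function(X, k, sparsity) {
  # X: features (rows) by observations (columns).
  Xc <- X - rowMeans(X)  # center each feature
  # Step 1a: rank-k representation of the data via a truncated SVD.
  s <- svd(Xc, nu = k, nv = k)
  X_low <- s$u %*% diag(s$d[1:k], nrow = k) %*% t(s$v)
  # Step 1b: rank features by correlation with their low-rank values.
  score <- vapply(seq_len(nrow(Xc)),
                  function(j) cor(Xc[j, ], X_low[j, ]),
                  numeric(1))
  ranked <- order(score, decreasing = TRUE)
  top <- ranked[seq_len(sparsity)]
  # Step 2: PCA on the top-ranking features (observations as rows).
  pca <- prcomp(t(X[top, , drop = FALSE]), center = TRUE, scale. = FALSE)
  list(scores = pca$x[, seq_len(k), drop = FALSE], ranked_list = ranked)
}

# Simulated row-sparse data: only the first 20 of 200 features carry signal.
set.seed(1)
n <- 50; m <- 200; k <- 2
Z <- matrix(rnorm(n * k), n, k)        # latent factors per observation
B <- matrix(0, m, k)                   # row-sparse loadings
B[1:20, 1] <- 3
B[1:20, 2] <- rep(c(-3, 3), 10)
X <- B %*% t(Z) + matrix(rnorm(m * n), m, n)

fit <- refactor_sketch(X, k = k, sparsity = 20)
head(fit$ranked_list, 20)              # indices of the top-ranking features
```

In this toy setting the features that carry the sparse signal correlate strongly with their low-rank values, while pure-noise features do not, which is why the ranking tends to place the signal features first.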
Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument C). In cases where the ReFACTor components are intended to be used as covariates in a downstream analysis alongside the covariates in C (e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument C.remove to TRUE. This adjusts the selected features for the information in C prior to the calculation of the ReFACTor components, which will therefore capture only signals that are not present in C (and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in C).
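One way to picture the adjustment for covariates is as regressing each selected feature on the covariates and keeping the residuals. The base-R sketch below illustrates that idea only; adjust_for_covariates is a hypothetical helper, the observations-by-covariates layout of C is an assumption, and the package may perform the adjustment differently.

```r
# Hypothetical helper for illustration; not a function of the TCA package.
adjust_for_covariates <- function(X_sel, C) {
  # X_sel: selected features (rows) by observations (columns).
  # C: observations (rows) by covariates (columns); this layout is assumed.
  D <- cbind(1, C)              # design matrix with an intercept
  fit <- lm.fit(D, t(X_sel))    # least-squares fit of every feature on D
  t(fit$residuals)              # residuals, back to features by observations
}

# Toy data: the first feature is driven by the first covariate.
set.seed(2)
n <- 40
C <- matrix(rnorm(n * 2), n, 2)
X_sel <- matrix(rnorm(10 * n), 10, n) + rbind(2 * C[, 1], matrix(0, 9, n))

X_adj <- adjust_for_covariates(X_sel, C)
max(abs(X_adj %*% C))           # close to zero up to floating-point error
```

Because the residuals are orthogonal to the covariates, principal components computed from the adjusted features cannot simply reproduce the information in C, which is the motivation given above for setting C.remove to TRUE.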
Value
A list with the estimated components of the ReFACTor model.
scores |
An |
coeffs |
A |
ranked_list |
A vector with the features in |
Note
For very large input matrices, it is advised to use randomized SVD to speed up the feature selection step (see argument rand_svd).
References
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.
Examples
library(TCA)
# Simulate data, then run ReFACTor with signal dimension 3 and sparsity 50.
data <- test_data(100, 200, 3, 0, 0, 0.01)
ref <- refactor(data$X, k = 3, sparsity = 50)