scdecon {SCdeconR} | R Documentation |
Deconvolution of bulk RNA-seq data
Description
Deconvolution of bulk RNA-seq data based on single-cell reference data. Eight bulk deconvolution methods, along with eight normalization methods and four transformation methods are available.
Usage
scdecon(
bulk,
ref,
phenodata,
filter_ref = TRUE,
marker_genes = NULL,
genes_to_remove = NULL,
min_pct_ct = 0.05,
decon_method = c("scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC",
"SCDC", "scTAPE"),
norm_method_sc = c("none", "LogNormalize", "SCTransform", "scran", "scater", "Linnorm"),
norm_method_bulk = c("none", "TMM", "median_ratios", "TPM"),
trans_method_sc = c("none", "log2", "sqrt", "vst"),
trans_method_bulk = c("none", "log2", "sqrt", "vst"),
gene_length = NULL,
lfc_markers = log2(1.5),
marker_strategy = c("all", "pos_fc", "top_50p_logFC", "top_50p_AveExpr"),
to_remove = NULL,
ffpe_artifacts = FALSE,
model = NULL,
prop = NULL,
cibersortpath = NULL,
pythonpath = NULL,
tmpdir = NULL,
remove_tmpdir = TRUE,
seed = 1234,
nsamples = 1000,
return_value_only = FALSE,
verbose = FALSE
)
Arguments
bulk |
a matrix or data.frame of unnormalizaed & untransformed bulk RNA-seq gene expression values with rows representing genes, columns representing samples |
ref |
a matrix or data.frame of untransformed scRNA-seq gene expression counts with rows representing genes, columns representing cells. This data will be used to deconvolute provided bulk RNA-seq data. |
phenodata |
a data.frame with rows representing cells, columns representing cell attributes. It should at least contain the first three columns as:
|
filter_ref |
logical value indicating whether outlier genes & cells should be removed from the provided reference data. Defaults to TRUE |
marker_genes |
a data.frame of two columns. First column represents cell types in ref; second column represents gene names of marker genes. If specified, those genes will be used to construct signature matrix for mark-gene based deconvolution methods, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Default to NULL, carry out differential analysis to identify marker genes for each cell type in ref. |
genes_to_remove |
a vector of gene names to remove from the reference scRNA-seq data. Default to NULL. |
min_pct_ct |
a numeric value indicating the minimum required proportion of expressing cells per cell type for marker gene identification. Only applicable when marker_genes is NULL. Default to 0.05. |
decon_method |
character value specifying the deconvolution method to use. Has to be one of "scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC", "SCDC", "scTAPE". See details for more information. |
norm_method_sc |
character value specifying the normalization method to use for reference data. Has to be one of "none","LogNormalize", "SCTransform", "scran", "scater", "Linnorm". See details for more information. |
norm_method_bulk |
character value specifying the normalization method to use for bulk data. Has to be one of "none", "TMM", "median_ratios", "TPM". See details for more information. |
trans_method_sc |
character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information. |
trans_method_bulk |
character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information. |
gene_length |
a data.frame with two columns. The first column represents gene names that match with provided bulk data. The second column represents length of each gene. Only applicable when norm_method is selected as "TPM". |
lfc_markers |
log2 fold change cutoff used to identify marker genes for deconvolution. The option only applicable to marker-gene based approaches, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Only applicable when marker_genes is NULL. |
marker_strategy |
further strategy in selecting marker genes besides applying the log2 fold change cutoff. Can be chosen from: "all", "pos_fc", "top_50p_logFC" or "top_50p_AveExpr". See details for more information. Only applicable when marker_genes is NULL. |
to_remove |
character value representing the cell type to remove from reference data. Only applicable to simulation experiments in evaluating effect of cell type removal from reference. |
ffpe_artifacts |
logical value indicating whether to add simulated ffpe artifacts in the bulk data. Only applicable to simulation experiments in evaluating the effect of FFPE artifacts. |
model |
pre-constructed ffpe model data. Can be downloaded from github: https://github.com/Liuy12/SCdeconR/blob/master/data/ffpemodel.rda. |
prop |
a matrix or data.frame of simulated cell proportion values with rows representing cell types, columns representing samples. Only applicable to simulation experiments in evaluating the effect of cell type removal from reference. |
cibersortpath |
full path to CIBERSORT.R script. |
pythonpath |
full path to python binary where scaden was installed with. |
tmpdir |
temporary processing directory for scaden or scTAPE. |
remove_tmpdir |
a logical value indicating whether to remove tmpdir once scaden is completed. Default to TRUE. |
seed |
random seed used for simulating FFPE artifacts. Only applicable when ffpe_artifacts is set to TRUE. |
nsamples |
number of artificial bulk samples to simulate for scaden. Default to 1000. |
return_value_only |
return a list of values only without performing deconvolution. This could be helpful in cases where the user want to apply their own deconvolution algorithms. Default to FALSE. |
verbose |
a logical value indicating whether to print messages. Default to FALSE. |
Details
decon_method should be one of the following:
- scaden
a deep learning based method using three multi-layer deep neural nets. To use scaden, you need to firstly install scaden via
pip
orconda
, the provide the python path topythonpath
option.- CIBERSORT
a marker gene based support vectors regression approach. CIBERSOR does not allow redistribution. To use CIBERSORT, you need to request the source code from the authors & provide the path of CIBERSORT.R script to
cibersortpath
option.- OLS
ordinary least squares.
- nnls
non-negative least squares.
- FARDEEP
robust regression using least trimmed squares
- RLR
robust regression using an M estimator
- MuSiC
multi-subject single-cell deconvolution
- SCDC
an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets
- scTAPE
Deep autoencoder based deconvolution
norm_method should be one of the following:
- none
no normalization is performed.
- LogNormalize
LogNormalize
method from seurat.- TMM
TMM method from
calcNormFactors
function from edgeR.- median_ratios
median ratio method from
estimateSizeFactors,DESeqDataSet-method
function from DESeq2.- TPM
Transcript per million. TPM has to be chosen if ffpe_artifacts is set to TRUE.
- SCTransform
SCTransform
method from Seurat.- scran
computeSumFactors
method from scran.- scater
librarySizeFactors
method from scater.- Linnorm
Linnorm
method from Linnorm.
trans_method should be one of the following:
- none
no transformation is performed.
- log2
log2 transformation. 0.1 is added to the data to avoid logarithm of 0s.
- sqrt
square root transformation.
- vst
varianceStabilizingTransformation
method from DESeq2.
marker_strategy should be one of the following:
- all
all genes passed fold change threshold will be used.
- pos_fc
only genes with positive fold changes will be used.
- top_50p_logFC
only genes with top 50 percent positive fold changes will be used.
- top_50p_AveExpr
only genes with top 50 percent average expression will be used.
Value
a list containing two or four elements.
- first element
a data.frame of predicted cell-type proportions, with rows representing cell types, columns representing samples.
- second element
a data.frame of fitting errors of the algorithm; first column represents sample names, second column represents RMSEs.
- optional third element
a data.frame of simulated cell proportion after removing the specified cell_type. Only applicable to simulation experiments.
- optional fourth element
a data.frame of marker genes used for deconvolution. Only applicable to marker-gene based deconvolution methods.
Examples
ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"),
paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2"))
phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"),
"/refdata/phenodata_sample1.txt")
phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"),
"/refdata/phenodata_sample2.txt")
phenodata_list <- c(phenopath1,phenopath2)
# construct integrated reference using harmony algorithm
refdata <- construct_ref(ref_list = ref_list,
phenodata_list = phenodata_list,
data_type = "cellranger",
method = "harmony",
group_var = "subjectid",
nfeature_rna = 50,
vars_to_regress = "percent_mt", verbose = FALSE)
phenodata <- data.frame(cellid = colnames(refdata),
celltypes = refdata$celltype,
subjectid = refdata$subjectid)
prop <- data.frame(celltypes = unique(refdata$celltype),
proportion = rep(1/length(unique(refdata$celltype)), length(unique(refdata$celltype))))
bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata,
num_mixtures = 20,
prop = prop,
num_mixtures_sprop = 1)
## perform deconvolution based on "OLS" algorithm
decon_res <- scdecon(bulk = bulk_sim[[1]],
ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata,
filter_ref = TRUE,
decon_method = "OLS",
norm_method_sc = "LogNormalize",
norm_method_bulk = "TMM",
trans_method_sc = "none",
trans_method_bulk = "log2",
marker_strategy = "all")