generateBulkCellMatrix {digitalDLSorteR} | R Documentation |
Generate training and test cell composition matrices
Description
Generate training and test cell composition matrices for the simulation of
pseudo-bulk RNA-Seq samples with known cell composition using single-cell
expression profiles. The resulting ProbMatrixCellTypes object contains a
matrix that determines the proportions of the different cell types that will
compose the simulated pseudo-bulk samples, along with other information
relevant to the process. This function does not simulate pseudo-bulk samples;
that task is performed by the simBulkProfiles or trainDDLSModel functions
(see Documentation).
Usage
generateBulkCellMatrix(
object,
cell.ID.column,
cell.type.column,
prob.design,
num.bulk.samples,
n.cells = 100,
train.freq.cells = 3/4,
train.freq.bulk = 3/4,
proportion.method = c(10, 5, 20, 15, 35, 15),
prob.sparsity = 0.5,
min.zero.prop = NULL,
balanced.type.cells = FALSE,
verbose = TRUE
)
Arguments
object |
|
cell.ID.column |
Name or number of the column in the cells metadata that contains the cell names of the expression matrix. |
cell.type.column |
Name or number of the column in the cells metadata that contains the cell type of each cell. |
prob.design |
Data frame with the expected frequency ranges for each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. This data frame must consist of three columns with specific headings (in the Examples section these are Cell_Type, from and to):
|
num.bulk.samples |
Number of bulk RNA-Seq sample proportions (and thus
simulated bulk RNA-Seq samples) to be generated, counting both
training and test data. We recommend setting this value according to the
number of single-cell profiles available in
|
n.cells |
Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default). |
train.freq.cells |
Proportion of cells used to simulate training pseudo-bulk samples (3/4 by default, as shown in Usage). |
train.freq.bulk |
Proportion of bulk RNA-Seq samples to the total number
( |
proportion.method |
Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sánchez-Cabo, 2019, for more information). This vector represents proportions, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges. |
prob.sparsity |
It only affects the proportions generated by the first method (Dirichlet distribution). It determines the probability of having missing cell types in each simulated sample, as opposed to a mixture of all cell types. A higher value for this parameter will result in sparser simulated samples. |
min.zero.prop |
This parameter controls the minimum number of cell types
that will be absent in each simulated sample. If |
balanced.type.cells |
Boolean indicating whether the training and test
cells will be split in a balanced way considering the cell types
( |
verbose |
Show informative messages during the execution ( |
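As a rough illustration of how num.bulk.samples, train.freq.bulk and proportion.method interact, the base R sketch below works out the sample counts implied by the defaults shown in Usage. The value of num.bulk.samples, the rounding rule, and the idea that the same percentages are applied to the training subset are assumptions made for this sketch; generateBulkCellMatrix may allocate samples differently.

```r
# Illustrative arithmetic only: the rounding used here is an assumption,
# not necessarily what generateBulkCellMatrix() does internally.
num.bulk.samples <- 400                        # hypothetical value for this sketch
train.freq.bulk <- 3/4                         # default shown in Usage
proportion.method <- c(10, 5, 20, 15, 35, 15)  # default shown in Usage

# The vector must contain six entries that add up to 100
stopifnot(length(proportion.method) == 6, sum(proportion.method) == 100)

# Split the requested samples into training and test subsets
n.train <- round(num.bulk.samples * train.freq.bulk)
n.test <- num.bulk.samples - n.train

# Approximate number of training samples attributed to each method
n.train.by.method <- round(n.train * proportion.method / 100)
names(n.train.by.method) <- paste0("method", seq_along(proportion.method))
print(n.train.by.method)
```

With these values, 300 samples go to training and 100 to test, and the fifth entry of proportion.method accounts for the largest share of the training samples.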
Details
First, the available single-cell profiles are split into training and test
subsets (3/4 for training and 1/4 for test by default, see train.freq.cells)
to avoid biasing the results during model evaluation. Next, num.bulk.samples
bulk sample proportions are built and the single-cell profiles used to
simulate each pseudo-bulk RNA-Seq sample are set, with 100 cells per bulk
sample by default (see the n.cells argument). The proportions of training and
test pseudo-bulk samples are set by train.freq.bulk (3/4 for training and 1/4
for testing by default). Finally, in order to avoid biases due to the
composition of the pseudo-bulk RNA-Seq samples, cell type proportions
(w_1,...,w_k, where k is the number of cell types available in the
single-cell profiles) are randomly generated using six different approaches:
1. Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see the prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.
2. A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.
3. Cell proportions are randomly sampled as by method 1 without replacement.
4. Using the last method for generating proportions, cell type labels are randomly sampled.
5. Cell proportions are randomly sampled from a Dirichlet distribution.
6. Pseudo-bulk RNA-Seq samples composed of the same cell type are generated in order to provide 'pure' pseudo-bulk samples.
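To make the idea behind some of these approaches more concrete, the base R sketch below draws cell type proportions in the spirit of the first, fifth and sixth approaches listed above. It is not the package's implementation: the helper names (sample_uniform_ranges, sample_dirichlet, sample_pure), the rescaling step and the use of percentages that sum to 100 are assumptions made for illustration.

```r
set.seed(1)

# Ranges per cell type, mirroring the data frame used in the Examples section
prob.design <- data.frame(
  Cell_Type = c("CellType1", "CellType2"),
  from = c(1, 30),
  to = c(15, 70)
)

# Hypothetical helper: draw each cell type uniformly within its range and
# rescale so the proportions sum to 100. Rescaling can move a value outside
# its original range; a real implementation may handle this differently.
sample_uniform_ranges <- function(design) {
  w <- runif(nrow(design), min = design$from, max = design$to)
  setNames(100 * w / sum(w), design$Cell_Type)
}

# Hypothetical helper: Dirichlet draw obtained by normalizing independent
# gamma variates (alpha = 1 corresponds to a flat Dirichlet)
sample_dirichlet <- function(cell.types, alpha = 1) {
  g <- rgamma(length(cell.types), shape = alpha, rate = 1)
  setNames(100 * g / sum(g), cell.types)
}

# Hypothetical helper: 'pure' sample made of a single, randomly chosen cell type
sample_pure <- function(cell.types) {
  w <- setNames(numeric(length(cell.types)), cell.types)
  w[sample(cell.types, 1)] <- 100
  w
}

w.uniform <- sample_uniform_ranges(prob.design)
w.dirichlet <- sample_dirichlet(prob.design$Cell_Type)
w.pure <- sample_pure(prob.design$Cell_Type)
```

Each helper returns a named vector with one entry per cell type, which under these assumptions would correspond to one row of a cell composition matrix.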
If you want to inspect the distribution of cell type proportions generated by
each method during the process, they can be visualized with the showProbPlot
function (see Documentation).
Value
A DigitalDLSorter object with a prob.cell.types slot containing a list with
two ProbMatrixCellTypes objects (training and test). For more information
about the structure of this class, see ?ProbMatrixCellTypes.
References
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: 10.3389/fgene.2019.00978
See Also
simBulkProfiles
ProbMatrixCellTypes
Examples
library(digitalDLSorteR)

set.seed(123) # reproducibility
# simulated data: 15 genes x 10 cells (the 30 Poisson draws are recycled
# to fill the matrix)
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
# create the DigitalDLSorter object from the single-cell data
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  sc.log.FC = FALSE
)
# expected frequency range (from, to) for each cell type
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
# build the training and test cell composition matrices
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)