R: Generate training and test cell composition matrices

generateBulkCellMatrix {digitalDLSorteR}

R Documentation

Generate training and test cell composition matrices

Description

Generate training and test cell composition matrices for the simulation of pseudo-bulk RNA-Seq samples with known cell composition using single-cell expression profiles. The resulting ProbMatrixCellTypes object contains a matrix that determines the proportion of the different cell types that will compose the simulated pseudo-bulk samples, together with other information relevant to the process. This function does not simulate pseudo-bulk samples; that task is performed by the simBulkProfiles or trainDDLSModel functions (see Documentation).

Usage

generateBulkCellMatrix(
  object,
  cell.ID.column,
  cell.type.column,
  prob.design,
  num.bulk.samples,
  n.cells = 100,
  train.freq.cells = 3/4,
  train.freq.bulk = 3/4,
  proportion.method = c(10, 5, 20, 15, 35, 15),
  prob.sparsity = 0.5,
  min.zero.prop = NULL,
  balanced.type.cells = FALSE,
  verbose = TRUE
)

Arguments

`object`	`DigitalDLSorter` object with `single.cell.real` slot and, optionally, with `single.cell.simul` slot.
`cell.ID.column`	Name or number of the column in the cells metadata that contains the cell names of the expression matrix.
`cell.type.column`	Name or column number corresponding to the cell type of each cell in cells metadata.
`prob.design`	Data frame with the expected frequency range of each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. The data frame must consist of three columns with specific headings (see Examples): (1) a cell type column with the same name as the cell type column in the cells metadata (`cell.type.column`); if the names differ, the function returns an error. All cell types must appear in the cells metadata. (2) A column called `'from'` with the start frequency of each cell type. (3) A column called `'to'` with the end frequency of each cell type.
`num.bulk.samples`	Number of bulk RNA-Seq sample proportions (and thus simulated bulk RNA-Seq samples) to be generated, counting both training and test data. We recommend setting this value according to the number of single-cell profiles available in the `DigitalDLSorter` object, avoiding excessive re-sampling while still generating a large number of samples for better training.
`n.cells`	Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default).
`train.freq.cells`	Proportion of cells used to simulate training pseudo-bulk samples (3/4 by default, as shown in Usage).
`train.freq.bulk`	Proportion of the total number of bulk RNA-Seq samples (`num.bulk.samples`) used for the training set (3/4 by default, as shown in Usage).
`proportion.method`	Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sánchez-Cabo, 2019, for more information). This vector represents percentages, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges.
`prob.sparsity`	It only affects the proportions generated by the method based on the Dirichlet distribution (see Details). It determines the probability of having missing cell types in each simulated sample, as opposed to a mixture of all cell types. A higher value for this parameter will result in sparser simulated samples.
`min.zero.prop`	This parameter controls the minimum number of cell types that will be absent in each simulated sample. If `NULL` (by default), this value will be half of the total number of different cell types; increasing it will result in more samples composed of fewer cell types. This helps to create sparser proportions and to cover a wider range of situations during model training.
`balanced.type.cells`	Boolean indicating whether the training and test cells will be split in a balanced way considering the cell types (`FALSE` by default).
`verbose`	Show informative messages during the execution (`TRUE` by default).
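
The constraints on `prob.design` described above can be checked before calling the function. The sketch below is only an illustration: the helper `check_prob_design` is hypothetical (it is not part of digitalDLSorteR), and it assumes that frequencies are expressed as percentages between 0 and 100, as in the Examples section. The package performs its own validation, which may be stricter.

```r
# Hypothetical helper, not part of digitalDLSorteR: basic sanity checks on a
# prob.design data frame. Assumes frequencies are percentages in [0, 100].
check_prob_design <- function(prob.design, cell.type.column, cell.types) {
  expected <- c(cell.type.column, "from", "to")
  if (!all(expected %in% colnames(prob.design))) {
    stop("prob.design must contain the columns: ",
         paste(expected, collapse = ", "))
  }
  if (!all(prob.design[[cell.type.column]] %in% cell.types)) {
    stop("prob.design contains cell types absent from the cells metadata")
  }
  if (any(prob.design$from < 0 | prob.design$to > 100 |
          prob.design$from > prob.design$to)) {
    stop("each range must satisfy 0 <= from <= to <= 100")
  }
  invisible(TRUE)
}

# Same design as in the Examples section
prob.design <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
check_prob_design(
  prob.design,
  cell.type.column = "Cell_Type",
  cell.types = c("CellType1", "CellType2")
)
```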

Details

First, the available single-cell profiles are split into training and test subsets (3/4 for training and 1/4 for test by default, see train.freq.cells) to avoid biasing the results during model evaluation. Next, num.bulk.samples bulk sample proportions are built and the single-cell profiles used to simulate each pseudo-bulk RNA-Seq sample are set, with 100 cells per bulk sample by default (see the n.cells argument). The proportions of training and test pseudo-bulk samples are set by train.freq.bulk (3/4 for training and 1/4 for test by default). Finally, to avoid biases due to the composition of the pseudo-bulk RNA-Seq samples, cell type proportions (w_1,...,w_k, where k is the number of cell types available in the single-cell profiles) are randomly generated using six different approaches:

1. Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see the prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.
2. A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.
3. Cell proportions are randomly sampled as in method 1, without replacement.
4. Using the proportions generated by the previous method, cell type labels are randomly sampled.
5. Cell proportions are randomly sampled from a Dirichlet distribution.
6. Pseudo-bulk RNA-Seq samples composed of a single cell type are generated in order to provide 'pure' pseudo-bulk samples.
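
To make the arithmetic concrete, the following base R sketch shows how a requested number of samples could be divided between the training and test sets and across the six approaches, and how proportions might be drawn for the truncated uniform and Dirichlet approaches in the list above. It is not the package's implementation: the rounding, the rescaling to 100, the symmetric Dirichlet parameter (alpha = 1), the three-cell-type design, and the helper names `draw_uniform` and `draw_dirichlet` are all assumptions made for illustration.

```r
set.seed(123)

# Split the requested samples into training and test sets, then across the
# six methods (rounding here is an assumption; the package may differ).
num.bulk.samples <- 400
train.freq.bulk <- 3 / 4
proportion.method <- c(10, 5, 20, 15, 35, 15)
stopifnot(sum(proportion.method) == 100)

n.train <- round(num.bulk.samples * train.freq.bulk)
n.test <- num.bulk.samples - n.train
train.per.method <- round(n.train * proportion.method / 100)

# Illustrative design with three cell types (percentages)
prob.design <- data.frame(
  Cell_Type = c("CellType1", "CellType2", "CellType3"),
  from = c(1, 30, 10),
  to = c(15, 70, 40)
)

# Uniform draw inside each cell type's range, rescaled so that the sample
# adds up to 100. Rescaling can push a value slightly outside its range;
# a full implementation would need to handle that case.
draw_uniform <- function(design) {
  w <- runif(nrow(design), min = design$from, max = design$to)
  setNames(100 * w / sum(w), design$Cell_Type)
}

# Dirichlet draw obtained by normalizing independent gamma variables
draw_dirichlet <- function(cell.types, alpha = rep(1, length(cell.types))) {
  g <- rgamma(length(cell.types), shape = alpha)
  setNames(100 * g / sum(g), cell.types)
}

w.unif <- draw_uniform(prob.design)
w.dir <- draw_dirichlet(prob.design$Cell_Type)
```

Each returned vector holds one percentage per cell type and adds up to 100, matching the w_1,...,w_k notation used in this section.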

If you want to inspect the distribution of cell type proportions generated by each method during the process, they can be visualized by the showProbPlot function (see Documentation).

Value

A DigitalDLSorter object with the prob.cell.types slot containing a list of two ProbMatrixCellTypes objects (training and test). For more information about the structure of this class, see ?ProbMatrixCellTypes.

References

Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi:10.3389/fgene.2019.00978

Examples

set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(15 * 10, lambda = 5), nrow = 15, ncol = 10, # one draw per matrix entry
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10, 
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE, 
  sc.log.FC = FALSE
)
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)

[Package digitalDLSorteR version 1.0.1 Index]