simBulkProfiles {digitalDLSorteR} | R Documentation |
Simulate training and test pseudo-bulk RNA-Seq profiles
Description
Simulate training and test pseudo-bulk RNA-Seq profiles using the cell
composition matrices generated by the generateBulkCellMatrix
function. The samples are generated under the assumption that the expression
level of the i
gene in the j
bulk sample is given by the sum of
the expression levels of the cell types X_{ijk}
that make them up
weighted by the proportions of these k
cell types in each sample. In
practice, as described in Torroja and Sanchez-Cabo, 2019, these profiles are
generated by summing a number of cells of different cell types determined by
proportions from a matrix of known cell composition. The number of simulated
pseudo-bulk RNA-Seq samples and the number of cells composing each sample are
determined by generateBulkCellMatrix
(see Documentation)
Note: this step can be avoided by using the on.the.fly
argument in the trainDDLSModel
function. See
Documentation for more information.
Usage
simBulkProfiles(
object,
type.data = "both",
pseudobulk.function = "AddRawCount",
file.backend = NULL,
compression.level = NULL,
block.processing = FALSE,
block.size = 1000,
chunk.dims = NULL,
threads = 1,
verbose = TRUE
)
Arguments
object |
|
type.data |
Type of data to generate between |
pseudobulk.function |
Function used to build pseudo-bulk samples. It may be:
|
file.backend |
Valid file path to store the simulated single-cell
expression profiles as an HDF5 file ( |
compression.level |
The compression level used if |
block.processing |
Boolean indicating whether the data should be
simulated in blocks (only if |
block.size |
Only if |
chunk.dims |
Specifies the dimensions that HDF5 chunk will have. If
|
threads |
Number of threads used during the simulation of pseudo-bulk
samples (1 by default). Set according to computational resources and avoid
it if |
verbose |
Show informative messages during the execution ( |
Details
digitalDLSorteR allows the use of HDF5 files as back-end to store the
resulting data using the DelayedArray and HDF5Array packages.
This functionality allows to work without keeping the data loaded into RAM,
which could be of vital importance during some computationally heavy steps
such as neural network training on RAM-limited machines. You must provide a
valid file path in the file.backend
argument to store the resulting
file with the '.h5' extension. The data will be accessible from R without
being loaded into memory. This option slightly slows down execution times, as
subsequent transformations of the data will be done in blocks rather than
using all the data. We recommend this option according to the computational
resources available and the number of pseudo-bulk samples to be generated.
Note that if you use the file.backend
argument with
block.processing = FALSE
, all pseudo-bulk profiles will be simulated
in one step and, therefore, loaded into RAM. Then, the data will be written
to an HDF5 file. To avoid the RAM collapse, pseudo-bulk profiles can be
simulated and written to HDF5 files in blocks of block.size
size by
setting block.processing = TRUE
.
It is possible to avoid this step by using the on.the.fly
argument in
the trainDDLSModel
function. In this way, data is
generated 'on the fly' during the neural network training. For more details,
see ?trainDDLSModel
.
Value
A DigitalDLSorter
object with bulk.simul
slot containing a list with one or two entries (depending on selected
type.data
argument): 'train'
and 'test'
. Each entry
contains a SummarizedExperiment
object
with simulated bulk samples in the assay
slot, sample names in the
colData
slot and feature names in the rowData
slot.
References
Fischer B, Smith M and Pau, G (2020). rhdf5: R Interface to HDF5. R package version 2.34.0.
Pagès H, Hickey P and Lun A (2020). DelayedArray: A unified framework for working transparently with on-disk and in-memory array-like datasets. R package version 0.16.0.
Pagès H (2020). HDF5Array: HDF5 backend for DelayedArray objects. R package version 1.18.0.
See Also
generateBulkCellMatrix
ProbMatrixCellTypes
trainDDLSModel
Examples
set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(
counts = matrix(
rpois(30, lambda = 5), nrow = 15, ncol = 10,
dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
)
),
colData = data.frame(
Cell_ID = paste0("RHC", seq(10)),
Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
replace = TRUE)
),
rowData = data.frame(
Gene_ID = paste0("Gene", seq(15))
)
)
DDLS <- createDDLSobject(
sc.data = sce,
sc.cell.ID.column = "Cell_ID",
sc.gene.ID.column = "Gene_ID",
sc.filt.genes.cluster = FALSE,
sc.log.FC = FALSE
)
probMatrixValid <- data.frame(
Cell_Type = paste0("CellType", seq(2)),
from = c(1, 30),
to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
object = DDLS,
cell.ID.column = "Cell_ID",
cell.type.column = "Cell_Type",
prob.design = probMatrixValid,
num.bulk.samples = 10,
verbose = TRUE
)
DDLS <- simBulkProfiles(DDLS, verbose = TRUE)