trainDDLSModel {digitalDLSorteR}    R Documentation
Train Deep Neural Network model
Description
Train a Deep Neural Network model using the training data from a
DigitalDLSorter object. In addition, the trained model is evaluated with test
data and prediction results are computed to determine its performance (see
?calculateEvalMetrics). Training and evaluation can be performed using
simulated profiles stored in the DigitalDLSorter object or 'on the fly' by
simulating the pseudo-bulk profiles at the same time as training/evaluation is
performed (see Details).
Usage
trainDDLSModel(
object,
type.data.train = "bulk",
type.data.test = "bulk",
batch.size = 64,
num.epochs = 60,
num.hidden.layers = 2,
num.units = c(200, 200),
activation.fun = "relu",
dropout.rate = 0.25,
loss = "kullback_leibler_divergence",
metrics = c("accuracy", "mean_absolute_error", "categorical_accuracy"),
normalize = TRUE,
scaling = "standardize",
norm.batch.layers = TRUE,
custom.model = NULL,
shuffle = TRUE,
use.generator = FALSE,
on.the.fly = FALSE,
pseudobulk.function = "AddRawCount",
threads = 1,
view.metrics.plot = TRUE,
verbose = TRUE
)
Arguments
object: DigitalDLSorter object with the single-cell profiles and, if
on.the.fly = FALSE, the simulated pseudo-bulk profiles to be used for
training and evaluation.

type.data.train: Type of profiles to be used for training ("bulk" by
default).

type.data.test: Type of profiles to be used for evaluation ("bulk" by
default).

batch.size: Number of samples per gradient update (64 by default).

num.epochs: Number of epochs to train the model (60 by default).

num.hidden.layers: Number of hidden layers of the neural network (2 by
default). This number must be equal to the length of the num.units argument.

num.units: Vector indicating the number of neurons per hidden layer
(c(200, 200) by default). The length of this vector must be equal to the
num.hidden.layers argument.

activation.fun: Activation function to use ("relu" by default).

dropout.rate: Float between 0 and 1 indicating the fraction of the input
neurons to drop in dropout layers (0.25 by default). By default,
digitalDLSorteR implements 1 dropout layer per hidden layer.

loss: Character indicating the loss function used for model training
("kullback_leibler_divergence" by default).

metrics: Vector of metrics used to assess model performance during training
and evaluation (c("accuracy", "mean_absolute_error", "categorical_accuracy")
by default).

normalize: Whether to normalize data using logCPM (TRUE by default).

scaling: How to scale data before training: "standardize" (standardize
values; default) or "rescale" (scale values into the [0, 1] range).

norm.batch.layers: Whether to include batch normalization layers between
each hidden dense layer (TRUE by default).

custom.model: Allows a custom neural network to be used. It must be a
keras.engine.sequential.Sequential object in which the number of input
neurons is equal to the number of considered features/genes and the number
of output neurons is equal to the number of considered cell types (NULL by
default; see Details).

shuffle: Boolean indicating whether data will be shuffled (TRUE by default).

use.generator: Boolean indicating whether to use generators during training
and test (FALSE by default). Generators are automatically used when
on.the.fly = TRUE.

on.the.fly: Boolean indicating whether data will be generated 'on the fly'
during training (FALSE by default).

pseudobulk.function: Function used to build pseudo-bulk samples when
on.the.fly = TRUE ("AddRawCount" by default).

threads: Number of threads used during simulation of pseudo-bulk samples if
on.the.fly = TRUE (1 by default).

view.metrics.plot: Boolean indicating whether to show plots of loss and
metrics progression during training (TRUE by default).

verbose: Boolean indicating whether to display model progression during
training and model architecture information (TRUE by default).
Details
Keras/Tensorflow environment
All Deep Learning-related steps in the digitalDLSorteR package are performed
using the keras package, an R API for Keras in Python available on CRAN. To
set up a working Python environment with TensorFlow, we recommend using the
installTFpython function included in the package.
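As a sketch, the one-time environment setup could look like this (it assumes
digitalDLSorteR is already installed and simply calls the installTFpython
helper mentioned above):

```r
# One-time setup sketch: install the Python back end used by keras.
# installTFpython() creates a conda environment with Python and
# TensorFlow so that keras can find a working back end.
library(digitalDLSorteR)
installTFpython()
```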
Simulation of bulk RNA-Seq profiles 'on the fly'
trainDDLSModel makes it possible to avoid storing bulk RNA-Seq profiles by
using the on.the.fly argument. This functionality aims to avoid the execution
time and memory usage of the simBulkProfiles function, as the simulated
pseudo-bulk profiles are built in each batch during training/evaluation.
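The two modes can be contrasted as follows (a sketch; it assumes a DDLS
object already prepared with generateBulkCellMatrix, as in the Examples):

```r
# Option A: precompute and store pseudo-bulk profiles, then train on them
DDLS <- simBulkProfiles(DDLS)
DDLS <- trainDDLSModel(object = DDLS, on.the.fly = FALSE)

# Option B: skip simBulkProfiles; pseudo-bulk profiles are simulated in
# each batch while training/evaluation runs, so they are never stored
DDLS <- trainDDLSModel(
  object = DDLS,
  on.the.fly = TRUE,
  pseudobulk.function = "AddRawCount",
  threads = 2
)
```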
Neural network architecture
By default, trainDDLSModel implements the architecture selected in Torroja
and Sánchez-Cabo, 2019. However, as the default architecture may not produce
good results depending on the dataset, its parameters can be changed through
the corresponding arguments: number of hidden layers, number of neurons per
hidden layer, dropout rate, activation function and loss function. For fully
customized models, a pre-built model can be provided in the custom.model
argument (a keras.engine.sequential.Sequential object) in which the number of
input neurons must equal the number of considered features/genes and the
number of output neurons must equal the number of considered cell types.
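For instance, a custom model meeting these requirements could be sketched
with keras as follows (n.genes and n.cell.types are placeholder values; they
must match the features and cell types stored in the DigitalDLSorter object):

```r
library(keras)

n.genes <- 500      # placeholder: number of features/genes in the object
n.cell.types <- 5   # placeholder: number of cell types in the object

custom <- keras_model_sequential() %>%
  layer_dense(units = 250, activation = "relu",
              input_shape = n.genes) %>%       # input neurons = genes
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = 150, activation = "relu") %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = n.cell.types,
              activation = "softmax")          # output neurons = cell types

DDLS <- trainDDLSModel(object = DDLS, custom.model = custom)
```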
Value
A DigitalDLSorter object with the trained.model slot containing a
DigitalDLSorterDNN object. For more information about the structure of this
class, see ?DigitalDLSorterDNN.
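After training, the fitted network and its evaluation results can be
retrieved from that slot; a sketch using the trained.model getter and the
plotTrainingHistory function listed under See Also:

```r
dnn <- trained.model(DDLS)   # DigitalDLSorterDNN object stored in the slot
dnn                          # show method summarizes the stored model
plotTrainingHistory(DDLS)    # loss/metric curves across training epochs
```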
References
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: 10.3389/fgene.2019.00978
See Also
plotTrainingHistory, deconvDigitalDLSorter, deconvDDLSObj
Examples
## Not run:
set.seed(123) # reproducibility
# toy single-cell dataset: 15 genes x 10 cells with 2 cell types
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(15 * 10, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
# load the single-cell data into a DigitalDLSorter object
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  sc.log.FC = FALSE
)
# ranges of cell type proportions used to simulate pseudo-bulk samples
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 30,
  verbose = TRUE
)
# training of DDLS model with pseudo-bulk profiles simulated 'on the fly'
tensorflow::tf$compat$v1$disable_eager_execution()
DDLS <- trainDDLSModel(
  object = DDLS,
  on.the.fly = TRUE,
  batch.size = 12,
  num.epochs = 5
)

## End(Not run)