R: Prepare training and validation datasets

prepare.training.validation.datasets {SIMMS}

R Documentation

Prepare training and validation datasets

Description

Computes per-patient, pathway-derived network impact scores independently for each input dataset.

Usage

prepare.training.validation.datasets(
  data.directory = ".",
  output.directory = ".",
  data.types = c("mRNA"),
  data.types.ordinal = c("cna"),
  min.ordinal.threshold = c(cna = 3),
  centre.data = "median",
  p.threshold = 0.5,
  feature.selection.datasets = NULL,
  datasets = NULL,
  truncate.survival = 100,
  networks.database = "default",
  write.normed.datasets = TRUE,
  subset = NULL
)

Arguments

`data.directory`	Path to the directory containing datasets as specified by `datasets`
`output.directory`	Path to the output folder where intermediate and results files will be saved
`data.types`	A vector of molecular datatypes to load. Defaults to c('mRNA')
`data.types.ordinal`	A vector of molecular datatypes to be treated as ordinal. Defaults to c('cna')
`min.ordinal.threshold`	A named vector specifying the minimum percent threshold for each ordinal data type, applied before coefficients are estimated. Coefficients for features that do not meet the minimum threshold are not estimated and are set to 0. Defaults to a cna threshold of 3 percent
`centre.data`	A character string specifying the centre value to be used for scaling data. Valid values are: 'median', 'mean', or a user-defined numeric threshold, e.g. '0.3' when modelling methylation beta values. This value is used both for scaling and for dichotomising data when estimating univariate betas from the Cox model. Defaults to 'median'
`p.threshold`	Cox P value threshold to be applied for selecting features (e.g. genes) which will contribute to patient risk score estimation. Defaults to 0.5
`feature.selection.datasets`	A vector containing names of datasets used for feature selection in function `derive.network.features()`
`datasets`	A vector containing names of all the datasets to be later used for training and validation purposes
`truncate.survival`	A numeric value specifying survival truncation in years. Defaults to 100 years which effectively means no truncation
`networks.database`	Name of the pathway networks database. Defaults to "default", i.e. NCI PID/Reactome/BioCarta
`write.normed.datasets`	A toggle to control whether processed mRNA and survival data should be written to file
`subset`	A list with `Field` and `Entry` components specifying the subset of patients to select: those whose annotation `Field` matches `Entry`
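
The following is a minimal base R sketch of what `centre.data` and `truncate.survival` describe, not the package's internal code. The toy values, the object names (`expr`, `surv`, `centres`, `groups`) and the exact rules used (strictly greater than the centre for dichotomising, censoring of events beyond the truncation point) are assumptions made for illustration only.

```r
# Toy expression matrix (hypothetical values): genes in rows, patients in columns
expr <- matrix(
  c(2.0, 4.0, 6.0, 8.0,
    1.0, 1.5, 3.5, 9.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("P", 1:4))
)

# centre.data = "median": one centre value per gene
centres <- apply(expr, 1, median)

# Scaling step: subtract each gene's centre from its row
centred <- sweep(expr, 1, centres, FUN = "-")

# Dichotomising step (assumed rule): 1 if above the centre, 0 otherwise
groups <- (expr > centres) * 1

# Toy survival data in years (hypothetical values)
surv <- data.frame(time = c(2.5, 7.0, 12.0), status = c(1, 0, 1))
truncate.survival <- 10

# Assumed truncation rule: cap follow-up time and censor events beyond the cap
over <- surv$time > truncate.survival
surv$time[over] <- truncate.survival
surv$status[over] <- 0
```

With the default `truncate.survival = 100`, no realistic follow-up time exceeds the cap, which is why the default effectively means no truncation.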

Value

The output files are stored under output.directory/output/
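
One way to check what was written is to list the contents of that folder. This is a sketch that assumes `output.directory` holds the same path that was passed to the function; the names of the individual files are not documented here, so none are assumed. `results.dir` is a hypothetical helper variable.

```r
# Path passed as output.directory in the call (here: the session temp dir)
output.directory <- tempdir()

# Results are documented to live in the "output" subfolder
results.dir <- file.path(output.directory, "output")

if (dir.exists(results.dir)) {
  # Show every file written, including those in nested folders
  print(list.files(results.dir, recursive = TRUE))
} else {
  message("No output found under ", results.dir)
}
```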

Author(s)

Syed Haider

Examples


# get data directory 
data.directory <- get.program.defaults()[["test.data.dir"]];

# initialise params
output.directory <- tempdir();
data.types <- c("mRNA");
feature.selection.datasets <- c("Breastdata1");
training.datasets <- c("Breastdata1");
validation.datasets <- c("Breastdata1", "Breastdata2");

# preparing training and validation datasets.
# Normalisation & patientwise subnet feature scores
prepare.training.validation.datasets(
  data.directory = data.directory,
  output.directory = output.directory,
  data.types = data.types,
  feature.selection.datasets = feature.selection.datasets,
  datasets = unique(c(training.datasets, validation.datasets)),
  networks.database = "test"
  );

[Package SIMMS version 1.3.2 Index]