run_TIGER {TIGERr}        R Documentation

Run TIGER to eliminate technical variation

Description

Use the TIGER algorithm to eliminate technical variation in metabolomics data. TIGER supports targeted and untargeted metabolomics data and can perform both intra- and inter-batch technical variation removal.

Usage

run_TIGER(
  test_samples,
  train_samples,
  col_sampleID,
  col_sampleType,
  col_batchID,
  col_order = NULL,
  col_position = NULL,
  targetVal_external = NULL,
  targetVal_method = c("mean", "median"),
  targetVal_batchWise = FALSE,
  targetVal_removeOutlier = !targetVal_batchWise,
  selectVar_external = NULL,
  selectVar_corType = c("cor", "pcor"),
  selectVar_corMethod = c("pearson", "spearman"),
  selectVar_minNum = 5,
  selectVar_maxNum = 10,
  selectVar_batchWise = FALSE,
  mtry_percent = seq(0.2, 0.8, 0.2),
  nodesize_percent = seq(0.2, 0.8, 0.2),
  ...,
  parallel.cores = 2
)

Arguments

test_samples

(required) a data.frame containing the samples to be corrected (for example, subject samples). This data.frame should contain the following columns:

  • sample ID (required): name or label for each sample,

  • sample type (required): indicating the type of each sample,

  • batch ID (required): the batch of each sample,

  • order information (optional): injection order or temporal information of each sample,

  • position information (optional): well position of each sample,

  • metabolite values (required): values to be normalised. Infinite values are not allowed.

Row: sample. Column: variable. See Examples.
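A minimal sketch of this layout (the column names mirror the demo dataset used in Examples; the metabolite names and values are made up for illustration):

```r
# One row per sample, one column per variable (illustrative values only):
test_samples <- data.frame(
  sampleID       = c("S1", "S2", "S3"),                 # sample ID (required)
  sampleType     = c("QC1", "QC2", "QC3"),              # sample type (required)
  plateID        = c("batch_1", "batch_1", "batch_2"),  # batch ID (required)
  injectionOrder = c(1, 2, 3),                          # order information (optional)
  wellPosition   = c(1, 2, 10),                         # position information (optional)
  metabolite_A   = c(0.51, 0.48, 0.55),                 # metabolite values (required)
  metabolite_B   = c(1.20, 1.18, 1.25)
)
```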

train_samples

(required) a data.frame containing the quality control (QC) samples used for model training. The columns in this data.frame should correspond to the columns in test_samples, and the two data.frames should have identical column names.

col_sampleID

(required) a character string indicating the name of the column that specifies the sample ID of each sample. The values in this column will not affect the data correction process but can act as labels for different samples. See Examples.

col_sampleType

(required) a character string indicating the name of the column that specifies the type (such as QC1, QC2, subject) of each sample. This column can be used to indicate different kinds of QC samples in train_samples. QC samples of the same type should have the same type name. See Examples.

col_batchID

(required) a character string indicating the name of the column that specifies the batch ID of each sample. See Examples.

col_order

(optional) NULL or a character string indicating the name of the column that contains the injection order or temporal information (numeric values). This explicitly asks the algorithm to capture the technical variation introduced by injection order, which might be useful when your data show obvious temporal drifts. If NULL (default), train_samples and test_samples should contain no column with injection order information.

col_position

(optional) NULL or a character string indicating the name of the column that contains the well position information (numeric values). This explicitly asks the algorithm to capture the technical variation introduced by well position, which might be useful when the well position has a great impact during data acquisition. If NULL (default), train_samples and test_samples should contain no column with well position information.

targetVal_external

(optional) a list generated by function compute_targetVal. See Details.

targetVal_method

a character string specifying how target values are to be computed. Can be "mean" (default) or "median". Ignored if a list of external target values has been assigned to targetVal_external.

targetVal_batchWise

logical. If TRUE, the target values will be computed based on each batch, otherwise, based on the whole dataset. Setting TRUE might be useful if your dataset has very obvious batch effects, but this may also make the algorithm less robust. Default: FALSE. Ignored if a list of external target values has been assigned to targetVal_external.

targetVal_removeOutlier

logical. If TRUE, outliers will be removed before the computation. Outliers are determined with the 1.5 * IQR (interquartile range) rule. We recommend turning this off when the target values are computed based on batches. Default: !targetVal_batchWise. Ignored if a list of external target values has been assigned to targetVal_external.
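The 1.5 * IQR rule mentioned above can be sketched in base R as follows (an illustrative reimplementation, not the package's internal code):

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
# (illustrative only; TIGER's internal implementation may differ):
is_outlier <- function(x) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

vals <- c(10, 11, 12, 11, 10, 50)  # 50 is an obvious outlier
mean(vals[!is_outlier(vals)])      # target value computed without the outlier
```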

selectVar_external

(optional) a list generated by function select_variable. See Details.

selectVar_corType

a character string indicating whether correlation ("cor", default) or partial correlation ("pcor") is to be used. Can be abbreviated. Ignored if a list of selected variables has been assigned to selectVar_external. Note: computing partial correlations of a large dataset can be very time-consuming.

selectVar_corMethod

a character string indicating which correlation coefficient is to be computed. One of "pearson" (default) or "spearman". Can be abbreviated. Ignored if a list of selected variables has been assigned to selectVar_external.

selectVar_minNum

an integer specifying the minimum number of selected metabolite variables (injection order and well position are not regarded as metabolite variables). If NULL, no lower limit is imposed, but at least 1 variable will be selected. Default: 5. Ignored if a list of selected variables has been assigned to selectVar_external.

selectVar_maxNum

an integer specifying the maximum number of selected metabolite variables (injection order and well position are not regarded as metabolite variables). If NULL, no upper limit is imposed, but no more than the number of available metabolite variables will be selected. Default: 10. Ignored if a list of selected variables has been assigned to selectVar_external.

selectVar_batchWise

(advanced) logical. Specify whether the variable selection should be performed based on each batch. Default: FALSE. Ignored if a list of selected variables has been assigned to selectVar_external. Note: batch-wise variable selection is provided for data requiring special processing (for example, data with strong batch effects). In most cases, batch-wise variable selection is not necessary, and setting TRUE can make the algorithm less robust.

mtry_percent

(advanced) a numeric vector indicating the percentages of selected variables randomly sampled as candidates at each split when training random forest models (base learners). Note: providing more values will include more base learners in the ensemble model, which will increase the processing time. Default: seq(0.2, 0.8, 0.2).

nodesize_percent

(advanced) a numeric vector indicating the percentages of sample size used as the minimum sizes of the terminal nodes in random forest models (base learners). Note: providing more values will include more base learners in the ensemble model, which will increase the processing time. Default: seq(0.2, 0.8, 0.2).
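For intuition, the percentage grids translate into absolute mtry and nodesize values roughly as follows (an illustrative sketch; the exact rounding inside TIGER may differ, and the counts here are hypothetical):

```r
# Hypothetical mapping from percentage grids to absolute randomForest
# hyperparameters (TIGER's internal rounding may differ):
n_selected_vars <- 10  # variables selected for one metabolite (example value)
n_train_samples <- 80  # number of QC training samples (example value)

mtry_grid     <- unique(round(seq(0.2, 0.8, 0.2) * n_selected_vars))
nodesize_grid <- unique(round(seq(0.2, 0.8, 0.2) * n_train_samples))

# Each (mtry, nodesize) pair yields one base learner; duplicates after
# rounding are removed, so the ensemble size depends on the data shape.
hyper_grid <- expand.grid(mtry = mtry_grid, nodesize = nodesize_grid)
nrow(hyper_grid)  # up to 4 * 4 = 16 base learners in this sketch
```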

...

(advanced) optional arguments (except mtry and nodesize) to be passed to randomForest for model training. Arguments mtry and nodesize are determined by mtry_percent and nodesize_percent. See randomForest and Examples. Note: providing more arguments will include more base learners into the ensemble model, which will increase the processing time.

parallel.cores

an integer (-1 or >= 1) specifying the number of cores for parallel computation. Set to -1 to run with all cores. Default: 2.
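A safe way to choose parallel.cores is to query the machine first with the base parallel package (the one-core headroom below is just a suggestion):

```r
# Query available cores and leave one free for the system:
library(parallel)
n_cores      <- detectCores()
cores_to_use <- max(1, n_cores - 1)  # pass this as parallel.cores
```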

Details

TIGER can effectively process most datasets with its default setup. The following hyperparameters are provided to customise the algorithm and achieve the best possible performance. They are also practical for some special purposes (such as cross-kit adjustment or longitudinal dataset correction) and for datasets requiring special processing (for example, data with very strong temporal drifts or batch effects). We recommend examining the normalised results with different metrics, such as RSD (relative standard deviation), MAPE (mean absolute percentage error) and PCA (principal component analysis), especially when using the advanced options of TIGER.

Hyperparameters for target value computation

Hyperparameters for variable selection

Hyperparameters for model construction

Value

This function returns a data.frame with the same data structure as the input test_samples, but the metabolite values are the normalised/corrected ones. NA and zeros in the original test_samples will not be changed or normalised.

References

Han S. et al. TIGER: technical variation elimination for metabolomics data using ensemble learning architecture. Briefings in Bioinformatics (2022) bbab535. doi: 10.1093/bib/bbab535.

Examples


data(FF4_qc) # load demo dataset

# QC as training samples; QC1, QC2 and QC3 as test samples:
train_samples <- FF4_qc[FF4_qc$sampleType == "QC",]
test_samples  <- FF4_qc[FF4_qc$sampleType != "QC",]

# col_sampleID includes labels. You can assign names for different samples:
train_samples$sampleID <- "train"
test_samples$sampleID  <- "test"

# Use default setting and
# include injection order and well position into feature set:
test_norm_1 <- run_TIGER(test_samples = test_samples,
                         train_samples = train_samples,
                         col_sampleID  = "sampleID",     # input column name
                         col_sampleType = "sampleType",  # input column name
                         col_batchID = "plateID",        # input column name
                         col_order = "injectionOrder",   # input column name
                         col_position = "wellPosition",  # input column name
                         parallel.cores = 2)

# If the information of injection order and well position is not available,
# or you don't want to use them:
train_data <- train_samples[-c(4:5)]  # remove the two columns
test_data  <- test_samples[-c(4:5)]   # remove the two columns

test_norm_2 <- run_TIGER(test_samples = test_data,
                         train_samples = train_data,
                         col_sampleID  = "sampleID",
                         col_sampleType = "sampleType",
                         col_batchID = "plateID",
                         col_order = NULL,                # set NULL
                         col_position = NULL,             # set NULL
                         parallel.cores = 2)

# To use external target values and selected variables with
# customised settings:
target_val <- compute_targetVal(QC_num = train_samples[-c(1:5)],
                                sampleType = train_samples$sampleType,
                                batchID = train_samples$plateID,
                                targetVal_method = "median",
                                targetVal_batchWise = TRUE)

select_var <- select_variable(train_num = train_samples[-c(1:5)],
                              test_num = test_samples[-c(1:5)],
                              train_batchID = train_samples$plateID,
                              test_batchID = test_samples$plateID,
                              selectVar_corType = "pcor",
                              selectVar_corMethod = "spearman",
                              selectVar_minNum = 10,
                              selectVar_maxNum = 30,
                              selectVar_batchWise = TRUE)

test_norm_3 <- run_TIGER(test_samples = test_samples,
                         train_samples = train_samples,
                         col_sampleID  = "sampleID",
                         col_sampleType = "sampleType",
                         col_batchID = "plateID",
                         col_order = "injectionOrder",
                         col_position = "wellPosition",
                         targetVal_external = target_val,
                         selectVar_external = select_var,
                         parallel.cores = 2)

# The definitions of other hyperparameters correspond to
# randomForest::randomForest().
# If you want to include more hyperparameter values in model training,
# supply them like this:
mtry_percent <- c(0.4, 0.8)
nodesize_percent <- c(0.4, 0.8)
replace <- c(TRUE, FALSE)
ntree <- c(100, 200, 300)

test_norm_4 <- run_TIGER(test_samples = test_data,
                         train_samples = train_data,
                         col_sampleID  = "sampleID",
                         col_sampleType = "sampleType",
                         col_batchID = "plateID",
                         mtry_percent = mtry_percent,
                         nodesize_percent = nodesize_percent,
                         replace = replace,
                         ntree = ntree,
                         parallel.cores = 2)

# test_norm_4 is corrected by the ensemble model consisting of base learners
# trained with (around) 24 different hyperparameter combinations:
expand.grid(mtry_percent, nodesize_percent, replace, ntree)

# Note: mtry and nodesize are calculated from mtry_percent and nodesize_percent.
#       Duplicated hyperparameter combinations, if any, will be removed.
#       Thus, the total number of hyperparameter combinations can be less than 24.
#       This is determined by the shape of your input datasets.


[Package TIGERr version 1.0.0 Index]