select_variable {TIGERr}    R Documentation

Select variables for ensemble learning architecture

Description

This function provides an advanced option to select metabolite variables from external dataset(s). The selected variables (as a list) can be further passed to argument selectVar_external in function run_TIGER for a customised data correction.

Usage

select_variable(
  train_num,
  test_num = NULL,
  train_batchID = NULL,
  test_batchID = NULL,
  selectVar_corType = c("cor", "pcor"),
  selectVar_corMethod = c("spearman", "pearson"),
  selectVar_minNum = 5,
  selectVar_maxNum = 10,
  selectVar_batchWise = FALSE,
  coerce_numeric = FALSE
)

Arguments

train_num

a numeric data.frame including only the metabolite values of training samples (can be quality control samples). Information such as injection order or well position needs to be excluded. Row: sample. Column: metabolite variable. See Examples.

test_num

an optional numeric data.frame including the metabolite values of test samples (can be subject samples). If provided, the column names of test_num should correspond to the column names of train_num. Row: sample. Column: metabolite variable. If NULL, the variables will be selected based on train_num only. See Examples.

train_batchID

NULL or a vector corresponding to train_num to specify the batch of each sample. Ignored if selectVar_batchWise = FALSE. See Examples.

test_batchID

NULL or a vector corresponding to test_num to specify the batch of each sample. Ignored if selectVar_batchWise = FALSE. See Examples.

selectVar_corType

a character string indicating whether correlation ("cor", default) or partial correlation ("pcor") is to be used. Can be abbreviated. See Details. Note: computing partial correlations for a large dataset can be very time-consuming.

selectVar_corMethod

a character string indicating which correlation coefficient is to be computed. One of "spearman" (default) or "pearson". Can be abbreviated. See Details.

selectVar_minNum

an integer specifying the minimum number of selected variables. If NULL, no minimum is imposed, but at least 1 variable is selected. See Details. Default: 5.

selectVar_maxNum

an integer specifying the maximum number of selected variables. If NULL, no maximum is imposed, but at most ncol(train_num) - 1 variables are selected. See Details. Default: 10.

selectVar_batchWise

(advanced) logical. Specify whether the variable selection should be performed separately for each batch. Default: FALSE. Note: if TRUE, the batch ID of each sample is required. Batch-wise variable selection is supported for data requiring special processing (for example, data with strong batch effects). In most cases, batch-wise variable selection is not necessary, and setting this to TRUE might make the algorithm less robust. See Details.

coerce_numeric

logical. If TRUE, values in train_num and test_num will be coerced to numeric before the computation. Columns that cannot be coerced will be removed (with warnings). See Examples. Default: FALSE.
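
The behaviour described for coerce_numeric can be pictured with a minimal base R sketch. This is only an illustration under stated assumptions, not the code used inside TIGERr: the helper name coerce_or_drop and the toy data are hypothetical, and the rule used to decide that a column "cannot be coerced" (coercion introduces new NA values) is an assumption.

```r
# Hypothetical helper (not part of TIGERr): coerce every column to numeric
# and drop the columns that cannot be converted, with a warning.
coerce_or_drop <- function(df) {
  converted <- lapply(df, function(col) {
    suppressWarnings(as.numeric(as.character(col)))
  })
  # Assumption: a column fails if coercion introduced NAs that were not
  # already present in the original column.
  failed <- mapply(function(old, new) any(is.na(new) & !is.na(old)),
                   df, converted)
  if (any(failed)) {
    warning("Removed non-numeric column(s): ",
            paste(names(df)[failed], collapse = ", "))
  }
  as.data.frame(converted[!failed], check.names = FALSE)
}

# Toy data: one ID column, one numeric column stored as text, one numeric column.
toy <- data.frame(sampleID = c("a", "b", "c"),
                  M1 = c("1.5", "2.0", "3.1"),
                  M2 = c(10, 20, 30),
                  stringsAsFactors = FALSE)

cleaned <- suppressWarnings(coerce_or_drop(toy))
names(cleaned)  # "M1" "M2"
```

In this sketch the sampleID column is dropped because none of its values can be read as numbers, while M1 is kept because its text values convert cleanly.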

Details

See run_TIGER.
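
As a rough illustration of how the selectVar_corMethod, selectVar_minNum and selectVar_maxNum arguments could interact, the sketch below ranks variables by absolute correlation with a target variable and keeps a bounded number of them. It is a simplified sketch, not the TIGERr implementation: the helper pick_correlated, the 0.5 cut-off and the toy data are assumptions made for the example, and partial correlation ("pcor") is not shown.

```r
# Hypothetical helper (not TIGERr code): for one target variable, rank the
# other variables by absolute correlation and keep between min_num and
# max_num of them.
pick_correlated <- function(num, target, method = "spearman",
                            min_num = 5, max_num = 10, threshold = 0.5) {
  cors <- cor(num, method = method)[target, ]
  cors <- abs(cors[names(cors) != target])      # never select the target itself
  ranked <- sort(cors, decreasing = TRUE)
  n_keep <- sum(ranked > threshold)             # assumed cut-off, for illustration
  n_keep <- max(min_num, min(max_num, n_keep))  # clamp to [min_num, max_num]
  n_keep <- min(n_keep, length(ranked))         # cannot exceed ncol(num) - 1
  names(ranked)[seq_len(n_keep)]
}

# Toy data: 20 samples, 8 simulated "metabolites".
set.seed(1)
toy <- as.data.frame(matrix(rnorm(20 * 8), nrow = 20,
                            dimnames = list(NULL, paste0("M", 1:8))))

# One character vector of selected variables per target variable.
selected <- lapply(setNames(names(toy), names(toy)), function(v) {
  pick_correlated(toy, v, min_num = 2, max_num = 4)
})
selected[["M1"]]
```

Swapping method = "spearman" for method = "pearson" changes only the coefficient passed to cor(); the clamping between the minimum and maximum number stays the same.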

Value

If selectVar_batchWise = FALSE, the function returns a list of length one containing the selected variables computed on the whole dataset.

If selectVar_batchWise = TRUE, a list containing the selected variables computed on different batches is returned. The length of the returned list equals the number of batches specified by test_batchID and/or train_batchID.

Examples


data(FF4_qc) # load demo dataset

# QC as training samples; QC1, QC2 and QC3 as test samples:
train_samples <- FF4_qc[FF4_qc$sampleType == "QC",]
test_samples  <- FF4_qc[FF4_qc$sampleType != "QC",]

# Only numeric data of metabolite variables are allowed:
train_num <- train_samples[-c(1:5)]
test_num  <- test_samples[-c(1:5)]

# If the selection is performed on the whole dataset:
# based on training samples only:
selected_var_1 <- select_variable(train_num = train_num,
                                  test_num  = NULL,
                                  selectVar_batchWise = FALSE)

# also consider test samples:
selected_var_2 <- select_variable(train_num = train_num,
                                  test_num  = test_num,
                                  selectVar_batchWise = FALSE)

# If the selection is based on different batches:
# (If selectVar_batchWise = TRUE, batch IDs are required.)
selected_var_3 <- select_variable(train_num = train_num,
                                  test_num  = NULL,
                                  train_batchID = train_samples$plateID,
                                  test_batchID  = NULL,
                                  selectVar_batchWise = TRUE)

# If coerce_numeric = TRUE,
# columns that cannot be coerced to numeric will be removed (with warnings).
# (In this example, the injection order and well position columns are excluded
# beforehand, because we don't want to calculate the correlations between
# metabolites and injection order/well position.)
selected_var_4 <- select_variable(train_num = train_samples[-c(4,5)],
                                  train_batchID = train_samples$plateID,
                                  selectVar_batchWise = TRUE,
                                  coerce_numeric = TRUE)
identical(selected_var_3, selected_var_4)  # identical to selected_var_3

## Not run: 

# will throw errors if input data have non-numeric columns
# and coerce_numeric = FALSE:

selected_var_5 <- select_variable(train_num = train_samples[-c(4,5)],
                                  coerce_numeric = FALSE)

## End(Not run)

[Package TIGERr version 1.0.0 Index]