R: Get biomarkers discriminating between two classes

get.biom {BioMark}

R Documentation

Get biomarkers discriminating between two classes

Description

Biomarkers can be identified in several ways. The classical approach is to select those variables with large model coefficients or large t statistics. A second approach is based on higher criticism (HC), and a third assesses the stability of these coefficients under subsampling of the data set.
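
As an illustration of the HC idea mentioned above, the following base R sketch thresholds a vector of p values using one commonly used form of the HC objective. The helper `hc_select` and its argument `alpha0` are hypothetical names introduced here; this is not BioMark's own implementation, which may use a different variant of the statistic.

```r
## Hypothetical helper, not part of BioMark: HC thresholding of p values
hc_select <- function(pvals, alpha0 = 0.1) {
  n <- length(pvals)
  ord <- order(pvals)                 # positions of p values, smallest first
  p.sorted <- pvals[ord]
  i <- seq_len(n)
  ## HC objective: standardized gap between expected and observed p value
  hc <- sqrt(n) * (i / n - p.sorted) / sqrt(i / n * (1 - i / n))
  k.max <- max(1, floor(alpha0 * n))  # only consider the smallest p values
  k <- which.max(hc[seq_len(k.max)])  # cut-off maximizing the objective
  sort(ord[seq_len(k)])               # indices of the selected variables
}

set.seed(7)
pvals <- c(runif(5, 0, 1e-4), runif(195))  # 5 strong signals among 200 variables
hc_select(pvals)
```

The indices are returned in the original variable order, in the same spirit as the biom.indices element described under Value.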

Usage

get.biom(X, Y, fmethod = "all", type = c("stab", "HC", "coef"),
         ncomp = 2, biom.opt = biom.options(), scale.p = "auto",
         ...)
## S3 method for class 'BMark'
coef(object, ...)
## S3 method for class 'BMark'
print(x, ...)
## S3 method for class 'BMark'
summary(object, ...)

Arguments

`X`	Data matrix. Usually the number of columns (variables) is (much) larger than the number of rows (samples).
`Y`	Class indication. For classification with two or more classes, a factor; a numeric vector will be interpreted as a regression situation, which can only be tackled by `fmethod = "lasso"`.
`fmethod`	Modelling method(s) employed. The default is to use `"all"`, which will test all methods in the current `biom.options$fmethods` list. Note that from version 0.4.0, `"plsda"` and `"pclda"` are no longer in the list of methods - they have been replaced by `"pls"` and `"pcr"`, respectively. For compatibility reasons, using the old terms will not lead to an error but only a warning.
`type`	Selection criterion: coefficient size (`"coef"`), stability under subsampling (`"stab"`), or higher criticism (`"HC"`).
`ncomp`	Number of latent variables to use in PCR and PLS (VIP) modelling. In function `get.biom` this may be a vector; in all other functions it should be one number. Default: 2.
`biom.opt`	Options for the biomarker selection - a list with several named elements. See `biom.options`.
`scale.p`	Scaling. This is performed individually in every cross-validation iteration and can have a profound effect on the results. Default: `"auto"` (autoscaling). Other possible choices: `"none"` for no scaling, `"pareto"` for Pareto scaling, and `"log"` and `"sqrt"` for log and square root scaling, respectively.
`object`, `x`	A BMark object.
`...`	Further arguments for modelling functions. Often used to catch unused arguments.
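
The scaling choices named for `scale.p` can be illustrated with a small base R sketch. The helper `scale_sketch` is hypothetical and not part of BioMark. It assumes the conventional definitions (autoscaling: mean-centre and divide by the standard deviation; Pareto scaling: mean-centre and divide by the square root of the standard deviation); the package's internal implementation may differ, for instance in whether the log and square root options also mean-centre.

```r
## Hypothetical helper, not part of BioMark: conventional column scalings
scale_sketch <- function(X, method = c("auto", "pareto", "log", "sqrt", "none")) {
  method <- match.arg(method)
  switch(method,
         auto   = scale(X, center = TRUE, scale = TRUE),
         pareto = scale(X, center = TRUE, scale = sqrt(apply(X, 2, sd))),
         log    = log(X),
         sqrt   = sqrt(X),
         none   = X)
}

set.seed(1)
X <- matrix(rexp(20 * 5, rate = 0.1), nrow = 20)  # positive toy data
apply(scale_sketch(X, "auto"), 2, sd)    # unit standard deviation per column
apply(scale_sketch(X, "pareto"), 2, sd)  # square root of the original column sds
```

Autoscaling gives every variable equal weight, whereas Pareto scaling only partly reduces the dominance of high-variance variables, which is one reason the choice can change which variables are selected.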

Value

Function get.biom returns an object of class "BMark": a list containing one element for every fmethod that is selected, as well as an element info. The content of the individual elements depends on the type chosen. For type == "coef", the only element returned is a matrix containing coefficient sizes. For type == "HC" and type == "stab", a list is returned containing the element biom.indices and either pvals (for type == "HC") or fraction.selected (for type == "stab").

Element biom.indices contains the indices of the selected variables and can be extracted using function selection. Element pvals contains the p values used to perform HC thresholding; these are presented in the original order of the variables, and can be obtained directly from e.g. t statistics, or from permutation sampling. Element fraction.selected indicates in what fraction of the stability selection iterations a particular variable has been selected: the more often a variable has been selected, the more stable it is as a biomarker.

The coef method for BMark objects extracts model coefficients, p values or stability fractions for types "coef", "HC" and "stab", respectively.
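
To make the meaning of fraction.selected concrete, the following base R sketch computes a selection fraction on simulated data. It is not BioMark's implementation: the subsampling scheme (leaving out 3 samples per class), the number of iterations, and the rule of keeping the 10 variables with the largest absolute t statistic are all assumptions chosen for illustration.

```r
set.seed(42)
n.per.class <- 10
p <- 50
X <- matrix(rnorm(2 * n.per.class * p), ncol = p)
Y <- factor(rep(1:2, each = n.per.class))
X[Y == "2", 1:5] <- X[Y == "2", 1:5] + 3  # variables 1-5 are the true markers

n.iter <- 100   # number of subsampling iterations (assumed)
n.keep <- 10    # variables kept per iteration (assumed)
counts <- numeric(p)
for (iter in seq_len(n.iter)) {
  ## leave out 3 randomly chosen samples from each class
  idx <- unlist(lapply(split(seq_along(Y), Y),
                       function(ii) sample(ii, length(ii) - 3)))
  g <- Y[idx]
  tstats <- apply(X[idx, ], 2,
                  function(x) t.test(x[g == "1"], x[g == "2"])$statistic)
  top <- order(abs(tstats), decreasing = TRUE)[seq_len(n.keep)]
  counts[top] <- counts[top] + 1
}
fraction.selected <- counts / n.iter  # one value per variable, original order
which(fraction.selected > 0.8)        # variables selected in most iterations
```

Variables that are selected only in a few subsamples receive a low fraction, which is what makes this quantity useful as a stability criterion.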

Author(s)

Ron Wehrens

Examples

## Real apple data (small set)
data(spikedApples)
apple.coef <- get.biom(X = spikedApples$dataMatrix,
                       Y = factor(rep(1:2, each = 10)),
                       ncomp = 2:3, type = "coef")
coef.sizes <- coef(apple.coef) 
sapply(coef.sizes, range)

## stability-based selection
set.seed(17)
apple.stab <- get.biom(X = spikedApples$dataMatrix,
                       Y = factor(rep(1:2, each = 10)),
                       ncomp = 2:3, type = "stab")
selected.variables <- selection(apple.stab)
unlist(sapply(selected.variables, function(x) sapply(x, length)))
## Number of selected variables: ranging from more than 70 for pcr,
## approx. 40 for pls and Student t, to 0-29 for the lasso
unlist(sapply(selected.variables,
              function(x) lapply(x, function(xx, y) sum(xx %in% y),
                                 spikedApples$biom)))
## TPs (stab): all find 5/5, except pcr.2 and the lasso with values for lambda
## larger than 0.0484

unlist(sapply(selected.variables,
              function(x) lapply(x, function(xx, y) sum(!(xx %in% y)),
                                 spikedApples$biom)))
## FPs (stab): PCR finds most FPs (approx. 60), other latent-variable
## methods approx 40, lasso allows for the optimal selection around 
## lambda = 0.0702

## regression example
data(gasoline) ## from the pls package
gasoline.stab <- get.biom(gasoline$NIR, gasoline$octane,
                          fmethod = c("pcr", "pls", "lasso"), type = "stab")


## Not run: 
## Same for HC-based selection
## Warning: takes a long time!
apple.HC <- get.biom(X = spikedApples$dataMatrix,
                     Y = factor(rep(1:2, each = 10)),
                     ncomp = 2:3, type = "HC")
sapply(apple.HC[names(apple.HC) != "info"],
       function(x, y) sum(x$biom.indices %in% y),
       spikedApples$biom)
sapply(apple.HC[names(apple.HC) != "info"],
       function(x, y) sum(!(x$biom.indices %in% y)),
       spikedApples$biom)

## End(Not run)

[Package BioMark version 0.4.5 Index]