LPS {LPS}    R Documentation

Linear Predictor Score fitting

Description

This function trains a Linear Predictor Score model, given pre-computed coefficients. It uses data with known classes to fit the model.

It can be called in several ways, and not all arguments are mandatory. See the 'Examples' section.

Usage

  LPS(data, coeff, response, k, threshold, formula, method = "fdr", ...)

Arguments

data

Continuous data used to retrieve classes, as a data.frame or matrix, with samples in rows and features (genes) in columns. Rows and columns should be named. Data normalization requires some precautions; see the 'Normalization' section below.

coeff

Pre-computed coefficients for the model, as returned by LPS.coeff (see there for format details).

response

Already known classes for the samples provided in data, preferably as a two-level factor. Can be missing if a formula with a response element is provided, but this argument takes precedence when both are given.

k

Single integer value, number of features to include in the model, in decreasing order of coefficient. Can be missing if threshold or formula is provided, but this argument takes precedence over both of them.

threshold

Single numeric value, p-value threshold to apply for feature selection. Can be missing if k or formula is provided; k takes precedence over it, and it takes precedence over formula.

formula

A formula object, describing the model to fit (several templates are handled, see 'Examples'). The formula response element (before the "~" sign) can replace the response argument if it is not provided. The variables (after the "~" sign) can be a single integer (standing for the k argument), a single numeric (standing for the threshold argument) or a sum of feature names to use directly. "." is also handled in the usual way (all data columns), and "1" is a more efficient way to refer to all numeric columns of data.
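The difference between an integer and a numeric right-hand side comes from how R parses numeric literals. The base R sketch below only illustrates that difference. It does not reproduce the internal dispatch of LPS, and the object names (f_k, f_thr, ...) are illustrative.

```r
# A bare number is parsed as a double; the "L" suffix yields an integer
f_k   <- group ~ 10L    # integer right-hand side: would stand for 'k'
f_thr <- group ~ 0.01   # double right-hand side: would stand for 'threshold'
f_dbl <- group ~ 10     # still a double, although the value is whole

# The third element of a two-sided formula is its right-hand side
rhs_k   <- f_k[[3]]
rhs_thr <- f_thr[[3]]
rhs_dbl <- f_dbl[[3]]

is.integer(rhs_k)     # TRUE
is.integer(rhs_thr)   # FALSE
is.integer(rhs_dbl)   # FALSE

# A variable name is stored as a symbol; its type is known once evaluated
k <- as.integer(10)
f_var   <- group ~ k
rhs_var <- eval(f_var[[3]], environment(f_var))
is.integer(rhs_var)   # TRUE
```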

method

Single character value, to be passed to p.adjust when threshold is provided.
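As an illustration of what p.adjust does with the default method, the sketch below corrects a few made-up p-values and keeps those under a threshold. The p-values, the feature names and the 0.05 cut-off are hypothetical, and the exact comparison used inside LPS (strict or not) is an assumption.

```r
# Hypothetical raw p-values for five features
p <- c(g1 = 0.001, g2 = 0.01, g3 = 0.02, g4 = 0.2, g5 = 0.5)

# In p.adjust, "fdr" is an alias of "BH" (Benjamini-Hochberg)
adj <- p.adjust(p, method = "fdr")

# Features whose adjusted p-value falls under the threshold
# (assumption: strict comparison, LPS may use '<=' instead)
selected <- names(p)[adj < 0.05]
selected
```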

...

Further arguments are passed to model.frame if response is missing (and thus defined via formula). subset and na.action may be particularly useful for cross-validation schemes; see model.frame.default for details. subset is always handled but masked in "..." for compatibility reasons.
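To show what subset and na.action do at the model.frame level, here is a minimal base R sketch on a made-up data.frame. It calls model.frame directly rather than LPS, and the data, the held-out samples and the object names are hypothetical.

```r
set.seed(1)
# Hypothetical dataset: 10 samples, a two-level class and two features
df <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  g1 = rnorm(10),
  g2 = rnorm(10)
)
df$g2[3] <- NA   # one missing value, to show the effect of na.action

# Hold out samples 1 and 6, as one cross-validation fold would
train <- setdiff(seq_len(nrow(df)), c(1, 6))
mf <- model.frame(group ~ ., data = df, subset = train, na.action = na.omit)

nrow(mf)             # 7: the 8 training rows minus the one with a missing value
model.response(mf)   # classes of the retained samples
```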

Value

An object of (S3) class "LPS":

coeff

Named numeric vector, the coefficients used in the model.

classes

Character vector, the labels of the two groups to be predicted.

scores

List of two numeric vectors, training dataset scores sorted by group.

means

Numeric vector, score means of each group in the training dataset.

sds

Numeric vector, score standard deviations of each group in the training dataset.

ovl

Numeric value, overlapping coefficient as returned by OVL.

k

Integer value, number of features selected in the model (if relevant).

p.threshold

Numeric value, threshold used for feature selection (if relevant).

p.method

Character value, p-value correction used for feature selection (if relevant).
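To make the relation between coeff, scores, means and sds concrete, the sketch below computes a linear predictor score as a weighted sum of feature values and summarizes it by group. It is a minimal illustration on simulated data, not the package's code: the data, the coefficient values and the class labels are all hypothetical.

```r
set.seed(42)
# Hypothetical training data: 20 samples in rows, 3 features in columns
expr <- matrix(rnorm(60), nrow = 20,
               dimnames = list(paste0("s", 1:20), c("gA", "gB", "gC")))
group <- factor(rep(c("ABC", "GCB"), each = 10))

# Hypothetical coefficients, one per feature
coeff <- c(gA = 3.1, gB = -0.4, gC = 0.2)

# Score of a sample: sum of its feature values weighted by the coefficients
score <- drop(expr[, names(coeff), drop = FALSE] %*% coeff)

scores <- split(score, group)    # training scores, one vector per group
means  <- sapply(scores, mean)   # score mean of each group
sds    <- sapply(scores, sd)     # score standard deviation of each group
```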

Normalization

As expression values are directly used in the score, gene centering and scaling are strongly recommended. For Affymetrix raw expression values (strictly positive, linear and absolute), Wright et al. suggest a multiplicative centering on a median of 1000 followed by a log2 transformation. For log-ratios, gene centering and scaling should not be necessary, as they are naturally 0-centered.
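The transformations mentioned above can be sketched in base R as follows. The intensities are simulated, and applying the median centering per sample (per row) is an assumption made for this example; check the original publication before using it on real data.

```r
set.seed(7)
# Hypothetical raw intensities: 6 samples in rows, 4 genes in columns
raw <- matrix(rlnorm(24, meanlog = 7), nrow = 6,
              dimnames = list(paste0("s", 1:6), paste0("g", 1:4)))

# Multiplicative centering: rescale each sample so that its median is 1000
# (assumption: applied per sample, i.e. per row)
centered <- raw * (1000 / apply(raw, 1, median))

# log2 transformation
logged <- log2(centered)

# Gene centering and scaling: each column gets mean 0 and standard deviation 1
scaled <- scale(logged, center = TRUE, scale = TRUE)
```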

Time efficiency

Using a numeric matrix as data and a factor as response is the fastest way to compute coefficients, if time consumption matters (as in cross-validation schemes). formula is there only for consistency with R modeling functions, and to provide response, k or threshold through a single argument.

Author(s)

Sylvain Mareschal

References

Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002;9(3):505-11.

Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.

Bohers E, Mareschal S, Bouzelfen A, Marchand V, Ruminy P, Maingonnat C, Menard AL, Etancelin P, Bertrand P, Dubois S, Alcantara M, Bastard C, Tilly H, Jardin F. Targetable activating mutations are very frequent in GCB and ABC diffuse large B-cell lymphoma. Genes Chromosomes Cancer. 2014 Feb;53(2):144-53.

See Also

LPS.coeff

Examples

  # Data with features in columns
  data(rosenwald)
  group <- rosenwald.cli$group
  expr <- t(rosenwald.expr)
  
  # NA imputation (feature's mean to minimize impact)
  f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x }
  expr <- apply(expr, 2, f)
  
  # Coefficients
  coeff <- LPS.coeff(data=expr, response=group)
  
  
  # 10 best features (straightforward)
  m <- LPS(data=expr, coeff=coeff, response=group, k=10)
  
  # 10 best features (formula)
  ### 'k' MUST be an integer, or will be understood as a 'threshold'
  ### Numbers are "numeric", enforce integer with "L" or "as.integer"
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~10L)
  k <- as.integer(10)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~k)
  
  # FDR threshold
  thr <- 0.01
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=thr)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~0.01)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~thr)
  
  # Custom model
  m <- LPS(data=expr, coeff=coeff[ c("27481","17013") ,], response=group, k=2)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~`27481`+`17013`)
  ### Notice backticks in formula for syntactically invalid names
  
  # Complete model
  m <- LPS(data=expr, coeff=coeff, response=group, k=ncol(expr))
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=1)
  ### m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~.)
  ### The last is correct but (really) slow on large datasets

[Package LPS version 1.0.16 Index]