R: Supervised Grouping of Predictor Variables

pelora {supclust}

R Documentation

Supervised Grouping of Predictor Variables

Description

Performs selection and supervised grouping of predictor variables in large (microarray gene expression) datasets, with an option for simultaneous classification. Works in a greedy forward strategy and optimizes the binomial log-likelihood, based on estimated conditional probabilities from penalized logistic regression analysis.

Usage

pelora(x, y, u = NULL, noc = 10, lambda = 1/32, flip = "pm",
       standardize = TRUE, trace = 1)

Arguments

`x`	Numeric matrix of explanatory variables (`p` variables in columns, `n` cases in rows). For example, these can be microarray gene expression data which should be grouped.
`y`	Numeric vector of length `n` containing the class labels of the individuals. These labels have to be coded by 0 and 1.
`u`	Numeric matrix of additional (clinical) explanatory variables (`m` variables in columns, `n` cases in rows) that are used in the (penalized logistic regression) prediction model, but neither grouped nor averaged. For example, these can be 'traditional' clinical variables.
`noc`	Integer, the number of clusters that should be searched for on the data.
`lambda`	Real, defaults to 1/32. Rescaled penalty parameter that should be in `[0,1]`.
`flip`	Character string, describing a method how the `x` (gene expression) matrix should be sign-flipped. Possible are `"pm"` (the default) where the sign for each variable is determined upon its entering into the group, `"cor"` where the sign for each variable is determined a priori as the sign of the empirical correlation of that variable with the `y`-vector, and `"none"` where no sign-flipping is carried out.
`standardize`	Logical, defaults to `TRUE`. Is indicating whether the predictor variables (genes) should be standardized to zero mean and unit variance.
`trace`	Integer >= 0; when positive, the output of the internal loops is provided; `trace >= 2` provides output even from the internal C routines.

Value

pelora returns an object of class "pelora". The functions print and summary are used to obtain an overview of the variables (genes) that have been selected and the groups that have been formed. The function plot yields a two-dimensional projection into the space of the first two group centroids that pelora found. The generic function fitted returns the fitted values, these are the cluster representatives. coef returns the penalized logistic regression coefficients \theta_j for each of the predictors. Finally, predict is used for classifying test data with Pelora's internal penalized logistic regression classifier on the basis of the (gene) groups that have been found.

An object of class "pelora" is a list containing:

`genes`	A list of length `noc`, containing integer vectors consisting of the indices (column numbers) of the variables (genes) that have been clustered.
`values`	A numerical matrix with dimension `n \times \code{noc}`, containing the fitted values, i.e. the group centroids `\tilde{x}_j`.
`y`	Numeric vector of length `n` containing the class labels of the individuals. These labels are coded by 0 and 1.
`steps`	Numerical vector of length `noc`, showing the number of forward/backward cycles in the fitting process of each cluster.
`lambda`	The rescaled penalty parameter.
`noc`	The number of clusters that has been searched for on the data.
`px`	The number of columns (genes) in the `x`-matrix.
`flip`	The method that has been chosen for sign-flipping the `x`-matrix.
`var.type`	A factor with `noc` entries, describing whether the `j`th predictor is a group of predictors (genes) or a single (clinical) predictor variable.
`crit`	A list of length `noc`, containing numerical vectors that provide information about the development of the grouping criterion during the clustering.
`signs`	Numerical vector of length `p`, saying whether the `i`th variable (gene) should be sign-flipped (-1) or not (+1).
`samp.names`	The names of the samples (rows) in the `x`-matrix.
`gene.names`	The names of the variables (columns) in the `x`-matrix.
`call`	The function call.

Author(s)

Marcel Dettling, dettling@stat.math.ethz.ch

References

Marcel Dettling (2003) Finding Predictive Gene Groups from Microarray Data, see https://stat.ethz.ch/~dettling/supervised.html

Marcel Dettling and Peter Bühlmann (2002). Supervised Clustering of Genes. Genome Biology, 3(12): research0069.1-0069.15, doi: 10.1186/gb-2002-3-12-research0069.

Marcel Dettling and Peter Bühlmann (2004). Finding Predictive Gene Groups from Microarray Data. Journal of Multivariate Analysis 90, 106–131, doi: 10.1016/j.jmva.2004.02.012

Examples

## Working with a "real" microarray dataset
data(leukemia, package="supclust")

## Generating random test data: 3 observations and 250 variables (genes)
set.seed(724)
xN <- matrix(rnorm(750), nrow = 3, ncol = 250)

## Fitting Pelora
fit <- pelora(leukemia.x, leukemia.y, noc = 3)

## Working with the output
fit
summary(fit)
plot(fit)
fitted(fit)
coef(fit)

## Fitted values and class probabilities for the training data
predict(fit, type = "cla")
predict(fit, type = "prob")

## Predicting fitted values and class labels for the random test data
predict(fit, newdata = xN)
predict(fit, newdata = xN, type = "cla", noc = c(1,2,3))
predict(fit, newdata = xN, type = "pro", noc = c(1,3))

## Fitting Pelora such that the first 70 variables (genes) are not grouped
fit <- pelora(leukemia.x[, -(1:70)], leukemia.y, leukemia.x[,1:70])

## Working with the output
fit
summary(fit)
plot(fit)
fitted(fit)
coef(fit)

## Fitted values and class probabilities for the training data
predict(fit, type = "cla")
predict(fit, type = "prob")

## Predicting fitted values and class labels for the random test data
predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70])
predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70], "cla", noc  = 1:10)
predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70], type = "pro")

[Package supclust version 1.1-1 Index]