SamplePCA {ClassDiscovery}    R Documentation

Class "SamplePCA"


Perform principal components analysis on the samples (columns) from a microarray or proteomics experiment.


SamplePCA(data, splitter=0, usecor=FALSE, center=TRUE)
## S4 method for signature 'SamplePCA,missing'
plot(x, splitter=x@splitter, col, main='', which=1:2, ...)



data: Either a data frame or matrix with numeric values or an ExpressionSet as defined in the Bioconductor tools for analyzing microarray data.


splitter: If data is a data frame or matrix, then splitter must be either a logical vector or a factor. If data is an ExpressionSet, then splitter can be a character string that names one of the factor columns in the associated phenoData subobject.


center: A logical value; should the rows of the data matrix be centered first?


usecor: A logical value; should the rows of the data matrix be scaled to have standard deviation 1?


x: A SamplePCA object


col: A list of colors to represent each level of the splitter in the plot. If this parameter is missing, the function will select colors automatically.


main: A character string; the plot title


which: A numeric vector of length two specifying which two principal components should be included in the plot.


...: Additional graphical parameters for plot



The main reason for developing the SamplePCA class is that the princomp function is very inefficient when the number of variables (in the microarray setting, genes) far exceeds the number of observations (in the microarray setting, biological samples). The princomp function begins by computing the full covariance matrix, which gets rather large in a study involving tens of thousands of genes. The SamplePCA class, by contrast, uses singular value decomposition (svd) on the original data matrix to compute the principal components.
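
The difference can be sketched in base R. The following is a minimal illustration, not the package's actual code: the matrix sizes, the variable names, and the division by N - 1 are assumptions made for the example. It centers each row, applies svd to the P x N matrix directly, and cross-checks the result against prcomp on the transposed data.

```r
set.seed(1)
P <- 2000   # number of genes (rows); assumed size for illustration
N <- 12     # number of samples (columns)
X <- matrix(rnorm(P * N), nrow = P, ncol = N)

# center each row (gene) so it has mean zero across the samples
Xc <- sweep(X, 1, rowMeans(X))

# decompose the P x N matrix directly; no P x P covariance matrix is formed
s <- svd(Xc)
variances <- s$d^2 / (N - 1)   # variance along each component (assumed scaling)
scores <- t(Xc) %*% s$u        # N x N: each sample projected onto each component

# cross-check against prcomp, treating the samples as observations
pc <- prcomp(t(X), center = TRUE, scale. = FALSE)
max(abs(abs(scores) - abs(pc$x)))   # close to zero; components agree up to sign
```

Here the largest object is the P x N matrix of components, whereas a covariance-based approach would need a P x P matrix (2000 x 2000 in this sketch).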

The base functions screeplot, which produces a barplot of the percentage of variance explained by each component, and plot, which produces a scatter plot comparing two selected components (defaulting to the first two), have been generalized as methods for the SamplePCA class. You can add sample labels to the scatter plot using either the text or identify methods. Note, however, that the current implementation of these methods only works when plotting the first two components.
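
To show what these two displays convey, here is a rough base-R sketch that builds them by hand from prcomp output. It does not call the SamplePCA methods; the simulated data and the names pct and comps are assumptions for illustration only.

```r
set.seed(2)
X <- matrix(rnorm(500 * 8), nrow = 500, ncol = 8)   # 500 genes, 8 samples
colnames(X) <- paste0("S", 1:8)

pc <- prcomp(t(X))                        # samples as observations
pct <- 100 * pc$sdev^2 / sum(pc$sdev^2)   # percent of variance per component

# bar chart of the percentage of variance explained by each component
barplot(pct, names.arg = seq_along(pct),
        xlab = "Component", ylab = "Percent of variance")

# scatter plot of two selected components, with sample labels added by text
comps <- 1:2
plot(pc$x[, comps], xlab = "PC1", ylab = "PC2")
text(pc$x[, comps], labels = rownames(pc$x), pos = 3)
```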


The SamplePCA function constructs and returns an object of the SamplePCA class. We assume that the input data matrix has N columns (of biological samples) and P rows (of genes).

The predict method returns a matrix whose size is the number of columns in the input by the number of principal components.
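
The shape of that result can be illustrated with a small base-R sketch. The helper project below is hypothetical and is not the package's predict method; it assumes that projection means subtracting the stored row means and multiplying by the component matrix, without any scaling.

```r
set.seed(3)
P <- 300   # genes (rows)
N <- 10    # samples (columns)
X <- matrix(rnorm(P * N), nrow = P)

shift <- rowMeans(X)                   # length-P mean vector used for centering
comps <- svd(sweep(X, 1, shift))$u     # P x N matrix of components

# hypothetical helper: project the columns of newdata (P x M) into component space
project <- function(newdata, shift, comps) {
  t(sweep(newdata, 1, shift)) %*% comps   # M x N result
}

newdata <- matrix(rnorm(P * 3), nrow = P)   # three new samples
out <- project(newdata, shift, comps)
dim(out)   # one row per new column, one column per component
```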

Objects from the Class

Objects should be created using the SamplePCA function. In the simplest case, you pass in a data matrix and a logical vector, splitter, assigning classes to the columns, and the constructor performs principal components analysis on the columns. The constructor does not use the splitter in the analysis; it is simply saved for use by the plotting routines. If you omit the splitter, then no grouping structure is used in the plots.

If you pass splitter as a factor instead of a logical vector, then the plotting routine will distinguish all levels of the factor. The code is likely to fail, however, if one of the levels of the factor has zero representatives among the data columns.
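
One way to guard against empty levels, sketched in base R with a made-up factor, is to check the level counts and drop unused levels before passing the factor as the splitter.

```r
# a factor that declares level "c" but contains no observations of it
kind <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))

counts <- table(kind)
has_empty <- any(counts == 0)   # TRUE here, because "c" never occurs

# keep only the levels that actually occur in the data
kind <- droplevels(kind)
levels(kind)
```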

We can also perform PCA on ExpressionSet objects from the Bioconductor packages. In this case, we pass in an ExpressionSet object along with a character string containing the name of a factor to use for splitting the data.



A matrix of size NxN, where N is the number of columns in the input, representing the projections of the input columns onto the first N principal components.


A numeric vector of length N; the amount of the total variance explained by each principal component.


A matrix of size PxN (the same size as the input matrix) containing each of the first N principal components as columns.


A logical vector or factor of length N classifying the columns into known groups.


A logical value; was the data standardized?


A numeric vector of length P; the mean vector of the input data, which is used for centering by the predict method.


A numeric vector of length P; the standard deviation of the input data, which is used for scaling by the predict method.


An object of class call that records how the object was created.



signature(x = SamplePCA, y = missing): Plot the samples in a two-dimensional principal component space.


signature(object = SamplePCA): Project new data into the principal component space.


signature(x = SamplePCA): Produce a bar chart of the variances explained by each principal component.


signature(object = SamplePCA): Write out a summary of the object.


signature(object = SamplePCA): Interactively identify points in the plot of a SamplePCA object.


signature(object = SamplePCA): Add sample identifiers to the scatter plot of a SamplePCA object, using the base text function.


Kevin R. Coombes

See Also

princomp, GenePCA



## simulate data from three different groups
d1 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d2 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d3 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
dd <- cbind(d1, d2, d3)
kind <- factor(rep(c('red', 'green', 'blue'), each=10))
colnames(dd) <- paste(kind, rep(1:10, 3), sep='')

## perform PCA
spc <- SamplePCA(dd, splitter=kind)

## plot the results
plot(spc, col=levels(kind))

## mark the group centers
x1 <- predict(spc, matrix(apply(d1, 1, mean), ncol=1))
points(x1[1], x1[2], col='red', cex=2)
x2 <- predict(spc, matrix(apply(d2, 1, mean), ncol=1))
points(x2[1], x2[2], col='green', cex=2)
x3 <- predict(spc, matrix(apply(d3, 1, mean), ncol=1))
points(x3[1], x3[2], col='blue', cex=2)

## check out the variances
screeplot(spc)

## cleanup
rm(d1, d2, d3, dd, kind, spc, x1, x2, x3)

[Package ClassDiscovery version 3.4.0 Index]