panPca {micropan} | R Documentation |
Principal component analysis of a pan-matrix
Description
Computes a principal component decomposition of a pan-matrix, with possible scaling and weightings.
Usage
panPca(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix)))
Arguments
pan.matrix |
A pan-matrix, see |
scale |
An optional scale to control how copy numbers should affect the distances. |
weights |
Vector of optional weights of gene clusters. |
Details
A principal component analysis (PCA) can be computed for any matrix, including a pan-matrix. The principal components will in this case be linear combinations of the gene clusters. One major idea behind PCA is to truncate the space: instead of considering the genomes as points in a high-dimensional space spanned by all gene clusters, we look for a few ‘smart’ combinations of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions.
The ‘scale’ can be used to control how copy number differences play a role in the PCA. Usually we assume that going from 0 to 1 copy of a gene is the big change in a genome, and that going from 1 to 2 (or more) copies matters less. Prior to computing the PCA, the ‘pan.matrix’ is transformed according to the following affine mapping: If the original value in ‘pan.matrix’ is ‘x’, and ‘x’ is not 0, then the transformed value is ‘1 + (x-1)*scale’. Values of 0 are left as 0. Note that with ‘scale=0.0’ (default) this will result in 1 regardless of how large ‘x’ was. In this case the PCA only distinguishes between presence and absence of gene clusters. If ‘scale=1.0’ the value ‘x’ is left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 1 copy and 0 copies. For any ‘scale’ between 0.0 and 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA should be affected, and to what degree, by differences in copy numbers beyond 1.
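The mapping described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the helper name ‘scale_copies’ and the toy matrix are hypothetical and exist only for this example.

```r
# Hypothetical helper (not part of micropan) illustrating the mapping
# x -> 1 + (x - 1) * scale for non-zero entries; zeros are left as zeros.
scale_copies <- function(x, scale = 0) {
  out <- x
  idx <- x > 0
  out[idx] <- 1 + (x[idx] - 1) * scale
  out
}

# Toy pan-matrix: 2 genomes (rows) by 3 gene clusters (columns)
toy <- matrix(c(0, 1, 3,
                2, 0, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("GID1", "GID2"), c("C1", "C2", "C3")))

pa   <- scale_copies(toy, scale = 0)    # presence/absence only
half <- scale_copies(toy, scale = 0.5)  # copy numbers shrunk towards 1
full <- scale_copies(toy, scale = 1)    # copy numbers left unchanged
```

With ‘scale = 0.5’, the 3 copies of cluster C3 in GID1 become 1 + (3-1)*0.5 = 2, so extra copies still count, but less than in the raw matrix.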
The PCA may also up- or downweight some clusters compared to others. The vector ‘weights’ must contain one value for each column in ‘pan.matrix’. The default is to use flat weights, i.e. all clusters count equally. See geneWeights for alternative weighting strategies.
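As a rough sketch of what such a weight vector looks like, the base R example below builds flat weights and a custom vector for a toy matrix. It assumes the weights act as per-column multipliers on the transformed pan-matrix; that is an assumption made for illustration and is not stated on this page. The toy matrix and the weight values are hypothetical.

```r
# Toy pan-matrix: 3 genomes (rows) by 4 gene clusters (columns)
toy <- matrix(c(1, 0, 1, 1,
                1, 1, 0, 1,
                0, 1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("GID", 1:3), paste0("C", 1:4)))

# Flat weights (the documented default): one value per column, all equal to 1
flat <- rep(1, ncol(toy))

# Hypothetical custom weights that downweight cluster C1
w <- c(0.2, 1, 1, 1)
stopifnot(length(w) == ncol(toy))  # one weight per gene cluster

# ASSUMPTION: each column is multiplied by its weight before the PCA
weighted <- sweep(toy, 2, w, `*`)
unchanged <- sweep(toy, 2, flat, `*`)
```

Under this assumption, flat weights leave the matrix unchanged, while a weight below 1 shrinks the spread of that cluster's column and thereby its influence on the components.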
Value
A list with three tables:
‘Evar.tbl’ has two columns, one listing the component number and one listing the relative explained variance for each component. The relative explained variance always sums to 1.0 over all components. This value indicates the importance of each component, and it is always in descending order, the first component being the most important. This is typically the first result you look at after a PCA has been computed, as it indicates how many components (directions) you need to capture the bulk of the total variation in the data.
‘Scores.tbl’ has a column listing the ‘GID.tag’ for each genome, and then one column for each principal component. The columns are ordered corresponding to the rows in ‘Evar.tbl’. The scores are the coordinates of each genome in the principal component space.
‘Loadings.tbl’ is similar to ‘Scores.tbl’ but contains values for each gene cluster instead of each genome. The columns are ordered corresponding to the rows in ‘Evar.tbl’. The loadings are the contributions from each gene cluster to the principal component directions. NOTE: Only gene clusters having a non-zero variance are used in a PCA. Gene clusters with the same value for every genome have no impact and are discarded from ‘Loadings.tbl’.
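The quantities in these three tables can be illustrated with ‘prcomp’ from base R on a toy matrix. This is only a sketch of the general PCA bookkeeping (dropping constant columns, relative explained variance, scores, loadings); whether panPca is implemented in exactly this way is not confirmed here, and the toy data are hypothetical.

```r
# Toy pan-matrix: 4 genomes (rows) by 4 gene clusters (columns).
# Cluster C4 is present in every genome, so it has zero variance.
toy <- matrix(c(1, 0, 1, 1,
                1, 1, 0, 1,
                0, 1, 1, 1,
                1, 1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("GID", 1:4), paste0("C", 1:4)))

# Keep only gene clusters that vary across genomes
keep <- apply(toy, 2, var) > 0
pca <- prcomp(toy[, keep, drop = FALSE])

# Relative explained variance: sums to 1 and is in descending order
evar <- pca$sdev^2 / sum(pca$sdev^2)
cum.evar <- cumsum(evar)   # how many components capture the bulk?

scores   <- pca$x          # one row per genome
loadings <- pca$rotation   # one row per retained gene cluster
```

Here ‘cum.evar’ is a convenient way to read off how many components are needed to reach, say, 90 percent of the total variation, and the constant cluster C4 has no row in ‘loadings’.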
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# Loading a pan-matrix in this package
data(xmpl.panmat)
# Computing panPca
ppca <- panPca(xmpl.panmat)
## Not run:
# Plotting explained variance
library(ggplot2)
ggplot(ppca$Evar.tbl) +
  geom_col(aes(x = Component, y = Explained.variance))
# Plotting scores
ggplot(ppca$Scores.tbl) +
  geom_text(aes(x = PC1, y = PC2, label = GID.tag))
# Plotting loadings
ggplot(ppca$Loadings.tbl) +
  geom_text(aes(x = PC1, y = PC2, label = Cluster))
## End(Not run)