gl.pcoa {dartR.base}R Documentation

Ordination applied to genotypes in a genlight object (PCA), in an fd object, or to a distance matrix (PCoA)

Description

This function takes the genotypes for individuals and undertakes a Pearson Principal Component analysis (PCA) on SNP or Sequence tag P/A (SilicoDArT) data; it undertakes a Gower Principal Coordinate analysis (PCoA) if supplied with a distance matrix.

Usage

gl.pcoa(
  x,
  nfactors = 5,
  correction = NULL,
  mono.rm = TRUE,
  parallel = FALSE,
  n.cores = 1,
  plot.out = TRUE,
  plot_theme = theme_dartR(),
  plot_colors = gl.colors(2),
  plot.file = NULL,
  plot.dir = NULL,
  verbose = NULL
)

Arguments

x

Name of the genlight object or fd object containing the SNP data, or a distance matrix of type dist [required].

nfactors

Number of axes to retain in the output of factor scores [default 5].

correction

Method applied to correct for negative eigenvalues, either 'lingoes' or 'cailliez' [Default NULL].

mono.rm

If TRUE, remove monomorphic loci [default TRUE].

parallel

TRUE if parallel processing is required (does fail under Windows) [default FALSE].

n.cores

Number of cores to use if parallel processing is requested [default 16].

plot.out

If TRUE, a diagnostic plot is displayed showing a scree plot for the "informative" axes and a histogram of eigenvalues of the remaining "noise" axes [Default TRUE].

plot_theme

Theme for the plot. See Details for options [default theme_dartR()].

plot_colors

List of two color names for the borders and fill of the plot [default gl.colors(2)].

plot.file

Name for the RDS binary file to save (base name only, exclude extension) [default NULL]

plot.dir

Directory to save the plot RDS files [default as specified by the global working directory or tempdir()]

verbose

verbose= 0, silent or fatal errors; 1, begin and end; 2, progress log; 3, progress and results summary; 5, full report [default 2 or as specified using gl.set.verbosity].

Details

The function is essentially a wrapper for glPca {adegenet} or pcoa {ape} with default settings apart from those specified as parameters in this function. Sources of stress in the visual representation While, technically, any distance matrix can be represented in an ordinated space, the representation will not typically be exact.There are three major sources of stress in a reduced-representation of distances or dissimilarities among entities using PCA or PCoA. By far the greatest source comes from the decision to select only the top two or three axes from the ordinated set of axes derived from the PCA or PCoA. The representation of the entities such a heavily reduced space will not faithfully represent the distances in the input distance matrix simply because of the loss of information in deeper informative dimensions. For this reason, it is not sensible to be too precious about managing the other two sources of stress in the visual representation. The measure of distance between entities in a PCA is the Pearson Correlation Coefficient, essentially a standardized Euclidean distance. This is both a metric distance and a Euclidean distance and so the distances in the final ordination are faithful to those between entities in the dataset. Note that missing values are imputed in PCA, and that this can be a source of disparity between the distances between the entities in the dataset and the distances in the ordinated space.

In PCoA, the second source of stress is the choice of distance measure or dissimilarity measure. While pretty much any distance or dissimilarity matrix can be represented in an ordinated space, the distances between entities can be faithfully represented in that space (that is, without stress) only if the distances are metric. Furthermore, for distances between entities to be faithfully represented in a rigid Cartesian space, the distance measure needs to be Euclidean. If this is not the case, the distances between the entities in the ordinated visualized space will not #' exactly represent the distances in the input matrix (stress will be non-zero). This source of stress will be evident as negative eigenvalues in the deeper dimensions.

A third source of stress arises from having a sparse dataset, one with missing values. This affects both PCA and PCoA. If the original data matrix is not fully populated, that is, if there are missing values, then even a Euclidean distance matrix will not necessarily be 'positive definite'. It follows that some of the eigenvalues may be negative, even though the distance metric is Euclidean. This issue is exacerbated when the number of loci greatly exceeds the number of individuals, as is typically the case when working with SNP data. The impact of missing values can be minimized by stringently filtering on Call Rate, albeit with loss of data. An alternative is given in a paper 'Honey, I shrunk the sample covariance matrix' and more recently by Ledoit and Wolf (2018), but their approach has not been implemented here.

Options for imputing missing values while minimizing distortion are provided in the function gl.impute(). The good news is that, unless the sum of the negative eigenvalues, arising from a non-Euclidean distance measure or from missing values, approaches those of the final PCA or PCoA axes to be displayed, the distortion is probably of no practical consequence and certainly not comparable to the stress arising from selecting only two or three final dimensions out of several informative dimensions for the visual representation. Function's output Two diagnostic plots are produced. The first is a Scree Plot, showing the percentage variation explained by each of the PCA or PCoA axes, for those axes are considered informative. The scree plot informs a decision on the number of dimensions to be retained in the visual summaries. Various approaches are available for identifying which axes are informative (in terms of containing biologically significant variation) and which are noise axes. The simplest method is to consider only those axes that explain more variance than the original variables on average as being informative (the Kaiser-Guttman criterion).

Once you have the informative axes identified, a judgement call is made as to how many dimensions to retain and present as results. This requires a decision on how much information on structure in the data is to be discarded. Retaining at least those axes that explain 10 The second graph is for diagnostic purposes only. It shows the distribution of eigenvalues for the remaining uninformative (noise) axes, including those with negative eigenvalues. If a plot.file is given, the ggplot arising from this function is saved as an binary file using gl.save(); can be reloaded with gl.load(). A file name must be specified for the plot to be saved. If a plot directory (plot.dir) is specified, the ggplot binary is saved to that directory; otherwise to the tempdir(). Action is recommended (verbose >= 2) if the negative eigenvalues are dominant, their sum approaching in magnitude the eigenvalues for axes selected for the final visual solution. Output is a glPca object conforming to adegenet::glPca but with only the following retained.

Examples of other themes that can be used can be consulted in

PCA was developed by Pearson (1901) and Hotelling (1933), whilst the best modern reference is Jolliffe (2002). PCoA was developed by Gower (1966) while the best modern reference is Legendre & Legendre (1998).

Value

An object of class pcoa containing the eigenvalues and factor scores

Author(s)

Author(s): Arthur Georges. Custodian: Arthur Georges (Post to https://groups.google.com/d/forum/dartr)

References

See Also

gl.pcoa.plot

Examples

# PCA (using SNP genlight object)
gl <- possums.gl
pca <- gl.pcoa(possums.gl[1:50,],verbose=2)
gl.pcoa.plot(pca,gl)

gs <- testset.gs
levels(pop(gs))<-c(rep('Coast',5),rep('Cooper',3),rep('Coast',5),
rep('MDB',8),rep('Coast',6),'Em.subglobosa','Em.victoriae')
# PCA (using SilicoDArT genlight object)
pca <- gl.pcoa(gs)
gl.pcoa.plot(pca,gs)
# Using a distance matrix
D <- gl.dist.ind(testset.gs, method='jaccard')
pcoa <- gl.pcoa(D,correction="cailliez")
gl.pcoa.plot(pcoa,gs)


[Package dartR.base version 0.65 Index]