plotSSAllo {polysat}    R Documentation

Perform Allele Assignments across Entire Dataset

Description

processDatasetAllo runs alleleCorrelations on every locus in a "genambig" object, then runs testAlGroups on every locus using several user-specified parameter sets. It chooses a single best set of allele assignments for each locus, and produces plots to help the user evaluate assignment quality. plotSSAllo assists the user in evaluating the quality of allele assignments by plotting the results of K-means clustering. plotParamHeatmap assists the user in choosing the best parameter set for testAlGroups for each locus.

Usage

plotSSAllo(AlCorrArray)
plotParamHeatmap(propMat, popname = "AllInd", col = grey.colors(12)[12:1], main = "")
processDatasetAllo(object, samples = Samples(object), loci = Loci(object),
                   n.subgen = 2, SGploidy = 2, n.start = 50, alpha = 0.05,
                   parameters = data.frame(tolerance     = c(0.05, 0.05,  0.05, 0.05),
                                           swap          = c(TRUE, FALSE, TRUE, FALSE),
                                           null.weight   = c(0.5,  0.5,   0,    0)),
                   plotsfile = "alleleAssignmentPlots.pdf", usePops = FALSE, ...)

Arguments

AlCorrArray

A two-dimensional list, where each item in the list is the output of alleleCorrelations. The first dimension represents loci, and the second dimension represents populations. Both dimensions are named. This is the $AlCorrArray output of processDatasetAllo.

propMat

A two-dimensional array, with loci in the first dimension and parameter sets in the second dimension, indicating the proportion of alleles that were found to be homoplasious by testAlGroups or the proportion of genotypes that could not be recoded using a given set of allele assignments. This can be the $propHomoplasious output of processDatasetAllo, indexed by a single population. If a three-dimensional array is provided, it will be indexed in the second dimension by popname. The $propHomoplMerged or $missRate output of processDatasetAllo may also be passed to this argument.

popname

The name of the population corresponding to the data in propMat.

col

The color scale for representing the proportion of alleles that are homoplasious or the proportion of genotypes that are missing.

main

A title for the plot.

object

A "genambig" object.

samples

An optional character vector indicating which samples to include in analysis.

loci

An optional character vector indicating which loci to include in analysis.

n.subgen

The number of isoloci into which each locus should be split. Passed directly to alleleCorrelations.

SGploidy

The ploidy of each isolocus. Passed directly to testAlGroups.

n.start

Passed directly to the nstart argument of kmeans. See alleleCorrelations.

alpha

The significance threshold for determining whether two alleles are significantly correlated. Used primarily for identifying potentially problematic positive correlations. Passed directly to alleleCorrelations.

parameters

Data frame indicating parameter sets to pass to testAlGroups. Each row is one set of parameters.

plotsfile

A PDF output file name for drawing plots to help assess assignment quality. Can be NULL if no plots are desired.

usePops

If TRUE, population assignments are taken from the PopInfo slot of object, and populations are analyzed separately with alleleCorrelations and testAlGroups, before merging the results with mergeAlleleAssignments.

...

Additional parameters to pass to testAlGroups for adjusting the simulated annealing algorithm.
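As an illustration of the parameters argument described above (one row per parameter set), the sketch below builds a grid of parameter sets with base R's expand.grid. The column names are taken from the default shown under Usage. The object name params is hypothetical, and the values are only an example, not a recommendation.

```r
# Cross two swap settings with two null.weight values at one tolerance.
# expand.grid() varies the first multi-valued column fastest, so the rows
# come out in the same order as the default data frame shown under Usage.
params <- expand.grid(tolerance        = 0.05,
                      swap             = c(TRUE, FALSE),
                      null.weight      = c(0.5, 0),
                      KEEP.OUT.ATTRS   = FALSE,
                      stringsAsFactors = FALSE)

print(params)  # one row per parameter set
```

A data frame built this way could then be supplied as parameters = params in a call to processDatasetAllo.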

Details

plotSSAllo produces a plot of loci by population, with the sums-of-squares ratio on the x-axis and the evenness of allele distribution on the y-axis (see Value). Locus names are written directly on the plot. If there are multiple populations, locus names are colored by population and a legend identifies the colors. Loci with high-quality allele clustering are expected to fall in the upper-right quadrant of the plot. Locus names in italics indicate that positive correlations were found between some alleles, which suggests population structure or scoring error that could interfere with assignment quality.

plotParamHeatmap produces an image to indicate the proportion of alleles found to be homoplasious, or the proportion of genotypes that could not be unambiguously recoded using allele assignments, for each locus and parameter set for a given population (when looking at homoplasy) or merged across populations (for homoplasy or the proportion of non-recodeable genotypes). Darker colors indicate more homoplasy or more genotypes that could not be recoded. The word “best” indicates, for each locus, the parameter set that found the least homoplasy or smallest number of non-recodeable genotypes.
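The indexing and the idea of a "best" parameter set can be sketched in base R without the package. The array below is made-up toy data laid out as loci by populations by parameter sets. The names prop3d, propMat and best are hypothetical, and the tie-breaking shown (first minimum wins) is an assumption of this sketch rather than a statement about how plotParamHeatmap resolves ties.

```r
# Toy 3-D array: loci x populations x parameter sets (values are invented).
prop3d <- array(c(0.2, 0.0, 0.1, 0.3, 0.1, 0.0, 0.4, 0.2),
                dim = c(2, 2, 2),
                dimnames = list(c("Loc3", "Loc6"),
                                c("Pop1", "Pop2"),
                                c("1", "2")))

# Index the second dimension by population name to get a
# loci x parameter-set matrix for that population.
propMat <- prop3d[, "Pop1", ]

# For each locus (row), find the parameter set (column) with the smallest
# proportion. which.min() returns the first minimum when values are tied.
best <- apply(propMat, 1, which.min)
print(best)
```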

By default, processDatasetAllo generates a PDF file containing output from plotSSAllo and plotParamHeatmap, as well as heatmaps of the $heatmap.dist output of alleleCorrelations for each locus and population. Heatmaps are not plotted for loci where an allele is present in all individuals. processDatasetAllo also generates a list of R objects containing allele assignments under different parameters, as well as statistics for evaluating clustering quality and choosing the optimal parameter sets, as described below.

Value

plotSSAllo draws a plot and invisibly returns a list:

ssratio

A two-dimensional array with loci in the first dimension and populations in the second dimension. Each value is the sums-of-squares between isoloci divided by the total sums-of-squares, as output by K-means clustering. If K-means clustering was not performed, the value is zero.

evenness

An array of the same dimensions as $ssratio, containing values to indicate how evenly alleles are distributed among isoloci as determined by K-means clustering. This is:

1 - \sum_{j=1}^{n}{(\frac{a_{j}}{A})^2}

where n is the number of isoloci, a_j is the number of alleles assigned to isolocus j, and A is the total number of alleles for the locus.

max.evenness

The maximum possible value for $evenness, given the number of isoloci.

min.evenness

The minimum possible value for $evenness, given the number of isoloci and alleles.

posCor

An array of the same dimensions as $ssratio, containing TRUE if there were any positive correlations between alleles, and FALSE if not.
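The two plotted statistics can be illustrated in base R. This is a minimal sketch on invented data, not the package's internal code: the matrix x stands in for whatever alleleCorrelations passes to kmeans, and the helper evenness() is a hypothetical name that simply implements the formula given for $evenness.

```r
set.seed(1)

# Toy data: 6 "alleles" described by 2 made-up coordinates, in two tight groups.
x <- rbind(matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))

# K-means with two clusters (isoloci); nstart plays the role of n.start.
km <- kmeans(x, centers = 2, nstart = 50)

# Sums-of-squares ratio: between-cluster SS divided by total SS.
ssratio <- km$betweenss / km$totss

# Evenness: one minus the sum of squared proportions of alleles per isolocus.
evenness <- function(a) 1 - sum((a / sum(a))^2)

even.split   <- evenness(c(4, 4))  # 8 alleles split 4 and 4
uneven.split <- evenness(c(7, 1))  # 8 alleles split 7 and 1

print(c(ssratio = ssratio, even = even.split, uneven = uneven.split))
```

With two isoloci, the even split gives 0.5, which is the largest value the formula can take for two groups (1 - 1/2), while the 7 and 1 split gives 0.21875. Well-separated clusters give a ratio close to one.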

processDatasetAllo returns a list:

AlCorrArray

A two-dimensional list with loci in the first dimension and populations in the second dimension, giving the results of alleleCorrelations.

TAGarray

A three-dimensional list with loci in the first dimension, populations in the second dimension, and parameter sets in the third dimension, giving the results of testAlGroups.

plotSS

The output of plotSSAllo.

propHomoplasious

A three-dimensional array, with the same dimensions as $TAGarray, indicating the proportion of alleles that were found to be homoplasious for each locus, population, and parameter set.

mergedAssignments

A two-dimensional list, with loci in the first dimension and parameter sets in the second dimension, containing allele assignments merged across populations. This is the output of mergeAlleleAssignments.

propHomoplMerged

A two-dimensional array, of the same dimensions as $mergedAssignments, indicating the proportion of alleles that were homoplasious, for each locus and parameter set, for allele assignments that were merged across populations.

missRate

A matrix with the same dimensions as $mergedAssignments indicating the proportion of non-missing genotypes from the original dataset that cannot be unambiguously recoded, without invoking aneuploidy, using the merged allele assignments from each parameter set for each locus.

bestAssign

A one-dimensional list with a single best set of allele assignments, from $mergedAssignments, for each locus. The best set of assignments is chosen using $missRate, then in the case of a tie using $propHomoplMerged, then in the case of a tie using the parameter set that was listed first.
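The selection rule for $bestAssign can be sketched with base R's order(), which sorts by its first argument, breaks ties with the second, and leaves any remaining ties in their original order. The matrices below are invented toy values, and the code is an illustration of the rule as described, not the package's own implementation.

```r
# Toy statistics: 2 loci x 3 parameter sets (values are invented).
sets <- c("set1", "set2", "set3")
missRate <- matrix(c(0.10, 0.02,
                     0.10, 0.02,
                     0.05, 0.02),
                   nrow = 2, dimnames = list(c("Loc3", "Loc6"), sets))
propHomoplMerged <- matrix(c(0.0, 0.3,
                             0.0, 0.1,
                             0.2, 0.1),
                           nrow = 2, dimnames = list(c("Loc3", "Loc6"), sets))

# For each locus: lowest missRate first, then lowest propHomoplMerged,
# then whichever parameter set was listed first.
best <- sapply(rownames(missRate), function(l) {
  order(missRate[l, ], propHomoplMerged[l, ])[1]
})
print(best)
```

Here Loc3 is decided by missRate alone (set3), while Loc6 is tied on missRate, tied between set2 and set3 on homoplasy, and so falls to set2 because it was listed first.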

plotParamHeatmap draws a plot and does not return anything.

Author(s)

Lindsay V. Clark

References

Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between allelic variables. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.

See Also

alleleCorrelations, recodeAllopoly

Examples

# get example dataset
data(AllopolyTutorialData)

# data cleanup
mydata <- deleteSamples(AllopolyTutorialData, c("301", "302", "303"))
PopInfo(mydata) <- rep(1:2, each = 150)
Genotype(mydata, 43, 2) <- Missing(mydata)

# allele assignments
# R is set to 10 here to speed up processing for this example.  It should typically be left at the default.
myassign <- processDatasetAllo(mydata, loci = c("Loc3", "Loc6"),
                               plotsfile = NULL, usePops = TRUE, R = 10,
                               parameters = data.frame(tolerance   = c(0.5, 0.5),
                                                       swap        = c(TRUE, FALSE),
                                                       null.weight = c(0.5, 0.5)))

# view best assignments for each locus
myassign$bestAssign
                               
# plot K-means results
plotSSAllo(myassign$AlCorrArray)

# plot proportion of homoplasious alleles
plotParamHeatmap(myassign$propHomoplasious, "Pop1")
plotParamHeatmap(myassign$propHomoplasious, "Pop2")
plotParamHeatmap(myassign$propHomoplMerged, "Merged across populations")

# plot proportion of missing data, after recoding, for each locus and parameter set
plotParamHeatmap(myassign$missRate, main = "Missing data:")

[Package polysat version 1.7-7 Index]