R: Perform Allele Assignments across Entire Dataset

plotSSAllo {polysat}

R Documentation

Perform Allele Assignments across Entire Dataset

Description

processDatasetAllo runs alleleCorrelations on every locus in a "genambig" object, then runs testAlGroups on every locus using several user-specified parameter sets. It chooses a single best set of allele assignments for each locus, and produces plots to help the user evaluate assignment quality. plotSSAllo assists the user in evaluating the quality of allele assignments by plotting the results of K-means clustering. plotParamHeatmap assists the user in choosing the best parameter set for testAlGroups for each locus.

Usage

plotSSAllo(AlCorrArray)
plotParamHeatmap(propMat, popname = "AllInd", col = grey.colors(12)[12:1], main = "")
processDatasetAllo(object, samples = Samples(object), loci = Loci(object),
                   n.subgen = 2, SGploidy = 2, n.start = 50, alpha = 0.05,
                   parameters = data.frame(tolerance     = c(0.05, 0.05,  0.05, 0.05),
                                           swap          = c(TRUE, FALSE, TRUE, FALSE),
                                           null.weight   = c(0.5,  0.5,   0,    0)),
                   plotsfile = "alleleAssignmentPlots.pdf", usePops = FALSE, ...)

Arguments

`AlCorrArray`	A two-dimensional list, where each item in the list is the output of `alleleCorrelations`. The first dimension represents loci, and the second dimension represents populations. Both dimensions are named. This is the `$AlCorrArray` output of `processDatasetAllo`.
`propMat`	A two-dimensional array, with loci in the first dimension and parameter sets in the second dimension, indicating the proportion of alleles that were found to be homoplasious by `testAlGroups` or the proportion of genotypes that could not be recoded using a given set of allele assignments. This can be the `$propHomoplasious` output of `processDatasetAllo`, indexed by a single population. If a three-dimensional array is provided, it will be indexed in the second dimension by `popname`. The `$propHomoplMerged` or `$missRate` output of `processDatasetAllo` may also be passed to this argument.
`popname`	The name of the population corresponding to the data in `propMat`.
`col`	The color scale for representing the proportion of alleles that are homoplasious or the proportion of genotypes that are missing.
`main`	A title for the plot.
`object`	A `"genambig"` object.
`samples`	An optional character vector indicating which samples to include in analysis.
`loci`	An optional character vector indicating which loci to include in analysis.
`n.subgen`	The number of isoloci into which each locus should be split. Passed directly to `alleleCorrelations`.
`SGploidy`	The ploidy of each isolocus. Passed directly to `testAlGroups`.
`n.start`	Passed directly to the `nstart` argument of `kmeans`. See `alleleCorrelations`.
`alpha`	The significance threshold for determining whether two alleles are significantly correlated. Used primarily for identifying potentially problematic positive correlations. Passed directly to `alleleCorrelations`.
`parameters`	Data frame indicating parameter sets to pass to `testAlGroups`. Each row is one set of parameters.
`plotsfile`	A PDF output file name for drawing plots to help assess assignment quality. Can be `NULL` if no plots are desired.
`usePops`	If `TRUE`, population assignments are taken from the `PopInfo` slot of `object`, and populations are analyzed separately with `alleleCorrelations` and `testAlGroups`, before merging the results with `mergeAlleleAssignments`.
`...`	Additional parameters to pass to `testAlGroups` for adjusting the simulated annealing algorithm.
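
The `parameters` data frame can be typed out by hand, as in the default, or generated from a grid of values. The following is a minimal sketch using base R `expand.grid`. The values are illustrative only, not recommendations, and the column names are assumed to follow the default shown under Usage.

```r
# Build every combination of the three testAlGroups settings.
# The values below are illustrative, not recommendations.
params <- expand.grid(tolerance   = c(0.05, 0.1),
                      swap        = c(TRUE, FALSE),
                      null.weight = c(0.5, 0),
                      KEEP.OUT.ATTRS = FALSE)

# Each row is one parameter set; 2 x 2 x 2 = 8 sets in total
nrow(params)
```

Because every added parameter set means another run of `testAlGroups` per locus and population, a large grid is likely to increase run time accordingly.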

Details

plotSSAllo produces a plot of loci by population, with the sums-of-squares ratio on the x-axis and the evenness of allele distribution on the y-axis (see Value). Locus names are written directly on the plot. If there are multiple population names, locus names are colored by population, and a legend is provided for colors. Loci with high-quality allele clustering are expected to be in the upper-right quadrant of the plot. Italicized locus names indicate that positive correlations were found between some alleles, which suggests population structure or scoring error that could interfere with assignment quality.
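
To make the x-axis quantity concrete, the sketch below computes a sums-of-squares ratio from base R `kmeans` output. The input matrix is made up for illustration and is not the data that `alleleCorrelations` actually clusters. Only the ratio itself (between-cluster over total sums-of-squares) is taken from the definition under Value.

```r
set.seed(1)  # K-means uses random starts

# Hypothetical data: 6 alleles (rows) described by 5 made-up variables,
# forming two well-separated groups of 3 alleles each
x <- rbind(matrix(rnorm(15, mean = 0, sd = 0.1), nrow = 3),
           matrix(rnorm(15, mean = 5, sd = 0.1), nrow = 3))

# Two clusters stand in for two isoloci; nstart plays the role of n.start
km <- kmeans(x, centers = 2, nstart = 50)

# Between-cluster sum of squares over the total sum of squares
ssratio <- km$betweenss / km$totss
ssratio
```

A ratio near 1 means that almost all of the variation lies between clusters, which is why well-clustered loci are expected toward the right of the plot.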

plotParamHeatmap produces an image showing, for each locus and parameter set, either the proportion of alleles found to be homoplasious or the proportion of genotypes that could not be unambiguously recoded using the allele assignments. Homoplasy can be shown for a given population or merged across populations; the proportion of non-recodeable genotypes is shown merged across populations. Darker colors indicate more homoplasy or more genotypes that could not be recoded. The word “best” indicates, for each locus, the parameter set that found the least homoplasy or the smallest number of non-recodeable genotypes.
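
The sketch below illustrates, in base R, how a three-dimensional array can be indexed by population name to obtain a loci-by-parameter-set matrix, and how the smallest value per locus can be located. The array values and the names `set1` and `set2` are hypothetical. Ties are handled as `which.min` handles them, which is not necessarily how plotParamHeatmap handles them.

```r
# Hypothetical proportions: 2 loci x 2 populations x 2 parameter sets
propArr <- array(c(0.10, 0.00, 0.20, 0.25, 0.05, 0.00, 0.20, 0.15),
                 dim = c(2, 2, 2),
                 dimnames = list(c("Loc3", "Loc6"),
                                 c("Pop1", "Pop2"),
                                 c("set1", "set2")))

# Index the second dimension by population name to get loci x parameter sets
propMat <- propArr[, "Pop1", ]

# Column index of the smallest value in each row;
# which.min returns the first position when values are tied
best <- apply(propMat, 1, which.min)
best
```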

By default, processDatasetAllo generates a PDF file containing output from plotSSAllo and plotParamHeatmap, as well as heatmaps of the $heatmap.dist output of alleleCorrelations for each locus and population. Heatmaps are not plotted for loci where an allele is present in all individuals. processDatasetAllo also generates a list of R objects containing allele assignments under different parameters, as well as statistics for evaluating clustering quality and choosing the optimal parameter sets, as described below.

Value

plotSSAllo draws a plot and invisibly returns a list:

`ssratio`	A two-dimensional array with loci in the first dimension and populations in the second dimension. Each value is the sums-of-squares between isoloci divided by the total sums-of-squares, as output by K-means clustering. If K-means clustering was not performed, the value is zero.
`evenness`	An array of the same dimensions as `$ssratio`, containing values to indicate how evenly alleles are distributed among isoloci as determined by K-means clustering. This is: `1 - \sum_{i=1}^{n}{\left(\frac{a_{i}}{A}\right)^2}` where `n` is the number of isoloci, `a_i` is the number of alleles for isolocus `i`, and `A` is the total number of alleles for the locus.
`max.evenness`	The maximum possible value for `$evenness`, given the number of isoloci.
`min.evenness`	The minimum possible value for `$evenness`, given the number of isoloci and alleles.
`posCor`	An array of the same dimensions as `$ssratio`, containing `TRUE` if there were any positive correlations between alleles, and `FALSE` if not.
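
As a worked illustration of the evenness statistic, the sketch below evaluates it for a hypothetical locus with seven alleles split four and three between two isoloci. The maximum follows from an equal split of alleles. The minimum shown here rests on the assumption that every isolocus receives at least one allele, which may not match exactly how the package computes `$min.evenness`.

```r
# Hypothetical K-means result: 7 alleles split among 2 isoloci
a <- c(4, 3)     # number of alleles assigned to each isolocus
A <- sum(a)      # total number of alleles at the locus
n <- length(a)   # number of isoloci

# One minus the sum of squared allele proportions
evenness <- 1 - sum((a / A)^2)

# Largest possible value: alleles split equally among the n isoloci
max.evenness <- 1 - 1 / n

# Smallest possible value, ASSUMING each isolocus holds at least one allele:
# one isolocus takes A - n + 1 alleles and the others take one each
min.evenness <- 1 - (((A - n + 1) / A)^2 + (n - 1) * (1 / A)^2)

c(evenness = evenness, max = max.evenness, min = min.evenness)
```

Here the evenness is 24/49, just under the maximum of 0.5 for two isoloci, so this hypothetical locus would plot near the top of the y-axis range.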

processDatasetAllo returns a list:

`AlCorrArray`	A two-dimensional list with loci in the first dimension and populations in the second dimension, giving the results of `alleleCorrelations`.
`TAGarray`	A three-dimensional list with loci in the first dimension, populations in the second dimension, and parameter sets in the third dimension, giving the results of `testAlGroups`.
`plotSS`	The output of `plotSSAllo`.
`propHomoplasious`	A three-dimensional array, with the same dimensions as `$TAGarray`, indicating the proportion of alleles that were found to be homoplasious for each locus, population, and parameter set.
`mergedAssignments`	A two-dimensional list, with loci in the first dimension and parameter sets in the second dimension, containing allele assignments merged across populations. This is the output of `mergeAlleleAssignments`.
`propHomoplMerged`	A two-dimensional array, of the same dimensions as `$mergedAssignments`, indicating the proportion of alleles that were homoplasious, for each locus and parameter set, for allele assignments that were merged across populations.
`missRate`	A matrix with the same dimensions as `$mergedAssignments` indicating the proportion of non-missing genotypes from the original dataset that cannot be unambiguously recoded, without invoking aneuploidy, using the merged allele assignments from each parameter set for each locus.
`bestAssign`	A one-dimensional list with a single best set of allele assignments, from `$mergedAssignments`, for each locus. The best set of assignments is the one with the lowest `$missRate`; ties are broken using the lowest `$propHomoplMerged`, and any remaining ties are broken by taking the parameter set that was listed first.
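
The selection rule for `$bestAssign` can be sketched for a single locus with base R `order`, which sorts by its first argument, breaks ties with the second, and leaves remaining ties in their original order. The statistics and the names `set1` to `set3` below are hypothetical, and this is an illustration of the rule as described rather than the package's own code.

```r
# Hypothetical statistics for one locus across three parameter sets
missRate         <- c(set1 = 0.02, set2 = 0.00, set3 = 0.00)
propHomoplMerged <- c(set1 = 0.00, set2 = 0.10, set3 = 0.10)

# Sort by missRate, then by propHomoplMerged; sets still tied keep
# the order in which they were listed
ranking <- order(missRate, propHomoplMerged)

# The first entry in the ranking is the chosen parameter set
best <- names(missRate)[ranking[1]]
best
```

In this example `set2` and `set3` are tied on both statistics, so `set2` is chosen because it was listed first.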

plotParamHeatmap draws a plot and does not return anything.

Author(s)

Lindsay V. Clark

References

Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between allelic variables. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.

Examples

# get example dataset
data(AllopolyTutorialData)

# data cleanup
mydata <- deleteSamples(AllopolyTutorialData, c("301", "302", "303"))
PopInfo(mydata) <- rep(1:2, each = 150)
Genotype(mydata, 43, 2) <- Missing(mydata)

# allele assignments
# R is set to 10 here to speed up processing for this example.
# It should typically be left at the default.
myassign <- processDatasetAllo(mydata, loci = c("Loc3", "Loc6"),
                               plotsfile = NULL, usePops = TRUE, R = 10,
                               parameters = data.frame(tolerance   = c(0.5, 0.5),
                                                       swap        = c(TRUE, FALSE),
                                                       null.weight = c(0.5, 0.5)))

# view best assignments for each locus
myassign$bestAssign
                               
# plot K-means results
plotSSAllo(myassign$AlCorrArray)

# plot proportion of homoplasious alleles
plotParamHeatmap(myassign$propHomoplasious, "Pop1")
plotParamHeatmap(myassign$propHomoplasious, "Pop2")
plotParamHeatmap(myassign$propHomoplMerged, "Merged across populations")

# plot proportion of missing data, after recoding, for each locus and parameter set
plotParamHeatmap(myassign$missRate, main = "Missing data:")

[Package polysat version 1.7-7 Index]