HindHe {polyRAD}

R Documentation

Identify Non-Mendelian Loci and Taxa that Deviate from Ploidy Expectations

Description

HindHe and HindHeMapping both generate a matrix of values, with taxa in rows and loci in columns. The expected mean of the matrix depends on the ploidy and, in the case of natural populations and diversity panels, on the inbreeding coefficient. colMeans of the matrix can be used to filter non-Mendelian loci from the dataset. rowMeans of the matrix can be used to identify taxa that are not the expected ploidy, are interspecific hybrids, or are a mix of multiple samples.

Usage

HindHe(object, ...)

## S3 method for class 'RADdata'
HindHe(object, omitTaxa = GetBlankTaxa(object), ...)

HindHeMapping(object, ...)

## S3 method for class 'RADdata'
HindHeMapping(object, n.gen.backcrossing = 0, n.gen.intermating = 0,
              n.gen.selfing = 0, ploidy = object$possiblePloidies[[1]],
              minLikelihoodRatio = 10,
              omitTaxa = c(GetDonorParent(object), GetRecurrentParent(object), 
                           GetBlankTaxa(object)), ...)

Arguments

`object`	A `RADdata` object. Genotype calling does not need to have been performed yet. If the population is a mapping population, `SetDonorParent` and `SetRecurrentParent` should have been run already.
`omitTaxa`	A character vector indicating names of taxa not to be included in the output. For `HindHe`, these taxa will also be omitted from allele frequency estimations.
`n.gen.backcrossing`	The number of generations of backcrossing performed in a mapping population.
`n.gen.intermating`	The number of generations of intermating performed in a mapping population. Included for consistency with `PipelineMapping2Parents`; currently, any value other than zero gives an error. If the most recent generation in your mapping population was random mating among all progeny, use `HindHe` instead of `HindHeMapping`.
`n.gen.selfing`	The number of generations of self-fertilization performed in a mapping population.
`ploidy`	A single value indicating the assumed ploidy to test. Currently, only autopolyploid and diploid inheritance modes are supported.
`minLikelihoodRatio`	Used internally by `EstimateParentalGenotypes` as a threshold for certainty of parental genotypes. Decrease this value if too many markers are being discarded from the calculation.
`...`	Additional arguments (none implemented).

Details

These functions are especially useful for highly duplicated genomes, in which RAD tag alignments may have been incorrect, resulting in groups of alleles that do not represent true Mendelian loci. The statistic that is calculated is based on the principle that observed heterozygosity will be higher than expected heterozygosity if a "locus" actually represents two or more collapsed paralogs. However, the statistic uses read depth in place of genotypes, eliminating the need to perform genotype calling before filtering.

For a given taxon and locus, H_{ind} is the probability that two sequencing reads, sampled without replacement, are different alleles (RAD tags).
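
The definition of H_{ind} can be illustrated with a minimal base R sketch. The helper `hind_from_depth` is hypothetical and is not part of polyRAD; it assumes a vector of read depths, one value per allele, for a single taxon at a single locus, and is not necessarily how the package computes the statistic internally.

```r
# Hypothetical helper (not a polyRAD function): probability that two reads
# drawn without replacement from one taxon at one locus are different alleles.
hind_from_depth <- function(depth) {
  total <- sum(depth)
  if (total < 2) return(NA_real_)  # undefined with fewer than two reads
  # 1 minus the probability that both reads come from the same allele
  1 - sum(depth * (depth - 1)) / (total * (total - 1))
}

hind_from_depth(c(5, 5))   # two alleles at equal depth
hind_from_depth(c(10, 0))  # only one allele observed, so no read pair can differ
```

Returning NA when fewer than two reads are available mirrors the need for na.rm = TRUE when averaging the real output matrix.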

In HindHe, H_E is the expected heterozygosity, estimated from allele frequencies by taking the column means of object$depthRatios. This is also the estimated probability that if two alleles were sampled at random from the population at a given locus, they would be different alleles.
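
As a rough illustration of this calculation, the sketch below uses a toy matrix of depth ratios for one locus with two alleles. The matrix and the object names are invented for the example; a real RADdata object stores many loci in object$depthRatios, and the grouping of alleles into loci is omitted here.

```r
# Toy depth ratios: 4 taxa (rows) x 2 alleles (columns) of a single locus.
# Values are invented for illustration only.
depth_ratios <- matrix(c(0.5, 0.5,
                         1.0, 0.0,
                         0.0, 1.0,
                         0.5, 0.5),
                       ncol = 2, byrow = TRUE,
                       dimnames = list(paste0("taxon", 1:4),
                                       c("allele_A", "allele_B")))

# Allele frequencies estimated as column means of the depth ratios
allele_freq <- colMeans(depth_ratios, na.rm = TRUE)

# Expected heterozygosity: probability that two alleles drawn at random
# from the population are different
h_e <- 1 - sum(allele_freq^2)
h_e
```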

In HindHeMapping, H_E is the average probability that in a random progeny, two alleles sampled without replacement would be different. The number of generations of backcrossing and self-fertilization, along with the ploidy and estimated parental genotypes, are needed to make this calculation. The function essentially simulates the mapping population based on parental genotypes to determine H_E.
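
To make the idea concrete, here is a small sketch for one assumed case: a diploid BC1 population in which an A/B F1 is crossed to a B/B recurrent parent, so progeny are expected to be half B/B and half A/B. The genotype frequencies are written by hand rather than simulated, and the variable names are hypothetical; this is not the code used by HindHeMapping.

```r
# Assumed scenario: diploid BC1, F1 (A/B) x recurrent parent (B/B).
ploidy    <- 2
copies_A  <- c(0, 1)      # copies of allele A in B/B and A/B progeny
geno_prob <- c(0.5, 0.5)  # expected frequencies of those genotypes

# Probability that two allele copies drawn without replacement from one
# progeny are different, for each progeny genotype
copies_B <- ploidy - copies_A
p_differ <- 1 - (copies_A * (copies_A - 1) + copies_B * (copies_B - 1)) /
  (ploidy * (ploidy - 1))

# H_E for the mapping population: average over progeny genotypes
h_e_mapping <- sum(geno_prob * p_differ)
h_e_mapping
```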

The expectation is that

H_{ind}/H_E = \frac{ploidy - 1}{ploidy} * (1 - F)

in a diversity panel, where F is the inbreeding coefficient, and

H_{ind}/H_E = \frac{ploidy - 1}{ploidy}

in a mapping population. Loci that have much higher average values likely represent collapsed paralogs that should be removed from the dataset. Taxa with much higher average values may be higher ploidy than expected, interspecific hybrids, or multiple samples mixed together.
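
The expectations above can be tabulated and applied as a filter with a few lines of base R. In this sketch, `expected_hindhe`, the toy matrix `hh_toy`, and the cutoff of 0.7 are all assumptions made for illustration; an appropriate cutoff for real data depends on the distribution of values observed in that dataset.

```r
# Hypothetical helper: expected mean of Hind/He for a given ploidy and
# inbreeding coefficient (use inbreeding = 0 for a mapping population).
expected_hindhe <- function(ploidy, inbreeding = 0) {
  (ploidy - 1) / ploidy * (1 - inbreeding)
}
expected_hindhe(2)                    # diploid, no inbreeding
expected_hindhe(4)                    # autotetraploid, no inbreeding
expected_hindhe(4, inbreeding = 0.2)  # autotetraploid diversity panel

# Toy Hind/He matrix standing in for the output of HindHe:
# taxa in rows, loci in columns, NA where read depth was insufficient.
hh_toy <- matrix(c(0.45, 0.90, 0.55,
                   0.50, 0.95, NA,
                   0.55, 0.85, 0.45),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("taxon", 1:3), paste0("locus", 1:3)))

locus_means <- colMeans(hh_toy, na.rm = TRUE)
cutoff <- 0.7  # arbitrary illustrative cutoff for a diploid dataset
keep_loci <- names(locus_means)[locus_means <= cutoff]
keep_loci
```

Here locus2 has a mean well above the diploid expectation of 0.5 and would be treated as a likely collapsed paralog. The same approach with rowMeans can flag suspect taxa.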

Value

A named matrix, with taxa in rows and loci in columns. For HindHeMapping, loci are omitted if consistent parental genotypes could not be determined across alleles.

Author(s)

Lindsay V. Clark

References

Clark, L. V., Mays, W., Lipka, A. E. and Sacks, E. J. (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23, 101, doi:10.1186/s12859-022-04635-9.

A seminar describing H_{ind}/H_E is available at https://youtu.be/Z2xwLQYc8OA?t=1678.

Examples

data(exampleRAD)

hhmat <- HindHe(exampleRAD)
colMeans(hhmat, na.rm = TRUE) # near 0.5 for diploid loci, 0.75 for tetraploid loci

data(exampleRAD_mapping)
exampleRAD_mapping <- SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping <- SetRecurrentParent(exampleRAD_mapping, "parent2")

hhmat2 <- HindHeMapping(exampleRAD_mapping, n.gen.backcrossing = 1)
colMeans(hhmat2, na.rm = TRUE) # near 0.5; all loci diploid

[Package polyRAD version 2.0.0 Index]