HindHe {polyRAD} | R Documentation |
Identify Non-Mendelian Loci and Taxa that Deviate from Ploidy Expectations
Description
HindHe and HindHeMapping both generate a matrix of values, with taxa in rows and loci in columns. The mean value of the matrix is expected to be a certain value depending on the ploidy and, in the case of natural populations and diversity panels, the inbreeding coefficient. colMeans of the matrix can be used to filter non-Mendelian loci from the dataset. rowMeans of the matrix can be used to identify taxa that are not the expected ploidy, are interspecific hybrids, or are a mix of multiple samples.
Usage
HindHe(object, ...)
## S3 method for class 'RADdata'
HindHe(object, omitTaxa = GetBlankTaxa(object), ...)
HindHeMapping(object, ...)
## S3 method for class 'RADdata'
HindHeMapping(object, n.gen.backcrossing = 0, n.gen.intermating = 0,
n.gen.selfing = 0, ploidy = object$possiblePloidies[[1]],
minLikelihoodRatio = 10,
omitTaxa = c(GetDonorParent(object), GetRecurrentParent(object),
GetBlankTaxa(object)), ...)
Arguments
object |
A |
omitTaxa |
A character vector indicating names of taxa not to be included in the output.
For |
n.gen.backcrossing |
The number of generations of backcrossing performed in a mapping population. |
n.gen.intermating |
The number of generations of intermating performed in a mapping population.
Included for consistency with |
n.gen.selfing |
The number of generations of self-fertilization performed in a mapping population. |
ploidy |
A single value indicating the assumed ploidy to test. Currently, only autopolyploid and diploid inheritance modes are supported. |
minLikelihoodRatio |
Used internally by |
... |
Additional arguments (none implemented). |
Details
These functions are especially useful for highly duplicated genomes, in which RAD tag alignments may have been incorrect, resulting in groups of alleles that do not represent true Mendelian loci. The statistic that is calculated is based on the principle that observed heterozygosity will be higher than expected heterozygosity if a "locus" actually represents two or more collapsed paralogs. However, the statistic uses read depth in place of genotypes, eliminating the need to perform genotype calling before filtering.
For a given taxon * locus, H_{ind} is the probability that two sequencing reads, sampled without replacement, are different alleles (RAD tags).
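The definition of H_{ind} can be illustrated with a minimal base R sketch. The helper name `hind_from_depth` is hypothetical and not part of polyRAD; the formula is derived directly from the "two reads without replacement" definition and is not taken from the package source.

```r
# Hypothetical helper (not a polyRAD function): probability that two reads
# drawn without replacement from one taxon x locus are different alleles.
# `depth` is a vector of read counts, one element per allele.
hind_from_depth <- function(depth) {
  total <- sum(depth)
  if (total < 2) return(NA_real_)  # undefined with fewer than two reads
  # 1 minus the probability that both reads come from the same allele
  1 - sum(depth * (depth - 1)) / (total * (total - 1))
}

hind_from_depth(c(5, 5))   # two alleles at equal depth
hind_from_depth(c(10, 0))  # a single allele observed, so the result is 0
```

Returning NA when fewer than two reads are available is a choice made for this sketch; it keeps low-depth cells from being counted as zeros when means are later taken with `na.rm = TRUE`.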
In HindHe, H_E is the expected heterozygosity, estimated from allele frequencies by taking the column means of object$depthRatios. This is also the estimated probability that if two alleles were sampled at random from the population at a given locus, they would be different alleles.
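As a rough illustration of this estimate, the base R sketch below uses a made-up depth ratio matrix for a single locus. The matrix, its dimension names, and the use of one minus the sum of squared allele frequencies (sampling with replacement from a large population) are assumptions for the example, not the package's internal code.

```r
# Toy depth ratios for one locus: taxa in rows, alleles in columns.
# Each row sums to 1, as a read depth ratio would.
depth_ratios <- matrix(c(1.0, 0.0,
                         0.5, 0.5,
                         0.0, 1.0,
                         0.5, 0.5),
                       ncol = 2, byrow = TRUE,
                       dimnames = list(paste0("taxon", 1:4),
                                       c("allele_A", "allele_B")))

# Allele frequencies estimated as column means of the depth ratios
allele_freq <- colMeans(depth_ratios, na.rm = TRUE)

# Expected heterozygosity: chance that two randomly sampled alleles differ
he <- 1 - sum(allele_freq^2)
he
```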
In HindHeMapping, H_E is the average probability that in a random progeny, two alleles sampled without replacement would be different. The number of generations of backcrossing and self-fertilization, along with the ploidy and estimated parental genotypes, are needed to make this calculation. The function essentially simulates the mapping population based on parental genotypes to determine H_E.
The expectation is that
H_{ind}/H_E = \frac{ploidy - 1}{ploidy} \times (1 - F)
in a diversity panel, where F is the inbreeding coefficient, and
H_{ind}/H_E = \frac{ploidy - 1}{ploidy}
in a mapping population.
Loci with average values well above this expectation likely represent collapsed paralogs that should be removed from the dataset. Taxa with average values well above this expectation may be higher ploidy than expected, interspecific hybrids, or multiple samples mixed together.
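The expected values can be computed directly, and column means compared against them, as in the base R sketch below. The function `expected_hindhe`, the toy matrix, and the cutoff are all hypothetical and chosen only for illustration; an appropriate cutoff depends on the dataset.

```r
# Hypothetical helper: expected H_ind/H_E for a given ploidy and
# inbreeding coefficient (use inbreeding = 0 for a mapping population).
expected_hindhe <- function(ploidy, inbreeding = 0) {
  (ploidy - 1) / ploidy * (1 - inbreeding)
}

expected_hindhe(2)  # diploid
expected_hindhe(4)  # autotetraploid

# Toy H_ind/H_E matrix: taxa in rows, loci in columns (values are made up).
toy <- matrix(c(0.45, 0.55, 0.95,
                0.50, 0.48, 1.05,
                0.52, 0.47, 0.90),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("taxon", 1:3), paste0("locus", 1:3)))

locus_means <- colMeans(toy, na.rm = TRUE)
cutoff <- 0.75  # arbitrary illustrative threshold, not a recommended value
keep <- names(locus_means)[locus_means <= cutoff]
keep  # loci whose mean does not exceed the cutoff
```

The same pattern with `rowMeans` would flag taxa rather than loci.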
Value
A named matrix, with taxa in rows and loci in columns. For HindHeMapping, loci are omitted if consistent parental genotypes could not be determined across alleles.
Author(s)
Lindsay V. Clark
References
Clark, L. V., Mays, W., Lipka, A. E. and Sacks, E. J. (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23, 101, doi:10.1186/s12859-022-04635-9.
A seminar describing H_{ind}/H_E is available at https://youtu.be/Z2xwLQYc8OA?t=1678.
See Also
InbreedingFromHindHe, ExpectedHindHe
Examples
library(polyRAD)

# Diversity panel: H_E is estimated from allele frequencies
data(exampleRAD)
hhmat <- HindHe(exampleRAD)
colMeans(hhmat, na.rm = TRUE) # near 0.5 for diploid loci, 0.75 for tetraploid loci

# Mapping population: designate the parents before calling HindHeMapping
data(exampleRAD_mapping)
exampleRAD_mapping <- SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping <- SetRecurrentParent(exampleRAD_mapping, "parent2")
hhmat2 <- HindHeMapping(exampleRAD_mapping, n.gen.backcrossing = 1)
colMeans(hhmat2, na.rm = TRUE) # near 0.5; all loci diploid