gl.report.heterozygosity {dartR.base}

R Documentation

Reports observed, expected and unbiased heterozygosities and FIS (inbreeding coefficient) by population or by individual from SNP data

Description

Calculates the observed, expected and unbiased expected (i.e. corrected for sample size) heterozygosities and FIS (inbreeding coefficient) for each population or the observed heterozygosity for each individual in a genlight object.

Usage

gl.report.heterozygosity(
  x,
  method = "pop",
  n.invariant = 0,
  nboots = 0,
  conf = 0.95,
  CI.type = "bca",
  ncpus = 1,
  plot.display = TRUE,
  plot.theme = theme_dartR(),
  plot.colors.pop = gl.colors("dis"),
  plot.colors.ind = gl.colors(2),
  error.bar = "SD",
  save2tmp = FALSE,
  verbose = NULL
)

Arguments

`x`	Name of the genlight object containing the SNP data [required].
`method`	Calculate heterozygosity by population (method='pop') or by individual (method='ind') [default 'pop'].
`n.invariant`	An estimate of the number of invariant sequence tags used to adjust the heterozygosity rate [default 0].
`nboots`	Number of bootstrap replicates to obtain confidence intervals [default 0].
`conf`	The confidence level of the required interval [default 0.95].
`CI.type`	Method to estimate confidence intervals. One of "norm", "basic", "perc" or "bca" [default "bca"].
`ncpus`	Number of processes to be used in parallel operation. If ncpus > 1, parallel operation is activated; see the "Details" section [default 1].
`plot.display`	Specify if plot is to be produced [default TRUE].
`plot.theme`	Theme for the plot. See Details for options [default theme_dartR()].
`plot.colors.pop`	A color palette for population plots or a list with as many colors as there are populations in the dataset [default gl.colors("dis")].
`plot.colors.ind`	List of two color names for the borders and fill of the plot by individual [default gl.colors(2)].
`error.bar`	Statistic to be plotted as error bars: one of "SD" (standard deviation), "SE" (standard error) or "CI" (confidence intervals) [default "SD"].
`save2tmp`	If TRUE, saves any ggplots and listings to the session temporary directory (tempdir) [default FALSE].
`verbose`	Verbosity: 0, silent or fatal errors; 1, begin and end; 2, progress log; 3, progress and results summary; 5, full report [default NULL, unless specified using gl.set.verbosity].

Details

Observed heterozygosity for a population takes the proportion of heterozygous loci for each individual and averages it over all individuals in that population. The calculations take into account missing values.

Expected heterozygosity for a population takes the expected proportion of heterozygotes, that is, expected under Hardy-Weinberg equilibrium, for each locus, then averages this across the loci for an average estimate for the population.

The unbiased expected heterozygosity is calculated using the correction for sample size following equation 2 from Nei 1978.

The accuracy of all heterozygosity estimates is affected by small sample sizes, as is their comparison between populations or across repeated analyses. Expected heterozygosities are less affected because they are calculated from allele frequencies, whereas observed heterozygosities are strongly susceptible to sampling effects when the sample size is small.

Observed heterozygosity for individuals is calculated as the proportion of loci that are heterozygous for that individual.
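The per-individual calculation can be sketched in base R as follows. This is a minimal illustration, not the function's internal code: the toy matrix `geno` and its 0/1/2 genotype coding are assumptions made for the example.

```r
# Hypothetical toy genotypes: rows = individuals, columns = loci.
# Coding assumed here: 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternative, NA = missing call.
geno <- matrix(
  c(0, 1, 2, 1,
    1, 1, NA, 0,
    0, 0, 2, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), paste0("loc", 1:4))
)

# Proportion of heterozygous calls per individual, ignoring missing loci
ho_ind <- rowMeans(geno == 1, na.rm = TRUE)
ho_ind
```

Missing calls are excluded from both the numerator and the denominator, so each individual is scored only over the loci at which it was genotyped.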

Finally, the number of loci that are invariant across all individuals in the dataset (that is, across populations) is typically unknown. This can render estimates of heterozygosity analysis-specific, and so it is not valid to compare such estimates across species or even across different analyses (see Schmidt et al. 2021). Microsatellites face a similar problem. If you have an estimate of the number of invariant sequence tags (loci) in your data, such as that provided by gl.report.secondaries, you can specify it with the n.invariant parameter to standardize your estimates of heterozygosity. Schmidt et al. (2021) call these standardized estimates autosomal heterozygosities.

NOTE: It is important to realise that estimation of adjusted (autosomal) heterozygosity requires that secondaries not be removed.

Heterozygosities and FIS (inbreeding coefficient) are calculated by locus within each population using the following equations, and then averaged across all loci:

Observed heterozygosity (Ho) = number of heterozygotes / n_Ind, where n_Ind is the number of individuals without missing data for that locus.
Observed heterozygosity adjusted (Ho.adj) = Ho * n_Loc / (n_Loc + n.invariant), where n_Loc is the number of loci that do not have all missing data and n.invariant is an estimate of the number of invariant loci to adjust heterozygosity.
Expected heterozygosity (He) = 1 - (p^2 + q^2), where p is the frequency of the reference allele and q is the frequency of the alternative allele.
Expected heterozygosity adjusted (He.adj) = He * n_Loc / (n_Loc + n.invariant)
Unbiased expected heterozygosity (uHe) = He * (2 * n_Ind / (2 * n_Ind - 1))
Inbreeding coefficient (FIS) = 1 - Ho / uHe
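The equations above can be traced through a small worked example in base R. This is a sketch under stated assumptions, not the package's implementation: the matrix `geno`, its 0/1/2 coding, the value given to `n.invariant`, and the choice to drop loci with undefined FIS before averaging are all illustrative.

```r
# Hypothetical genotypes for one population (rows = individuals, cols = loci),
# coded 0 = hom. reference, 1 = heterozygote, 2 = hom. alternative, NA = missing.
geno <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    1, 0, NA,
    2, 1, 0),
  nrow = 4, byrow = TRUE
)
n.invariant <- 2  # assumed estimate of invariant loci, for illustration only

n_ind <- colSums(!is.na(geno))                      # individuals scored per locus
ho    <- colSums(geno == 1, na.rm = TRUE) / n_ind   # Ho per locus
q     <- colSums(geno, na.rm = TRUE) / (2 * n_ind)  # alternative allele frequency
p     <- 1 - q                                      # reference allele frequency
he    <- 1 - (p^2 + q^2)                            # He per locus
uhe   <- he * (2 * n_ind / (2 * n_ind - 1))         # uHe per locus
fis   <- 1 - ho / uhe                               # FIS per locus

n_loc <- sum(n_ind > 0)                             # loci not entirely missing

# Average the per-locus values across loci. Dropping undefined FIS values
# (e.g. at monomorphic loci) is an assumption of this sketch.
res <- c(
  Ho     = mean(ho),
  Ho.adj = mean(ho) * n_loc / (n_loc + n.invariant),
  He     = mean(he),
  He.adj = mean(he) * n_loc / (n_loc + n.invariant),
  uHe    = mean(uhe),
  FIS    = mean(fis, na.rm = TRUE)
)
round(res, 3)
```

Because n_Loc / (n_Loc + n.invariant) is at most 1, the adjusted values can never exceed the unadjusted ones.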

Function's output

Output for method='pop' is an ordered barchart of observed heterozygosity, unbiased expected heterozygosity and FIS (inbreeding coefficient) across populations, together with a table of mean observed and expected heterozygosities and FIS by population and their respective standard deviations (SD).

The output also reports, by population, the number of loci used to estimate heterozygosity (n.Loc), the number of polymorphic loci (polyLoc), the number of monomorphic loci (monoLoc) and the number of loci with all missing data (all_NALoc).

Output for method='ind' is a histogram and a boxplot of heterozygosity across individuals.

Plots and table are saved to the session temporary directory (tempdir) if save2tmp = TRUE.

Examples of other themes that can be used can be consulted in

Error bars

The best method for presenting or assessing genetic statistics depends on the type of data you have and the specific questions you're trying to answer. Here's a brief overview of when you might use each method:

1. Confidence Intervals ("CI"):

- Usage: Often used to convey the precision of an estimate.

- Advantage: Confidence intervals give a range in which the true parameter (like a population mean) is likely to fall, given the data and a specified probability (like 95%).

- In Context: For genetic statistics, if you're estimating a parameter, a 95% CI gives a range in which you can be 95% confident that the true parameter lies.

2. Standard Deviation ("SD"):

- Usage: Describes the amount of variation from the average in a set of data.

- Advantage: Allows for an understanding of the spread of individual data points around the mean.

- In Context: If you're looking at the distribution of a quantitative trait (like height) in a population with a particular genotype, the SD can describe how much individual heights vary around the average height.

3. Standard Error ("SE"):

- Usage: Describes the precision of the sample mean as an estimate of the population mean.

- Advantage: Smaller than the SD in large samples; it takes into account both the SD and the sample size.

- In Context: If you want to know how accurately your sample mean represents the population mean, you'd look at the SE.

Recommendation:

- If you're trying to convey the precision of an estimate, confidence intervals are very useful.

- For understanding variability within a sample, standard deviation is key.

- To see how well a sample mean might estimate a population mean, consider the standard error.

In practice, geneticists often use a combination of these methods to analyze and present their data, depending on their research questions and the nature of the data.
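The three quantities can be compared on the same data with a few lines of base R. The vector `ho_loci` is made up for illustration, and the interval shown is a simple t-based interval for the mean, used here only to show the relationship between SD, SE and CI; the function itself derives its CIs by bootstrapping.

```r
# Hypothetical per-locus values of a statistic (e.g. observed heterozygosity)
ho_loci <- c(0.10, 0.25, 0.30, 0.00, 0.45, 0.20, 0.15, 0.35)

n    <- length(ho_loci)
sd_x <- sd(ho_loci)       # SD: spread of the loci around their mean
se_x <- sd_x / sqrt(n)    # SE: precision of the mean, shrinks as n grows

# t-based 95% CI for the mean (illustrative; not the bootstrap CI used
# by the function)
ci_x <- mean(ho_loci) + c(-1, 1) * qt(0.975, df = n - 1) * se_x

c(mean = mean(ho_loci), SD = sd_x, SE = se_x,
  lower = ci_x[1], upper = ci_x[2])
```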

Confidence Intervals

The uncertainty of a parameter, in this case the mean of the statistic, can be summarised by a confidence interval (CI) which includes the true parameter value with a specified probability (i.e. confidence level; the parameter "conf" in this function).

In this function, CIs are obtained by bootstrapping, an inference method that resamples the data (i.e. loci) with replacement and recalculates the statistic for each resample.

This function uses the function boot (package boot) to perform the bootstrap replicates and the function boot.ci (package boot) to perform the calculations for the CI.

Four different types of nonparametric CI can be calculated (parameter "CI.type" in this function):

First order normal approximation interval ("norm").
Basic bootstrap interval ("basic").
Bootstrap percentile interval ("perc").
Adjusted bootstrap percentile interval ("bca").
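As a rough illustration of how boot and boot.ci fit together, the sketch below bootstraps the mean of a vector of per-locus values and requests all four interval types. The data and the statistic function `mean_stat` are invented for the example; how gl.report.heterozygosity calls these functions internally is not shown here.

```r
library(boot)  # recommended package, included in standard R installations

set.seed(42)
# Hypothetical per-locus heterozygosity values; loci are the resampling unit
he_loci <- runif(200, min = 0, max = 0.5)

# boot() calls the statistic with the data and a vector of resampled indices
mean_stat <- function(data, idx) mean(data[idx])

b <- boot(data = he_loci, statistic = mean_stat, R = 1000)

# Request the four interval types at the 95% confidence level
ci <- boot.ci(b, conf = 0.95, type = c("norm", "basic", "perc", "bca"))

# For the percentile interval, columns 4 and 5 hold the lower and upper limits
ci$percent[4:5]
```

The returned object stores each interval type in its own component (`normal`, `basic`, `percent`, `bca`), so the limits can be extracted and compared side by side.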

The studentized bootstrap interval ("stud") was not included in the CI types because it is computationally intensive, it may produce estimates outside the range of plausible values and it has been found to be erratic in practice, see for example the "Studentized (t) Intervals" section in:

https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/

Nice tutorials about the different types of CI can be found in:

https://www.datacamp.com/tutorial/bootstrap-r

and

https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/

Efron and Tibshirani (1993, p. 162) and Davison and Hinkley (1997, p. 194) suggest that the number of bootstrap replicates should be between 1000 and 2000.

It is important to note that unreliable confidence intervals will be obtained if too few bootstrap replicates are used. The function boot.ci will therefore throw warnings and errors if there are too few bootstrap replicates. Consider increasing the number of bootstrap replicates to at least 200.

The "bca" interval is often cited as the best for theoretical reasons; however, it may produce unstable results if the bootstrap distribution is skewed or has extreme values. For example, you might get the warning "extreme order statistics used as endpoints" or the error "estimated adjustment 'a' is NA". In this case, you may want to use more bootstrap replicates, use a different method, or check your data for outliers.

The error "estimated adjustment 'w' is infinite" means that the estimated adjustment 'w' for the "bca" interval is infinite, which can happen when the empirical influence values are zero or very close to zero. Possible causes include:

- The number of bootstrap replicates is too small.

- The statistic of interest is constant or nearly constant across the bootstrap samples.

- The data contains outliers or extreme values.

Possible solutions include:

- Increasing the number of bootstrap replicates.

- Using a different type of bootstrap confidence interval.

- Removing or transforming the outliers or extreme values.
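One way to apply the "different interval type" remedy in your own scripts is to wrap boot.ci in tryCatch. The helper `safe_ci` below is hypothetical (it is not part of dartR.base or boot) and simply falls back to the percentile interval when the "bca" calculation raises an error.

```r
library(boot)

# Hypothetical helper: try "bca" first and fall back to the percentile
# interval if boot.ci() signals an error.
safe_ci <- function(boot_out, conf = 0.95) {
  tryCatch(
    boot.ci(boot_out, conf = conf, type = "bca"),
    error = function(e) {
      message("bca failed (", conditionMessage(e), "); using 'perc' instead")
      boot.ci(boot_out, conf = conf, type = "perc")
    }
  )
}

set.seed(1)
x <- rnorm(100, mean = 0.2, sd = 0.05)  # toy per-locus values
# Deliberately few replicates, the situation in which problems are most likely
b <- boot(x, function(d, i) mean(d[i]), R = 50)
ci <- safe_ci(b)
```

Warnings such as "extreme order statistics used as endpoints" are not caught by this wrapper; they still indicate that more replicates are advisable.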

Parallelisation

If the parameter ncpus > 1, parallelisation is enabled. In Windows, parallel computing employs a "socket" approach that starts new copies of R on each core. POSIX systems, on the other hand (Mac, Linux, Unix, and BSD), utilise a "forking" approach that replicates the whole current version of R and transfers it to a new core.

Opening and terminating R sessions in each core takes a significant amount of processing time; parallelisation on Windows machines is therefore only quicker than serial execution when nboots > 1000-2000.
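The two approaches correspond to the "snow" and "multicore" values of the parallel argument of boot::boot. The sketch below picks one by platform; whether gl.report.heterozygosity selects the backend in exactly this way is an assumption, and the data are made up.

```r
library(boot)

# Pick the backend by platform: "snow" (sockets) on Windows, "multicore"
# (forking) elsewhere. Both are documented values of boot()'s `parallel`.
par_type <- if (.Platform$OS.type == "windows") "snow" else "multicore"

set.seed(7)
x <- runif(500)  # toy per-locus values
b <- boot(x, function(d, i) mean(d[i]), R = 200,
          parallel = par_type, ncpus = 2)
length(b$t)  # one bootstrap estimate per replicate
```

With so few replicates the start-up cost of the workers is likely to outweigh any gain, which is the trade-off described above.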

Value

A dataframe containing population labels, heterozygosities, FIS, their standard deviations and sample sizes.

Author(s)

Custodian: Luis Mijangos (Post to https://groups.google.com/d/forum/dartr)

References

Nei, M. (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3), 583-590.

Examples

 
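require("dartR.data")
# Heterozygosities and FIS by population (default method = "pop")
df <- gl.report.heterozygosity(platypus.gl)
# Observed heterozygosity by individual
df <- gl.report.heterozygosity(platypus.gl, method = "ind")
# Adjust for invariant sequence tags, estimated by gl.report.secondaries
n.inv <- gl.report.secondaries(platypus.gl)
gl.report.heterozygosity(platypus.gl, n.invariant = n.inv[7, 2])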

[Package dartR.base version 0.65 Index]