infer_m {whoa}R Documentation

get posterior estimates for m from different read depth categories

Description

This function calls internal C++ routines that perform Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of the heterozygote miscall rate for each read depth category.

Usage

infer_m(
  v,
  minBin,
  indivs = NULL,
  init_m = 0.1,
  num_sweeps = 500,
  burn_in = 100
)

Arguments

v

a ‘vcfR’ object holding the information from a variant call format (VCF) file with the genotype and depth data. If you are going to be breaking the estimates down by read depth categories, the VCF file must have a DP field for every genotype.

minBin

minimum number of observations for each read depth bin. If you have 10K markers and 50 individuals then you have about 500,000 genotypes to play with. Requiring bins with at least 5,000 genotypes in them will give you less than 100 bins. You can play around with this number to get the right number of bins. The algorithm breaks the read depths up into bins that have at least minBin genotypes in them.

indivs

a character vector holding the names of the individuals from the VCF file to include in the analysis. By default this is NULL, in which case everyone from the file is included.

init_m

the initial value of the heterozygote miscall rate assumed for each read depth bin. By default this is 0.1.

num_sweeps

the number of sweeps of the MCMC algorithm to do. Default is 500, which is a little on the short side. Run multiple times and make sure the values obtained are similar across runs to assess convergence.

burn_in

how many sweeps from the beginning of the chain to discard when computing the posterior means and quantiles. Default is 100. Note that full traces of the visited m values for every read depth bin are returned as well, so that the behavior of the chain in those early steps can be investigated.

Details

The read depth bins are determined by passing to the function minBin—the minimum number of observations desired for each read depth bin. The function then breaks the observations into bins so that each read depth bin has at least minBin observations.

Note that if you want to estimate the heterozygote miscall rate overall (i.e., not conditioning each estimate on a read depth bin), then simply give a very large number (larger than the number of markers times the number of individuals) for minBin. For example, you could use a number like 1e15 for minBin. As a consequence, all the genotypes will be put into a single read depth bin.

Value

A list with six components:

m_posteriors

A tibble with 6 columns: bin = the index of the read depth bin; mean = the posterior mean estimate of the the heterozygote miscall rate in that bin; lo95 = the low endpoint of the 95 95 and mean_dp = the mean read depth in the bin.

m_traces

A tibble with all the values visited for m for every read depth bin. This tibble has three columns: bin = the index of the read depth bin; sweep = the sweep number, value = the value of m for that read depth bin in that particular sweep.

dp_summary

A tibble summarizing how many genotypes of different read depths appear in each bin.

bin_stats

A tibble with a different summary of the read-depth bins.

num_sweeps

Number of MCMC sweeps used.

burn_in

Number of sweeps discarded as burn in.

Examples

# Shorter run than recommended (for quick example...)
im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20)

[Package whoa version 0.0.2 Index]