R: Calculate the log-likelihoods of pedigrees

pedigree_loglikelihood {clipp}

R Documentation

Calculate the log-likelihoods of pedigrees

Description

For one or more pedigrees, this function calculates the natural logarithm of the pedigree likelihood that is on page 117 of (Lange, 2002), given inputs that correspond to the terms in this formula.

Usage

pedigree_loglikelihood(
  dat,
  geno_freq,
  trans,
  penet,
  monozyg = NULL,
  sum_loglik = TRUE,
  ncores = 1,
  load_balancing = TRUE
)

Arguments

`dat`	A data frame with rows corresponding to people and columns corresponding to the following variables (other variables can be included but will be ignored), which will be coerced to `character` type: `family` (optional), an identifier for each person's family, constant within families. If this variable is not supplied then `dat` will be treated as a single pedigree. `indiv`, an individual identifier for each person. If there are any duplicated identifiers in the dataset then the family and an underscore (`⁠_⁠`) will be prepended to all identifiers, and if any duplicates remain after this then the function will stop executing, with an error message. `mother`, the individual identifier of each person's mother, or missing (`NA`) for founders. `father`, the individual identifier of each person's father, or missing (`NA`) for founders.
`geno_freq`	A vector of strictly positive numbers that sum to `1`. If the possible genotypes of the underlying genetic model are `1:length(geno_freq)` then `geno_freq[j]` is interpreted as the population frequency of genotype `j`, so `geno_freq` is essentially the function `Prior` in the pedigree likelihood on page 117 of (Lange, 2002). For certain genetic models that often occur in applications, these genotype frequencies can be calculated by `geno_freq_monogenic`, `geno_freq_phased`, etc.
`trans`	An `ngeno^2` by `ngeno` matrix of non-negative numbers whose rows all sum to `1`, where `ngeno = length(geno_freq)` is the number of possible genotypes. The rows of `trans` correspond to joint parental genotypes and the columns correspond to offspring genotypes. If the possible genotypes are `1:length(geno_freq)` then the element `trans[ngeno * gm + gf - ngeno, go]` is interpreted as the conditional probability that a person has genotype `go`, given that his or her biological mother and father have genotypes `gm` and `gf`, respectively. So `trans` is essentially the transmission function `Tran` on page 117 of (Lange, 2002). For certain genetic models that often occur in applications, this transmission matrix can be calculated by `trans_monogenic`, `trans_phased`, etc.
`penet`	An `nrow(dat)` by `length(geno_freq)` matrix of non-negative numbers. The element `penet[i,j]` is interpreted as the conditional probability (or probability density) of the phenotype of the person corresponding to row `i` of `dat`, given that his or her genotype is `j` (where the possible genotypes are `1:length(geno_freq)`). Therefore, `penet` is essentially the penetrance function `Pen` on page 117 of (Lange, 2002). If any row of `penet` consists entirely of zeroes then the likelihood is `0`, so the returned log-likelihood will be `-Inf`. Note that genotype data can be incorporated into `penet` by regarding observed genotypes as part of the phenotype, i.e. by regarding observed genotypes as (possibly noisy) measurements of the underlying true genotypes. For example, if the observed genotype of person `i` is `1` (and if genotype measurement error is negligible) then `penet[i,j]` should be `0` for `j != 1` and `penet[i,1]` should be the same as if person `i` were ungenotyped.
`monozyg`	An optional list that can be used to specify genetically identical persons, such as monozygotic twins, monozygotic triplets, a monozygotic pair within a set of dizygotic triplets, etc. Each element of the list should be a vector containing the individual identifiers of a group of genetically identical persons, e.g. if `dat` contains six sets of monozygotic twins and one set of monozygotic triplets then `monozyg` will be a list with seven elements, one element a vector of length three and the other six elements all vectors of length two. The order of the list and the orders within its elements do not affect the output of the function. Each group of genetically identical persons should contain two or more persons, the groups should not overlap, and all persons in each group must have the same (non-missing) parents.
`sum_loglik`	A logical flag. Return a named vector giving the log-likelihood of each family if `sum_loglik` is `FALSE`, or return the sum of these log-likelihoods if `sum_loglik` is `TRUE` (the default).
`ncores`	The number of cores to be used, with `ncores = 1` (the default) corresponding to non-parallel computing. When `ncores > 1`, the `parallel` package is used to parallelize the calculation by dividing the pedigrees among the different cores.
`load_balancing`	A logical flag. When `ncores > 1`, parallelization is achieved either with the function `parallel::parLapply` (if `load_balancing` is `FALSE`) or with the load-balancing function `parallel::parLapplyLB` (if `load_balancing` is `TRUE`, the default). The load-balancing version will usually, but not always, be faster.

Details

This function provides a fast and general implementation of the Elston-Stewart algorithm to calculate the log-likelihoods of potentially large and complex pedigrees. General references for the Elston-Stewart algorithm are (Elston & Stewart, 1971), (Lange & Elston, 1975) and (Cannings et al., 1978).

Each family within dat should be a complete pedigree, meaning that each person should either have both parental identifiers missing (if a founder) or both non-missing (if a non-founder), and each (non-missing) mother or father should have a corresponding row of dat.

Observed genotypes should be incorporated into penet, as described above.

The function can handle pedigree loops, such as those caused by inbreeding or by two sisters having children with two brothers from an unrelated family (see (Totir et al., 2009) for a precise definition), though pedigrees with more than a few loops could greatly reduce the speed of the calculation.

In geno_freq, trans and penet, the order of the possible genotypes must match, in the sense that the genotype that corresponds to element j of geno_freq must also correspond to column j of trans and penet, for each j in 1:length(geno_freq).

Sex-specific genetics, such as X-linked genes or genetic loci with sex-specific recombination fractions, can be modelled by letting genotypes 1:nm be the possible male genotypes and letting (nm+1):(nm+nf) be the possible female genotypes, where nm and nf are the number of possible genotypes for males and females, respectively. Then, for example, penet[i,j] will be 0 if j %in% 1:nm and row i of dat corresponds to a female, and penet[i,j] will be 0 if j %in% (nm+1):(nm+nf) and row i of dat corresponds to a male.

Value

Either a named vector giving the log-likelihood of each family or the sum of these log-likelihoods, depending on sum_loglik (see above).

References

Cannings C, Thompson E, Skolnick M. Probability functions on complex pedigrees. Advances in Applied Probability, 1978;10(1):26-61.

Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Hum Hered. 1971;21(6):523-542.

Lange K. Mathematical and Statistical Methods for Genetic Analysis (second edition). Springer, New York. 2002.

Lange K, Elston RC. Extensions to pedigree analysis I. Likehood calculations for simple and complex pedigrees. Hum Hered. 1975;25(2):95-105.

Totir LR, Fernando RL, Abraham J. An efficient algorithm to compute marginal posterior genotype probabilities for every member of a pedigree with loops. Genet Sel Evol. 2009;41(1):52.

Examples

# Load pedigree files and penetrance matrices
data("dat_small", "penet_small", "dat_large", "penet_large")

# Settings for a single biallelic locus in Hardy-Weinberg equilibrium
# and with a minor allele frequency of 10%
geno_freq <- geno_freq_monogenic(c(0.9, 0.1))
trans <- trans_monogenic(2)

# In dat_small, ora024 and ora027 are identical twins, and so are aey063 and aey064
monozyg_small <- list(c("ora024", "ora027"), c("aey063", "aey064"))

# Calculate the log-likelihoods for 10 families, each with approximately
# 100 family members
pedigree_loglikelihood(
  dat_small, geno_freq, trans, penet_small, monozyg_small, sum_loglik = FALSE, ncores = 2
)

# Calculate the log-likelihood for one family with approximately 10,000 family members
# Note:  this calculation should take less than a minute on a standard desktop computer
# Note:  parallelization would achieve nothing here because there is only one family
str(dat_large)

system.time(
  ll <- pedigree_loglikelihood(dat_large, geno_freq, trans, penet_large)
)
ll

[Package clipp version 1.1.1 Index]