R: Statistical Inference for Mixed Samples of DNA (Lite-Version)

DNAmixturesLite-package {DNAmixturesLite}

R Documentation

Statistical Inference for Mixed Samples of DNA (Lite-Version)

Description

Tools for statistical inference for one or multiple DNA mixtures.

IMPORTANT: This is the DNAmixturesLite package, which is intended as a service to enable users to try DNAmixtures without purchasing a commercial licence for Hugin. When at all possible, we strongly recommend the use of DNAmixtures rather than this lite-version. See https://dnamixtures.r-forge.r-project.org/ for details on both packages.

While the lite-version seeks to provide the full functionality of DNAmixtures, note that computations are much less efficient and that there are some differences in available functionality. Be aware that the present documentation is copied from DNAmixtures and thus may not accurately describe the implementation of this lite-version.

Details

The package implements a statistical model for analysis of one or more mixed samples of DNA in the possible presence of dropout and stutter. Details of the model can be found in Cowell et. al (2013), and details on the model checking tools and Bayesian network structure can be found in Graversen and Lauritzen (2014).

Any hypothesis involving unknown contributors relies on computations in a Bayesian network. For performing such computations, DNAmixtures package relies on Hugin (https://www.hugin.com) through the R-package RHugin. For an installation guide, see the package webpage https://dnamixtures.r-forge.r-project.org.

Although DNAmixtures can be installed with only the free version of Hugin, the size of the networks will in practice require the full licence. In theory, the implementation allows analysis with an arbitrary number of unknown contributors. However, in practice, depending on hardware and time-constraints working with up to 5 or 6 unknown contributors seems realistic.

Summary of the statistical model

The statistical model jointly models the observed peak heights and the set of contributors to the DNA sample(s). In the event of analysing multiple DNA mixtures, the union of the contributors is used as the contributor set for each mixture. By allowing a contribution of zero, we cover the case of a contributor not having contributed to a particular mixture.

Genotypes for unknown contributors are modelled using allele-frequencies from a database specified by the user. The database is also used to define the range of alleles at each marker. A genotype for an unknown contributor is represented by a vector of allele counts n_{ia}, counting for each allele a the number of alleles i that a person possesses; in the network for a marker, the allele count n_{ia} is represented by a variable n_i_a. The vector of allele counts follows a multinomial distribution with \sum_i n_{ia} = 2 and the specified allele frequencies. It is assumed that genotypes are independent across markers and between contributors. If desired, the database of allele frequencies may be corrected for F_st or sampling adjustment before use.

Peak heights are assumed mutually independent and their distributions for a fixed set of DNA profiles are modelled using gamma distributions. The peak height for allele a in EPG r is assumed to follow a gamma distribution with scale parameter \eta_r and shape parameter

\rho_r \sum_a ((1-\xi_{ra})n_{ia} + \xi_{r,a+1} n_{i,a+1})\phi_{ri}.

Applying a detection threshold C_r\ge 0, any peak height falling below the threshold is considered to be 0. The peak heights are denoted by height1, ..., heightR.

The model parameters are for each DNA mixture

\phi: The proportions of DNA from each contributor.
\rho: Amplification parameter, which will be larger for larger amounts of DNA amplified.
\eta: Scale parameter for the gamma distribution.
\xi: Mean stutter percentage. Allele a uses stutter parameter \xi_a = \xi if the allele a-1 is included in the model, and \xi_a = 0 otherwise

An alternative parametrisation uses \mu = \rho \eta and \sigma = 1/\sqrt{\rho}, which can be interpreted as the mean peak height and the coefficient of variation respectively. Besides being interpretable, an advantage of this reparametrisation is that the parameters are fairly orthogonal.

The model assumes the model parameters to be the same across markers. Relaxations of these assumptions are not implemented here.

Computation by auxiliary variables

The computational approach of the implementation of this package is discussed in Graversen and Lauritzen (2014).

The Bayesian networks include three types of auxiliary variables O, D, and Q; these can be thought of as representing the observed peak heights, the absence/presence of peaks, and the peak height distribution function. Note that if invalid tables are set – for instance if very extreme parameter values are used, or if the vector of mixture proportions is mis-labeled – then any subsequent propagation will fail. No roll-back functionality has so far been implemented to fix this, and the easiest solution is to re-fit the mixture model.

The workhorses of this package are the functions setCPT.O, setCPT.D and setCPT.Q for setting the conditional probability tables for the three types of auxiliary variables according to specified peak heights and model parameters.

Amelogenin

As an experiment, it is possible to add the marker Amelogenin, provided that the marker is named "AMEL" and that the coding of alleles X and Y is of a particular form. One example of a suitable form is the coding X = 0 and Y = 1. The allele frequencies used should then also contain a marker "AMEL", and here frequencies have a slightly different interpretation than for the rest of the markers; as all people possess one X, the frequencies of X and Y denote the presence of an additional X or Y respectively, and thus the frequencies correspond directly to the proportions of the two genders.

Author(s)

Therese Graversen theg@itu.dk

References

Details on the implemented model may be found in

Cowell, R. G., Graversen, T., Lauritzen, S., and Mortera, J. (2015). Analysis of Forensic DNA Mixtures with Artefacts. With supplementary material documenting the analyses using DNAmixtures. Journal of the Royal Statistical Society: Series C (Applied Statistics). Volume 64, Issue 1, pages 1-48.

Graversen, T. (2014) Statistical and Computational Methodology for the Analysis of Forensic DNA Mixtures with Artefacts. DPhil. University of Oxford. https://ora.ox.ac.uk/objects/uuid:4c3bfc88-25e7-4c5b-968f-10a35f5b82b0.

Graversen, T. and Lauritzen, S. (2014). Computational aspects of DNA mixture analysis. Statistics and Computing, DOI: 10.1007/s11222-014-9451-7.

Examples

## Analysing trace MC18, using USCaucasian allele-frequencies.
data(MC18, USCaucasian, SGMplusDyes)
## The prosecution hypothesis could be that K1, K2, and K3 have
## contributed to the trace. Setting detection threshold to 50rfu.
## For layout as an EPG we follow the layout of SGMplus.
mixHp <- DNAmixture(list(MC18), k = 3, K = c("K1", "K2", "K3"), C = list(50),
                  database = USCaucasian, dyes = list(SGMplusDyes))
plot(mixHp, epg = TRUE)
## The yellow is a bit difficult to see; we can
## change the colors representing the dyes
plot(mixHp, epg = TRUE, dyecol = list(c("blue", "green", "black")))

## Fit by maximum likelihood using p as the initial value for optimisation
p <- mixpar(rho = list(30), eta = list(34), xi = list(0.08),
            phi = list(c(K1 = 0.71, K3 = 0.1, K2 = 0.19)))
mlHp <- mixML(mixHp, pars = p)
mlHp$mle

## Find the estimated covariance matrix of the MLE
V.Hp <- varEst(mixHp, mlHp$mle, npars = list(rho=1,eta=1,xi=1,phi=1))
V.Hp$cov ## using (rho, eta)
## The summary is a table containing the MLE and their standard errors
summary(V.Hp)

## Assess the fit of the distribution of peak heights
qqpeak(mixHp, pars = mlHp$mle, dist = "conditional")

## Assess the prediction of allele presence
pr <- prequential.score(mixHp, pars = mlHp$mle)
plot(pr)

## Compare observed peak heights to peak heights simulated under the model
sims <- rPeakHeight(mixHp, mlHp$mle, nsim = 100, dist = "conditional")
oldpar <- par(mfrow = c(2,5), mar = c(2,2,2,0))
boxplot(mixHp, sims)
par(oldpar)


data(MC18, USCaucasian, SGMplusDyes)

## A defence hypothesis could be that one unknown person has
## contributed along with K1 and K3.
mixHd <- DNAmixture(list(MC18), k = 3, K = c("K1", "K3"), C = list(50),
                  database = USCaucasian, dyes = list(SGMplusDyes))

## Fit by ML
mlHd <- mixML(mixHd, pars = mixpar(rho = list(30), eta = list(34), xi = list(0.08),
                       phi = list(c(K1 = 0.71, K3 = 0.1, U1 = 0.19))))
mlHd$mle

## Find the estimated covariance matrix of the MLE
V <- varEst(mixHd, mlHd$mle, npars = list(rho=1,eta=1,xi=1,phi=1))
summary(V)

## Assess the peak height distribution
qq <- qqpeak(mixHd, pars = mlHd$mle, dist = "conditional")

## Assess the prediction of allele presence
pr <- prequential.score(mixHd, pars = mlHd$mle)
plot(pr, col = pr$trace)

## Compare observed peak heights to peak heights simulated under the model
sims <- rPeakHeight(mixHd, mlHd$mle, nsim = 100, dist = "conditional")
par(mfrow = c(2,5), mar = c(2,2,2,0))
boxplot(mixHd, sims)


## The log10 of likelihood-ratio can be found as
(mlHp$lik - mlHd$lik)/log(10)


## We can find the most probable genotypes for U1 under the hypothesis
## Hd : K1, K2, and U1.

## Include the peak heights
setPeakInfo(mixHd, mlHd$mle)
## Jointly best for all unknown contributors
mp <- map.genotypes(mixHd, pmin = 0.01, type = "seen")
## See the genotypes rather than allelecounts
print(summary(mp), marker = "D2S1338")

## Include only information on presence and absence of alleles
setPeakInfo(mixHd, mlHd$mle, presence.only = TRUE)
## Jointly best for all unknown contributors
mp2 <- map.genotypes(mixHd, pmin = 0.01, type = "seen")
## See the genotypes rather than allelecounts
print(summary(mp2), marker = "D2S1338")

[Package DNAmixturesLite version 0.0-1 Index]