R: Confidence Probabilities

EstConf {sequoia}

R Documentation

Confidence Probabilities

Description

Estimate confidence probabilities ('backward') and assignment error rates ('forward') per category (genotyped/dummy) by repeatedly simulating genotype data from a reference pedigree using SimGeno, reconstruction a pedigree from this using sequoia, and counting the number of mismatches using PedCompare.

Usage

EstConf(
  Pedigree = NULL,
  LifeHistData = NULL,
  args.sim = list(nSnp = 400, SnpError = 0.001, ParMis = c(0.4, 0.4)),
  args.seq = list(Module = "ped", Err = 0.001, Tassign = 0.5, CalcLLR = FALSE),
  nSim = 10,
  nCores = 1,
  quiet = TRUE
)

Arguments

`Pedigree`	reference pedigree from which to simulate, dataframe with columns id-dam-sire. Additional columns are ignored.
`LifeHistData`	dataframe with id, sex (1=female, 2=male, 3=unknown), birth year, and optionally BY.min - BY.max - YearLast.
`args.sim`	list of arguments to pass to `SimGeno`, such as `nSnp` (number of SNPs), `SnpError` (genotyping error rate) and `ParMis` (proportion of non-genotyped parents). Set to `NULL` to use all default values.
`args.seq`	list of arguments to pass to `sequoia`, such as `Module` ('par' or 'ped'), `Err` (assumed genotyping error rate), and `Complex`. May include (part of) `SeqList`, a list of sequoia output (i.e. as a list-within-a-list). Set to `NULL` to use all default values.
`nSim`	number of iterations of simulate - reconstruct - compare to perform, i.e. number of simulated datasets.
`nCores`	number of computer cores to use. If `>1`, package parallel is used. Set to NULL to use all but one of the available cores, as detected by `parallel::detectCores()` (using all cores tends to freeze up your computer). With large datasets, the amount of computer memory may be the limiting factor for the number of cores you can use.
`quiet`	suppress messages. `TRUE` runs `SimGeno` and `sequoia` quietly, `'very'` also suppresses other messages and the iteration counter when `nCores=1` (there is no iteration counter when `nCores>1`).

Details

The confidence probability is taken as the number of correct (matching) assignments, divided by all assignments made in the observed (inferred-from-simulated) pedigree. In contrast, the false negative & false positive assignment rates are proportions of the number of parents in the true (reference) pedigree. Each rate is calculated separatedly for dams & sires, and separately for each category (Genotyped/Dummy(fiable)/X (none)) of individual, parent and co-parent.

This function does not know which individuals in the actual Pedigree are genotyped, so the confidence probabilities need to be added to the Pedigree as shown in the example at the bottom.

A confidence of 1 means all assignments on simulated data were correct for that category-combination. It should be interpreted as (and perhaps modified to) > 1 - 1/N, where sample size N is given in the last column of the ConfProb and PedErrors dataframes in the output. The same applies for a false negative/positive rate of 0 (i.e. to be interpreted as < 1/N).

Value

A list, with elements:

`ConfProb`	See below
`PedErrors`	See below
`Pedigree.reference`	the pedigree from which data was simulated
`LifeHistData`
`Pedigree.inferred`	a list with for each iteration the inferred pedigree based on the simulated data
`SimSNPd`	a list with for each iteration the IDs of the individuals simulated to have been genotyped
`PedComp.fwd`	array with `Counts` from the 'forward' `PedCompare`, from which `PedErrors` is calculated
`RunParams`	a list with the call to `EstConf` as a semi-nested list (args.sim, args.seq, nSim, nCores), as well as the default parameter values for `SimGeno` and `sequoia`.
`RunTime`	`sequoia` runtime per simulation in seconds, as measured by `system.time()['elapsed']`.

Dataframe ConfProb has 7 columns:

`id.cat`, `dam.cat`, `sire.cat`	Category of the focal individual, dam, and sire, in the pedigree inferred based on the simulated data. Coded as G=genotyped, D=dummy, X=none
`dam.conf`	Probability that the dam is correct, given the categories of the assigned dam and sire (ignoring whether or not the sire is correct)
`sire.conf`	as `dam.conf`, for the sire
`pair.conf`	Probability that both dam and sire are correct, given their categories
`N`	Number of individuals per category-combination, across all `nSim` iterations

Array PedErrors has three dimensions:

`class`	`FalseNeg`(atives): could have been assigned but was not (individual + parent both genotyped or dummyfiable; P1only in `PedCompare`). `FalsePos`(itives): no parent in reference pedigree, but one was assigned based on the simulated data (P2only) `Mismatch`: different parents between the pedigrees
`cat`	Category of individual + parent, as a two-letter code where the first letter indicates the focal individual and the second the parent; G=Genotyped, D=Dummy, T=Total
`parent`	dam or sire

Assumptions

Because the actual true pedigree is (typically) unknown, the provided reference pedigree is used as a stand-in and assumed to be the true pedigree, with unrelated founders. It is also assumed that the probability to be genotyped is equal for all parents; in each iteration, a new random set of parents (proportion set by ParMis) is mimicked to be non-genotyped. In addition, SNPs are assumed to segregate independently.

An experimental version offering more fine-grained control is available at https://github.com/JiscaH/sequoiaExtra .

Object size

The size in Kb of the returned list can become pretty big, as each of the inferred pedigrees is included. When running EstConf many times for a range of parameter values, it may be prudent to save the required summary statistics for each run rather than the full output.

Errors

If you have a large pedigree and try to run this function on multiple cores, you may run into "Cannot allocate vector of size ..." errors or even unexpected crashes: there is not enough computer memory for each separate run. Try reducing 'nCores'.

Examples

# estimate proportion of parents that are genotyped (= 1 - ParMis)
sumry_grif <- SummarySeq(SeqOUT_griffin, Plot=FALSE)
tmp <- apply(sumry_grif$ParentCount['Genotyped',,,],
             MARGIN = c('parentSex', 'parentCat'), FUN = sum)
props <- sweep(tmp, MARGIN='parentCat', STATS = rowSums(tmp), FUN = '/')
1 - props[,'Genotyped'] / (props[,'Genotyped'] + props[,'Dummy'])

# Example for parentage assignment only
conf_grif <- EstConf(Pedigree = SeqOUT_griffin$Pedigree,
               LifeHistData = SeqOUT_griffin$LifeHist,
               args.sim = list(nSnp = 150,   # no. in actual data, or what-if
                               SnpError = 5e-3,  # best estimate, or what-if
                               CallRate=0.9,     # from SnpStats()
                               ParMis=c(0.39, 0.20)),  # calc'd above
               args.seq = list(Err=5e-3, Module="par"),  # as in real run
               nSim = 1,   # try-out, proper run >=20 (10 if huge pedigree)
               nCores=1)

# parent-pair confidence, per category (Genotyped/Dummy/None)
conf_grif$ConfProb

# Proportion of true parents that was correctly assigned
1 - apply(conf_grif$PedErrors, MARGIN=c('cat','parent'), FUN=sum, na.rm=TRUE)

# add columns with confidence probabilities to pedigree
# first add columns with category (G/D/X)
Ped.withConf <- getAssignCat(Pedigree = SeqOUT_griffin$Pedigree,
                             SNPd = SeqOUT_griffin$PedigreePar$id)
Ped.withConf <- merge(Ped.withConf, conf_grif$ConfProb, all.x=TRUE,
                      sort=FALSE)  # (note: merge() messes up column order)
head(Ped.withConf[Ped.withConf$dam.cat=="G", ])

# save output summary
## Not run: 
conf_griff[['Note']] <- 'You could add a note'
saveRDS(conf_grif[c('ConfProb','PedComp.fwd','RunParams','RunTime','Note')],
   file = 'conf_200SNPs_Err005_Callrate80.RDS')

## End(Not run)

## P(actual FS | inferred as FS) etc.
## Not run: 
PairL <- list()
for (i in 1:length(conf_grif$Pedigree.inferred)) {  # nSim
  cat(i, "\t")
  PairL[[i]] <- ComparePairs(conf_grif$Pedigree.reference,
                             conf_grif$Pedigree.inferred[[i]],
                             GenBack=1, patmat=TRUE, ExcludeDummies = TRUE,
                             Return="Counts")
}
# P(actual relationship (Ped1) | inferred relationship (Ped2))
PairRel.prop.A <- plyr::laply(PairL, function(M)
                     sweep(M, MARGIN='Ped2', STATS=colSums(M), FUN="/"))
PairRel.prop <- apply(PairRel.prop.A, 2:3, mean, na.rm=TRUE) #avg across sims
round(PairRel.prop, 3)
# or: P(inferred relationship | actual relationship)
PairRel.prop2 <- plyr::laply(PairL, function(M)
   sweep(M, MARGIN='Ped1', STATS=rowSums(M), FUN="/"))

## End(Not run)

## Not run: 
# confidence probability vs. sibship size
source('https://raw.githubusercontent.com/JiscaH/sequoiaExtra/main/conf_vs_sibsize.R')
conf_grif_nOff <- Conf_by_nOff(conf_grif)
conf_grif_nOff['conf',,'GD',]
conf_grif_nOff['N',,'GD',]

## End(Not run)

[Package sequoia version 2.11.2 Index]