EstConf {sequoia} | R Documentation |
Confidence Probabilities
Description
Estimate confidence probabilities ('backward') and assignment
error rates ('forward') per category (genotyped/dummy) by repeatedly
simulating genotype data from a reference pedigree using
SimGeno
, reconstructing a pedigree from these simulated data using
sequoia
, and counting the number of mismatches using
PedCompare
.
Usage
EstConf(
Pedigree = NULL,
LifeHistData = NULL,
args.sim = list(nSnp = 400, SnpError = 0.001, ParMis = c(0.4, 0.4)),
args.seq = list(Module = "ped", Err = 0.001, Tassign = 0.5, CalcLLR = FALSE),
nSim = 10,
nCores = 1,
quiet = TRUE
)
Arguments
Pedigree |
reference pedigree from which to simulate, dataframe with columns id-dam-sire. Additional columns are ignored. |
LifeHistData |
dataframe with id, sex (1=female, 2=male, 3=unknown), birth year, and optionally BY.min - BY.max - YearLast. |
args.sim |
list of arguments to pass to SimGeno. |
args.seq |
list of arguments to pass to sequoia. |
nSim |
number of iterations of simulate - reconstruct - compare to perform, i.e. number of simulated datasets. |
nCores |
number of computer cores to use; if > 1, the iterations are run in parallel. |
quiet |
suppress messages. |
Details
The confidence probability is taken as the number of correct (matching) assignments, divided by all assignments made in the observed (inferred-from-simulated) pedigree. In contrast, the false negative & false positive assignment rates are proportions of the number of parents in the true (reference) pedigree. Each rate is calculated separately for dams & sires, and separately for each category (Genotyped/Dummy(fiable)/X (none)) of individual, parent and co-parent.
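The distinction between the two denominators can be sketched with invented counts (not sequoia output; this simplification lumps missing and mismatched parents together):

```r
# 'Backward' confidence divides by assignments made in the inferred pedigree;
# 'forward' error rates divide by parents present in the reference pedigree.
n_assigned <- 95   # dam assignments made in the inferred pedigree (hypothetical)
n_correct  <- 92   # of those, matching the reference pedigree
n_true     <- 100  # dams present in the reference pedigree

conf      <- n_correct / n_assigned             # 'backward' confidence
false_neg <- (n_true - n_correct) / n_true      # true dams missed or mismatched
false_pos <- (n_assigned - n_correct) / n_true  # assignments not in reference
```

Note that `conf` and the error rates need not sum to anything fixed, because they use different denominators.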
This function does not know which individuals in the actual Pedigree are genotyped, so the confidence probabilities need to be added to the Pedigree as shown in the example at the bottom.
A confidence of 1 means all assignments on simulated data were correct for that category-combination. It should be interpreted as (and perhaps modified to) > 1 - 1/N, where sample size N is given in the last column of the ConfProb and PedErrors dataframes in the output. The same applies for a false negative/positive rate of 0 (i.e. to be interpreted as < 1/N).
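Applying that interpretation could look as follows, on a made-up table with the same dam.conf and N columns as ConfProb (the column values are invented):

```r
# Replace an estimated probability of exactly 1 by the lower bound 1 - 1/N
# implied by the sample size N; estimates below 1 are left unchanged.
conf_tab <- data.frame(dam.conf = c(1, 0.98, 1), N = c(50, 200, 10))
conf_tab$dam.conf.adj <- ifelse(conf_tab$dam.conf == 1,
                                1 - 1 / conf_tab$N,  # interpret 1 as > 1 - 1/N
                                conf_tab$dam.conf)
```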
Value
A list, with elements:
ConfProb |
See below |
PedErrors |
See below |
Pedigree.reference |
the pedigree from which data was simulated |
LifeHistData |
the life history data used for the simulations |
Pedigree.inferred |
a list with for each iteration the inferred pedigree based on the simulated data |
SimSNPd |
a list with for each iteration the IDs of the individuals simulated to have been genotyped |
PedComp.fwd |
array with for each iteration the 'forward' comparison counts from PedCompare |
RunParams |
a list with the call to EstConf and the parameter values used |
RunTime |
runtime per iteration, in seconds |
Dataframe ConfProb
has 7 columns:
id.cat , dam.cat , sire.cat |
Category of the focal individual, dam, and sire, in the pedigree inferred based on the simulated data. Coded as G=genotyped, D=dummy, X=none |
dam.conf |
Probability that the dam is correct, given the categories of the assigned dam and sire (ignoring whether or not the sire is correct) |
sire.conf |
as dam.conf, but for the assigned sire |
pair.conf |
Probability that both dam and sire are correct, given their categories |
N |
Number of individuals per category-combination, across all nSim iterations |
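A hypothetical ConfProb-shaped dataframe with the seven columns described above (all values invented for illustration) shows how rows can be selected per category-combination:

```r
# Mock ConfProb rows; NA where no sire was assigned (sire.cat == "X").
ConfProb_demo <- data.frame(
  id.cat   = c("G", "G", "D"),
  dam.cat  = c("G", "D", "G"),
  sire.cat = c("G", "G", "X"),
  dam.conf  = c(0.99, 0.95, 0.97),
  sire.conf = c(0.99, 0.98, NA),
  pair.conf = c(0.98, 0.94, NA),
  N = c(120, 35, 18))
# e.g. confidence for individuals with both assigned parents genotyped:
both_G <- subset(ConfProb_demo, dam.cat == "G" & sire.cat == "G")
```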
Array PedErrors
has three dimensions:
class |
FalseNeg (a parent present in the reference pedigree is missing or mismatched in the inferred pedigree) or FalsePos (an assigned parent is absent from or different in the reference pedigree) |
cat |
Category of individual + parent, as a two-letter code where the first letter indicates the focal individual and the second the parent; G=Genotyped, D=Dummy, T=Total |
parent |
dam or sire |
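A mock array with the three PedErrors dimensions described above illustrates indexing and summing by name (the class labels and error values here are invented for illustration, and only a subset of cat codes is shown):

```r
PedErrors_demo <- array(
  0, dim = c(2, 3, 2),
  dimnames = list(class  = c("FalseNeg", "FalsePos"),
                  cat    = c("GG", "GD", "GT"),
                  parent = c("dam", "sire")))
PedErrors_demo["FalseNeg", "GD", "dam"] <- 0.05
# proportion of true parents correctly assigned, per cat x parent
# (same pattern as in the Examples section):
correct <- 1 - apply(PedErrors_demo, MARGIN = c("cat", "parent"),
                     FUN = sum, na.rm = TRUE)
```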
Assumptions
Because the actual true pedigree is (typically) unknown, the provided
reference pedigree is used as a stand-in and assumed to be the true
pedigree, with unrelated founders. It is also assumed that the probability
to be genotyped is equal for all parents; in each iteration, a new random
set of parents (proportion set by ParMis
) is mimicked to be
non-genotyped. In addition, SNPs are assumed to segregate independently.
An experimental version offering more fine-grained control is available at https://github.com/JiscaH/sequoiaExtra .
Object size
The returned list can become quite large, as each of the inferred pedigrees is included. When running EstConf many times for a range of parameter values, it may be prudent to save only the required summary statistics for each run rather than the full output.
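Trimming a run to its summary elements before saving could be sketched as follows (the element names are from the Value section above; fake_run is a stand-in list, not real EstConf output):

```r
fake_run <- list(
  ConfProb = data.frame(N = 100),
  PedComp.fwd = array(0, dim = c(1, 1, 1)),
  Pedigree.inferred = vector("list", 20),  # the bulky part: drop it
  RunParams = list(nSim = 20),
  RunTime = 12.3)
keep <- c("ConfProb", "PedComp.fwd", "RunParams", "RunTime")
f <- tempfile(fileext = ".RDS")
saveRDS(fake_run[keep], file = f)   # save only the summary elements
slim <- readRDS(f)
```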
Errors
If you have a large pedigree and run this function on multiple cores, you may run into "Cannot allocate vector of size ..." errors or even unexpected crashes: there is not enough computer memory for each separate run. Try reducing 'nCores'.
See Also
SimGeno, sequoia, PedCompare.
Examples
# estimate proportion of parents that are genotyped (= 1 - ParMis)
sumry_grif <- SummarySeq(SeqOUT_griffin, Plot=FALSE)
tmp <- apply(sumry_grif$ParentCount['Genotyped',,,],
MARGIN = c('parentSex', 'parentCat'), FUN = sum)
props <- sweep(tmp, MARGIN='parentCat', STATS = rowSums(tmp), FUN = '/')
1 - props[,'Genotyped'] / (props[,'Genotyped'] + props[,'Dummy'])
# Example for parentage assignment only
conf_grif <- EstConf(Pedigree = SeqOUT_griffin$Pedigree,
LifeHistData = SeqOUT_griffin$LifeHist,
args.sim = list(nSnp = 150, # no. in actual data, or what-if
SnpError = 5e-3, # best estimate, or what-if
CallRate=0.9, # from SnpStats()
ParMis=c(0.39, 0.20)), # calc'd above
args.seq = list(Err=5e-3, Module="par"), # as in real run
nSim = 1, # try-out, proper run >=20 (10 if huge pedigree)
nCores=1)
# parent-pair confidence, per category (Genotyped/Dummy/None)
conf_grif$ConfProb
# Proportion of true parents that was correctly assigned
1 - apply(conf_grif$PedErrors, MARGIN=c('cat','parent'), FUN=sum, na.rm=TRUE)
# add columns with confidence probabilities to pedigree
# first add columns with category (G/D/X)
Ped.withConf <- getAssignCat(Pedigree = SeqOUT_griffin$Pedigree,
SNPd = SeqOUT_griffin$PedigreePar$id)
Ped.withConf <- merge(Ped.withConf, conf_grif$ConfProb, all.x=TRUE,
sort=FALSE) # (note: merge() messes up column order)
head(Ped.withConf[Ped.withConf$dam.cat=="G", ])
# save output summary
## Not run:
conf_grif[['Note']] <- 'You could add a note'
saveRDS(conf_grif[c('ConfProb','PedComp.fwd','RunParams','RunTime','Note')],
file = 'conf_200SNPs_Err005_Callrate80.RDS')
## End(Not run)
## P(actual FS | inferred as FS) etc.
## Not run:
PairL <- list()
for (i in seq_along(conf_grif$Pedigree.inferred)) {  # nSim
cat(i, "\t")
PairL[[i]] <- ComparePairs(conf_grif$Pedigree.reference,
conf_grif$Pedigree.inferred[[i]],
GenBack=1, patmat=TRUE, ExcludeDummies = TRUE,
Return="Counts")
}
# P(actual relationship (Ped1) | inferred relationship (Ped2))
PairRel.prop.A <- plyr::laply(PairL, function(M)
sweep(M, MARGIN='Ped2', STATS=colSums(M), FUN="/"))
PairRel.prop <- apply(PairRel.prop.A, 2:3, mean, na.rm=TRUE) #avg across sims
round(PairRel.prop, 3)
# or: P(inferred relationship | actual relationship)
PairRel.prop2 <- plyr::laply(PairL, function(M)
sweep(M, MARGIN='Ped1', STATS=rowSums(M), FUN="/"))
## End(Not run)
## Not run:
# confidence probability vs. sibship size
source('https://raw.githubusercontent.com/JiscaH/sequoiaExtra/main/conf_vs_sibsize.R')
conf_grif_nOff <- Conf_by_nOff(conf_grif)
conf_grif_nOff['conf',,'GD',]
conf_grif_nOff['N',,'GD',]
## End(Not run)