inferHaplotypes {PolyHaplotyper} | R Documentation |
infer haplotypes for one or more haploblocks
Description
Infer haplotypes for one or more haploblocks, for all individuals, using FS (full-sib) families (with parents) if present, and infer haplotypes for non-FS material as well.
Usage
inferHaplotypes(mrkDosage, indiv=NULL, ploidy, haploblock,
parents=NULL, FS=NULL, minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025,
maxmrk=0, dropUnused=TRUE, maxparcombs=150000, minPseg=1e-8,
knownHap=integer(0), progress=TRUE, printtimes=FALSE, ahcdir)
Arguments
mrkDosage |
matrix or data.frame. Markers are in rows, individuals in columns, each cell has a marker dosage. Names of individuals are the column names, marker names are the row names or (if a data.frame) in a column named MarkerNames. All marker dosages must be in 0:ploidy or NA. |
indiv |
NULL (default) or a character vector with names of all individuals
to be considered. If NULL, all columns of mrkDosage are selected. |
ploidy |
the ploidy of the individuals; all marker dosages should be in 0:ploidy or NA |
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. Haplotype names are constructed from the haploblock names, which are used as prefixes to which the (zero-padded) haplotype numbers are appended with separator '_'. |
parents |
a matrix with one row for each FS family and two columns for the two parents, containing the names of the female and male parent of each family. |
FS |
a list of character vectors. Each character vector has the names of the individuals of one FS family. The items of the list should correspond to the rows of the parents matrix, in the same order. |
minfrac |
vector of two fractions, default 0.1 and 0.01. A haplotype is considered to be certainly present if it must occur in at least a fraction minfrac[1] of all individuals; in the final stage for the "other" individuals (those that do not belong to the FS or its parents) this fraction is lowered to minfrac[2]; see also inferHaps_noFS |
errfrac |
default 0.025. The assumed fraction of marker genotypes with an error (over all markers in the haploblock). The errors are assumed to be uniformly distributed over all except the original marker dosage combinations (mrkdids) |
DRrate |
default 0.025. The rate of double reduction per meiosis (NOT per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD (each with a frequency of 0.16) |
maxmrk |
Haploblocks with more than maxmrk markers will be skipped. Default 0: no haploblocks are skipped |
dropUnused |
TRUE (default) if the returned matrix should only contain rows for haplotypes that are present; if FALSE matrix contains rows for all possible haplotypes |
maxparcombs |
Parent 1 and 2 both may have multiple possible haplotype combinations. For each pair of haplotype combinations (one from P1 and one from P2) the expected FS segregation must be checked against the observed. This may take a long time if many such combinations need to be checked. This parameter sets a limit to the number of allowed combinations per haploblock; default 150000 takes about 45 min. |
minPseg |
default 1e-8. The minimum P-value of a chisquared test for segregation in FS families. The best solution for an FS family is selected based on a combination of P-value and number of required haplotypes, among all candidate solutions with a P-value of at least minPseg. If no such solution is found the FS and its parents are treated as unrelated material |
knownHap |
integer vector with haplotype numbers (haplotypes that must be present according to prior inference or knowledge, numbers refer to rows of matrix produced by allHaplotypes); default integer(0), i.e. no known haplotypes |
progress |
if TRUE, and new haplotype combinations need to be calculated, and the number of markers and the ploidy are both >= 6, progress is indicated by printed messages |
printtimes |
if TRUE, the time needed to process each haploblock is printed |
ahcdir |
a single directory, or not specified.
inferHaplotypes uses lists that for each combination of marker
dosages give all possible combinations of haplotype dosages. These lists
(ahclist and ahccompletelist) are loaded and saved at the directory
specified by ahcdir. If no ahcdir is specified it is set to the current
working directory. |
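The input structures described above can be sketched in base R as follows. This is a minimal illustration only: the marker, individual and haploblock names and all dosages are invented, and the checks are a convenience for the reader, not something the package is documented to perform in this form.

```r
# Hypothetical toy data: 4 markers, 5 tetraploid individuals (all names invented)
ploidy <- 4
mrkDosage <- matrix(
  c(0, 1, 2, NA, 4,
    1, 1, 0,  2, 3,
    4, 3, 2,  1, 0,
    2, 2, 2,  2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("mrk", 1:4),
                  c("P1", "P2", "F1a", "F1b", "other1"))
)

# Every dosage must be in 0:ploidy or NA
dosage_ok <- all(is.na(mrkDosage) | mrkDosage %in% 0:ploidy)

# haploblock: a named list of character vectors, one item per haploblock
haploblock <- list(blockA = c("mrk1", "mrk2"),
                   blockB = c("mrk3", "mrk4"))

# parents: one row per FS family, columns hold the female and male parent
parents <- matrix(c("P1", "P2"), nrow = 1)

# FS: one character vector of offspring names per FS family,
# in the same order as the rows of parents
FS <- list(c("F1a", "F1b"))

# Simple consistency checks between the structures
all_markers_known  <- all(unlist(haploblock) %in% rownames(mrkDosage))
all_indiv_known    <- all(c(parents, unlist(FS)) %in% colnames(mrkDosage))
fs_matches_parents <- length(FS) == nrow(parents)
```

Individual "other1" belongs to no FS family and is not a parent, so it would fall under the "other" material to which minfrac[2] applies.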
Details
First we consider the case where one or more FS families and their
parents are present in the set of samples. In that case, initially the
possible haplotype configurations of the parents are determined.
From that, all their possible gametes (assuming polysomic
inheritance) are calculated and all possible FS haplotype configurations.
Comparing this with the observed FS marker dosages the most likely parental
and FS configurations are found.
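As an illustration of the gamete step, the following base R sketch computes the gamete frequencies of one tetraploid parent under polysomic inheritance, with the double reduction rate defined per meiosis as in the DRrate argument. The function gamete_freqs is hypothetical (it is not part of PolyHaplotyper) and it handles only the tetraploid case; how the package computes gametes internally is not shown here.

```r
# Hypothetical helper, tetraploid only: gamete frequencies of one parent
gamete_freqs <- function(parent_haps, DRrate) {
  stopifnot(length(parent_haps) == 4)
  # non-DR gametes: every unordered pair of two different homologues
  pairs <- combn(4, 2)
  nonDR <- apply(pairs, 2,
                 function(ix) paste(sort(parent_haps[ix]), collapse = ""))
  # DR gametes: two copies of the same homologue
  DR <- paste0(parent_haps, parent_haps)
  gametes <- c(nonDR, DR)
  # DRrate is split evenly over the DR gametes, the rest over the non-DR ones
  prob <- c(rep((1 - DRrate) / length(nonDR), length(nonDR)),
            rep(DRrate / length(DR), length(DR)))
  # sum the probabilities of identical gametes (matters if homologues repeat)
  vapply(split(prob, gametes), sum, numeric(1))
}

freqs <- gamete_freqs(c("A", "B", "C", "D"), DRrate = 0.04)
```

With DRrate = 0.04 this reproduces the figures given for that argument: each of AA, BB, CC and DD gets 0.01 and each of the six non-DR gametes gets 0.16.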
It is possible that multiple parental combinations can explain the observed
marker dosages in the FS. In that case, if one is clearly more likely and/or
needs fewer haplotypes, that one is chosen. If there is no clear best solution
still the parents and FS individuals that have the same haplotype
configuration over all likely solutions are assigned that configuration.
For FS where no good solution is found (because of an error in the marker
dosages of a parent, or because the correct solution was not considered) the
parents and individuals will be considered as unrelated material.
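Whether a candidate solution counts as "good" depends on a chi-squared test of the FS segregation against the minPseg threshold. The sketch below shows the general idea with stats::chisq.test; the genotype classes, counts and expected frequencies are invented, and the exact test used inside the package may differ.

```r
# Hypothetical FS family: expected genotype-class frequencies under one
# candidate parental configuration, and observed counts (numbers invented)
expected_freq <- c(g1 = 0.25, g2 = 0.50, g3 = 0.25)
observed      <- c(g1 = 22,   g2 = 55,   g3 = 23)
minPseg <- 1e-8

# Goodness-of-fit test of the observed counts against the expected segregation
fit <- chisq.test(x = observed, p = expected_freq)
acceptable <- fit$p.value >= minPseg

# The same counts against a poorly fitting candidate configuration;
# small expected counts trigger a warning, which is suppressed here
bad_freq <- c(g1 = 0.98, g2 = 0.01, g3 = 0.01)
bad_fit  <- suppressWarnings(chisq.test(x = observed, p = bad_freq))
rejected <- bad_fit$p.value < minPseg
```

A candidate whose P-value falls below minPseg is not considered; if that holds for every candidate, the FS family and its parents are treated as unrelated material.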
If several FS families share common parents they are treated as a group,
and only solutions are considered that are acceptable for all families
in the group.
Finally (or if no FS families are present, immediately) the other samples
are haplotyped, which are considered as unrelated material. If FS families
have been solved the haplotypes in their parents are considered "known",
and known haplotypes can also be supplied (parameter knownHap). For these
samples we consecutively add haplotypes that must be present in a minimum
number of individuals, always trying to minimize the number of needed
haplotypes.
inferHaplotypes uses tables that, for each combination of dosages of the
markers in the haploblock, list all haplotype combinations (ahc) that
result in these marker dosages. In principle inferHaplotypes uses a list
(ahccompletelist) that, for a given ploidy, has all the haplotype combinations
for haploblocks from 1 up to some maximum number of markers. This list can be
computed with function build_ahccompletelist. If this list is not available
(or if some haploblocks contain more markers than the list), the ahc for
the (extra) marker dosage combinations are calculated as needed and saved in an ahclist.
See the PolyHaplotyper vignette for an illustrated explanation.
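To make the idea of an ahc table concrete, the following base R sketch enumerates, for a small case (diploid, two markers), all haplotype combinations that produce each marker dosage combination. The haplotype numbering and the string keys are choices of this sketch and may not match the numbering of allHaplotypes or the format of the package's ahclist and ahccompletelist.

```r
ploidy <- 2
nmrk   <- 2

# All 2^nmrk haplotypes as rows of 0/1 alleles (row order is this sketch's
# own numbering)
haps <- as.matrix(expand.grid(rep(list(0:1), nmrk)))
dimnames(haps) <- list(NULL, paste0("mrk", seq_len(nmrk)))

# All multisets of `ploidy` haplotypes, kept as non-decreasing haplotype numbers
combs <- as.matrix(expand.grid(rep(list(seq_len(nrow(haps))), ploidy)))
combs <- combs[apply(combs, 1, function(x) !is.unsorted(x)), , drop = FALSE]

# Marker dosages implied by each haplotype combination
dos <- matrix(0, nrow = nrow(combs), ncol = nmrk)
for (p in seq_len(ploidy)) {
  dos <- dos + haps[combs[, p], , drop = FALSE]
}
key <- apply(dos, 1, paste, collapse = "-")

# For each marker dosage combination: all haplotype combinations producing it
ahc <- split(apply(combs, 1, paste, collapse = ","), key)
```

In this small case only the dosage combination "1-1" is ambiguous: it can arise from two different haplotype pairs, which is the kind of ambiguity that the FS segregation and the population-wide haplotype counts are used to resolve.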
Value
a list with for each haploblock one item that itself is a list
with items:
message; if this is "" the haploblock is processed and further
elements are present; else this message says why the haploblock was
skipped (currently only if it contains too many markers)
hapdos: a matrix with the dosages of each haplotype (in rows) for each
individual (in columns). For each individual the haplotype dosages
sum to the ploidy. If dropUnused is TRUE only the haplotypes that
occur in the population are shown, else all haplotypes
mrkdids: a vector of the mrkdid (marker dosage ID) for each individual
(each combination of marker dosages has its own ID; if any of the
markers has an NA dosage the corresponding mrkdid is also NA).
The mrkdids can be converted to the marker dosages with function
mrkdid2mrkdos.
markers: a vector with the names of the markers in the haploblock
imputedGeno: a matrix in the same format as param mrkDosage, with one row
for each marker in the haploblock and one column per imputed
individual, with the dosages of the markers. These are the individuals
that have incomplete data in mrkDosage but where the available marker
dosages match only one of the expected marker genotypes in the FS
family (only individuals in FS families are imputed). It is possible
that an individual with imputed marker dosages is not haplotyped (as
is the case for individuals with complete marker data) if the
marker dosages match different possible haplotype combinations.
The next elements are only present if one or more FS families were
specified:
FSfit: a logical vector with one element per FS family; TRUE if a (or
more than one) acceptable solution for the FS is found (although
if multiple solutions are found they might not be used if unclear
which one is the best solution). (Even if no
solution was found for an FS, still its individuals may have a
haplotype combination assigned ignoring their pedigree)
FSmessages: a character vector with one item per FS family: any
message relating to the fitting of a model for that FS,
not necessarily an error
FSpval: a vector of the chi-squared P-value associated with the selected
FS model for each FS family, or the maximum P value over all
models in case none was selected
If for new combinations of marker dosages the possible haplotype combinations
have to be calculated, an ahclist file is written to ahcdir
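The returned structure can be inspected along the following lines. The sketch uses a hand-made mock list that only mimics the documented elements (message, hapdos, markers); all values, the haplotype row names and the zero-padding width are assumptions, and a real result would come from inferHaplotypes.

```r
# Mock of the documented structure for one haploblock (all values invented)
ploidy <- 4
mock <- list(blockA = list(
  message = "",
  # matrix() fills by column: each group of three values is one individual
  hapdos = matrix(c(2, 1, 1,
                    0, 3, 1,
                    4, 0, 0),
                  nrow = 3,
                  dimnames = list(c("blockA_01", "blockA_03", "blockA_04"),
                                  c("ind1", "ind2", "ind3"))),
  markers = c("mrk1", "mrk2")
))

# A haploblock was processed if its message is the empty string
processed <- vapply(mock, function(b) identical(b$message, ""), logical(1))

# For processed haploblocks the haplotype dosages of each individual
# should sum to the ploidy
sums_ok <- vapply(mock[processed],
                  function(b) all(colSums(b$hapdos) == ploidy),
                  logical(1))
```

Checking message first avoids accessing hapdos for haploblocks that were skipped, since the further elements are only present when message is "".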
Examples
# this example takes about 1 minute to run:
data(PolyHaplotyper_small)
results <- inferHaplotypes(mrkDosage=phdos, ploidy=6,
                           haploblock=phblocks, parents=phpar, FS=phFS)
# one item per haploblock:
names(results)
# the elements returned for the first haploblock:
names(results[[1]])