R: Extremely fast linkage map construction for data frame...

mstmap.data.frame {ASMap}

R Documentation

Extremely fast linkage map construction for data frame objects using MSTmap.

Description

Extremely fast linkage map construction for data frame objects utilizing the source code for MSTmap (see Wu et al., 2008). The construction includes linkage group clustering, marker ordering and genetic distance calculations.

Usage

## S3 method for class 'data.frame'
mstmap(object, pop.type = "DH", dist.fun = "kosambi",
      objective.fun = "COUNT", p.value = 1e-06, noMap.dist = 15,
      noMap.size = 0, miss.thresh = 1, mvest.bc = FALSE,
      detectBadData = FALSE, as.cross = TRUE, return.imputed = FALSE,
      trace = FALSE, ...)

Arguments

`object`	A `"data.frame"` object containing marker information. The data.frame must explicitly be arranged with markers in rows and genotypes in columns. Marker names are obtained from the `rownames` of the `object` and genotype names are obtained from the `names` component of the `object` (see Details).
`pop.type`	Character string specifying the population type of the data frame `object`. Accepted values are `"DH"` (doubled haploid), `"BC"` (backcross), `"RILn"` (non-advanced RIL population with n generations of selfing) and `"ARIL"` (advanced RIL) (see Details). Default is `"DH"`.
`dist.fun`	Character string defining the distance function used for calculation of genetic distances. Options are `"kosambi"` and `"haldane"`. Default is `"kosambi"`.
`objective.fun`	Character string defining the objective function to be used when constructing the map. Options are `"COUNT"` for minimising the sum of recombination events between markers and `"ML"` for maximising the likelihood objective function. Default is `"COUNT"`.
`p.value`	Numerical value to specify the threshold to use when clustering markers. Defaults to `1e-06`. If a value greater than one is given this feature is turned off inputted marker data are assumed to belong to the same linkage group (see Details).
`noMap.dist`	Numerical value to specify the smallest genetic distance a set of isolated markers can appear distinct from other linked markers. Isolated markers will appear in their own linkage groups ad will be of size specified by `noMap.size`.
`noMap.size`	Numerical value to specify the maximum size of isolated marker linkage groups that have been identified using `noMap.dist`. This feature can be turned off by setting it to 0. Default is 0.
`miss.thresh`	Numerical value to specify the threshold proportion of missing marker scores allowable in each of the markers. Markers above this threshold will not be included in the linkage map. Default is 1.
`mvest.bc`	Logical value. If `TRUE` missing markers will be imputed before clustering the markers into linkage groups. This is restricted to `"BC","DH","ARIL"` populations only (see Details).
`detectBadData`	Logical value. If `TRUE` possible genotyping errors are detected, set to missing and then imputed as part of the marker ordering algorithm. Genotyping errors will also be printed in the file specified by `trace`. This is restricted to `"BC","DH","ARIL"` populations only. (see Details). Default is `FALSE`.
`as.cross`	Logical value. If `TRUE` the constructed linkage map is returned as a qtl cross object (see Details). If `FALSE` then the constructed linkage map is returned as a `data.frame` with extra columns indicating the linkage group, marker name/position and genetic distance. Default is `TRUE`.
`return.imputed`	Logical value. If `TRUE` then the imputed marker probability matrix is returned for the linkage groups that are constructed (see Details). Default is `FALSE`.
`trace`	An automatic tracing facility. If `trace = FALSE` then minimal `MSTmap` output is piped to the screen during the algorithm. If `trace = TRUE`, then detailed output from MSTmap is piped to "`MSToutput.txt`". This file is equivalent to the output that would be obtained from running the MSTmap executable from the command line.
`...`	Currently ignored.

Details

The data frame object must have an explicit format with markers in rows and genotypes in columns. The marker names are required to be in the rownames component and the genotype names are required to be in the names component of the object. In each set of names there must be no spaces. If spaces are detected they are exchanged for a "-". Each of the columns of the data frame must be of class "character" (not factors). If converting from a matrix, this can easily be achieved by using the stringAsFactors = FALSE argument for any data.frame method.

It is important to know what population type the data frame object is and to correctly input this into pop.type. If pop.type = "ARIL" then it is assumed that the minimal number of heterozygotes have been set to missing before proceeding. The advanced RIL population is then treated like a backcross population for the purpose of linkage map construction. Genetic distances are adjusted post construction. For non-advanced RIL populations pop.type = "RILn", the number of generations of selfing is limited to 20 to ensure sensible input.

The content of the markers in object can either be all numeric (see below) or all character. If markers are of type character then the following allelic content must be explicitly adhered to. For pop.type "BC", "DH" or "ARIL" the two allele types should be represented as ("A" or "a") and ("B" or "b"). For non-advanced RIL populations (pop.type = "RILn") phase unknown heterozygotes should be represented as "X". For all populations, missing marker scores should be represented as ("U" or "-").

This function also extends the functionality of the MSTmap algorithm by allowing users to input a complete numeric data frame of marker probabilities for pop.type "BC", "DH" or "ARIL". The values must be inclusively between 1 (A) and 0 (B) and be representative of the probability that the A allele is present. No missing values are allowed.

The algorithm allows an adjustment of the p.value threshold for clustering of markers to distinct linkage groups (see Wu et al., 2008) and is highly dependent on the number of individuals in the population. As the number of individuals increases the p.value threshold should be decreased accordingly. This may require some trial and error to achieve desired results.

If mvest.bc = TRUE and the population type is "BC","DH","ARIL" then missing values are imputed before markers are clustered into linkage groups. This is only a simple imputation that places a 0.5 probability of the missing observation being one allele or the other and is used to assist the clustering algorithm when there is known to be high numbers of missing observations between pairs of markers.

It should be highlighted that for population types "BC","DH","ARIL", imputation of missing values occurs regardless of the value of mvest.bc. This is achieved using an EM algorithm that is tightly coupled with marker ordering (see Wu et al., 2008). Initially a marker order is obtained omitting missing marker scores and then imputation is performed based on the underlying recombinant probabilities of the flanking markers with the markers containing the missing value. The recombinant probabilities are then recomputed and an update of the pairwise distances are calculated. The ordering algorithm is then run again and the complete process is repeated until convergence. Note, the imputed probability matrix for the linkage map being constructed is returned if return.imputed = TRUE.

For populations "BC","DH","ARIL", if detectBadData = TRUE, the marker ordering algorithm also includes the detection of genotyping errors. For any individual genotype, the detection method is based on a weighted Euclidean metric (see Wu et al., 2008) that is a function of the recombination probabilities of all the markers with the marker containing the suspicious observation. Any genotyping errors detected are set to missing and the missing values are then imputed if mv.est = TRUE. Note, the detection of these errors and their amendment is returned in the imputed probability matrix if return.imputed = TRUE

If as.cross = TRUE then the constructed object is returned as a qtl cross object with the appropriate class structure. For "RILn" populations the constructed object is given the class "bcsft" by using the qtl package conversion function convert2bcsft with arguments F.gen = n and BC.gen = 0. For "ARIL" populations the constructed object is given the class "riself".

If return.imputed = TRUE and pop.type is one of "BC","DH","ARIL", then the marker probability matrix is returned for the linkage groups that have been constructed using the algorithm. Each linkage group is named identically to the linkage groups of the map and, if as.cross = TRUE, contains an ordered "map" element and a "data" element consisting of marker probabilities of the A allele being present (i.e. P(A) = 1, P(B) = 0). Both elements contain a possibly reduced version of the marker set that includes all non-colocating markers as well as the first marker of any set of co-locating markers. If as.cross = FALSE then an ordered data frame of matrix probabilities is returned.

Value

If as.cross = TRUE the function returns an R/qtl cross object with the appropriate class structure. The object is a list with usual components "pheno" and "geno". If as.cross = FALSE the function returns an ordered data frame object with additional columns that indicate the linkage group, the position and marker names and genetic distance of the markers within in each linkage group. If markers were omitted for any reason during the construction, the object will have an "omit" component with all omitted markers in a collated matrix. If return.imputed = TRUE then the object will also contain an "imputed.geno" element.

Author(s)

Julian Taylor, Dave Butler, Timothy Close, Yonghui Wu, Stefano Lonardi

References

Wu, Y., Bhat, P., Close, T.J, Lonardi, S. (2008) Efficient and Accurate Construction of Genetic Linkage Maps from Minimum Spanning Tree of a Graph. Plos Genetics, 4, Issue 10.

Taylor, J., Butler, D. (2017) R Package ASMap: Efficient Genetic Linkage Map Construction and Diagnosis. Journal of Statistical Software, 79(6), 1–29.

Examples


data(mapDH, package = "ASMap")

## forming data frame object from R/qtl object

dfg <- t(do.call("cbind", lapply(mapDH$geno, function(el) el$data)))
dimnames(dfg)[[2]] <- as.character(mapDH$pheno[["Genotype"]])
dfg <- dfg[sample(1:nrow(dfg), nrow(dfg), replace = FALSE),]
dfg[dfg == 1] <- "A"
dfg[dfg == 2] <- "B"
dfg[is.na(dfg)] <- "U"
dfg <- cbind.data.frame(dfg, stringsAsFactors = FALSE)

## construct map

testd <- mstmap(dfg, dist.fun = "kosambi", trace = FALSE)
pull.map(testd)

## let's get a timing on that ...

system.time(testd <- mstmap(dfg, dist.fun = "kosambi", trace = FALSE))

[Package ASMap version 1.0-7 Index]