R: Calculate distance between two gene expression data sets

disteg {lineup}

R Documentation

Calculate distance between two gene expression data sets

Description

Calculate a distance between all pairs of individuals, comparing the observed eQTL genotypes in a cross with the eQTL genotypes inferred from a gene expression data set.

Usage

disteg(
  cross,
  pheno,
  pmark,
  min.genoprob = 0.99,
  k = 20,
  min.classprob = 0.8,
  classprob2drop = 1,
  repeatKNN = TRUE,
  max.selfd = 0.3,
  phenolabel = "phenotype",
  weightByLinkage = FALSE,
  map.function = c("haldane", "kosambi", "c-f", "morgan"),
  verbose = TRUE
)

Arguments

`cross`	An object of class `"cross"` containing data for a QTL experiment. See the help file for `qtl::read.cross()` in the R/qtl package (https://rqtl.org). There must be a phenotype named `"id"` or `"ID"` that contains the individual identifiers.
`pheno`	A data frame of phenotypes (generally gene expression data), stored as individuals x phenotypes. The row names must contain individual identifiers.
`pmark`	Pseudomarkers that are closest to the genes in `pheno`, as output by `find.gene.pseudomarker()`.
`min.genoprob`	Threshold on genotype probabilities; if an individual's maximum genotype probability is less than this, its observed genotype is taken as `NA`.
`k`	Number of nearest neighbors to consider in forming a k-nearest neighbor classifier.
`min.classprob`	Minimum proportion of neighbors with a common class to make a class prediction.
`classprob2drop`	If an individual is inferred to have a genotype mismatch with classprob > this value, treat it as an outlier, drop it from the analysis, and then repeat the KNN construction without it.
`repeatKNN`	If TRUE, repeat the k-nearest neighbor classification a second time, after omitting individuals that do not appear to be self-self matches.
`max.selfd`	Threshold on distance from self (the proportion of mismatches between observed and predicted eQTL genotypes); individuals with a larger self-distance are excluded from the second round of k-nearest neighbor.
`phenolabel`	Label for expression phenotypes to place in the output distance matrix.
`weightByLinkage`	If TRUE, weight the eQTL to account for their relative positions (for example, two tightly linked eQTL would each count about 1/2 of an isolated eQTL).
`map.function`	Map function relating genetic distance to recombination fraction; used only if `weightByLinkage` is TRUE.
`verbose`	If TRUE, give verbose output.

Details

We consider the expression phenotypes in batches, by which pseudomarker they are closest to. For each batch, we pull the genotype probabilities at the corresponding pseudomarker and use the individuals that are in common between cross and pheno and whose maximum genotype probability is above min.genoprob, to form a classifier of eQTL genotype from expression values, using k-nearest neighbor (the function class::knn()). The classifier is applied to all individuals with expression data, to give a predicted eQTL genotype. (If the proportion of the k nearest neighbors with a common class is less than min.classprob, the predicted eQTL genotype is left as NA.)
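The classification step for a single pseudomarker can be sketched roughly as follows. This is a minimal illustration on simulated data, not the package's internal code: the objects `genoprob`, `expr`, `obsg`, and `infg` are made up here, and only `class::knn()` and the thresholds `min.genoprob`, `k`, and `min.classprob` come from the text above.

```r
library(class)  # provides knn()

set.seed(1)
n <- 300
true_geno <- sample(1:3, n, replace = TRUE)

# simulated genotype probabilities at one pseudomarker (each row sums to 1)
genoprob <- matrix(0.0025, nrow = n, ncol = 3)
genoprob[cbind(1:n, true_geno)] <- 0.995
genoprob[1:20, ] <- 1/3   # 20 individuals with uninformative genotypes

# one simulated expression trait with a strong local eQTL
expr <- matrix(rnorm(n, mean = 3 * true_geno, sd = 0.3), ncol = 1)

min.genoprob <- 0.99
k <- 20
min.classprob <- 0.8

# observed genotype: the most probable one, or NA if not confident enough
obsg <- apply(genoprob, 1, which.max)
obsg[apply(genoprob, 1, max) < min.genoprob] <- NA

# train on individuals with a confident observed genotype; predict for everyone
train <- !is.na(obsg)
pred <- knn(expr[train, , drop = FALSE], expr,
            cl = factor(obsg[train]), k = k, prob = TRUE)

# proportion of neighbors voting for the winning class
classprob <- attr(pred, "prob")
infg <- as.integer(as.character(pred))
infg[classprob < min.classprob] <- NA   # leave uncertain predictions as NA

table(observed = obsg, inferred = infg, useNA = "ifany")
```

Note that `k` needs to be small relative to the number of training individuals in each genotype class; otherwise no class can reach `min.classprob` and every prediction is left as NA.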

If repeatKNN is TRUE, we repeat the construction of the k-nearest neighbor classifier after first omitting individuals whose proportion of mismatches between observed and inferred eQTL genotypes is greater than max.selfd.
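As a rough sketch of that filtering step (with toy genotype matrices invented for illustration; `selfd` and `keep` are hypothetical names, not objects returned by the package), the individuals retained for the second round might be identified like this:

```r
# toy observed and inferred eQTL genotypes: 6 individuals x 5 eQTL
obsg <- matrix(c(1, 2, 3, 1, 2,
                 2, 2, 1, 3, 1,
                 3, 1, 2, 2, 3,
                 1, 1, 1, 2, 2,
                 2, 3, 3, 1, 1,
                 3, 3, 2, 1, 2),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("ind", 1:6), paste0("eqtl", 1:5)))
infg <- obsg
infg["ind4", ] <- c(3, 2, 2, 1, 1)  # ind4 disagrees at every eQTL
infg["ind5", 1] <- 1                # ind5 has a single mismatch
infg["ind2", 5] <- NA               # one prediction left uncalled

max.selfd <- 0.3

# proportion of mismatches with self, ignoring missing values
selfd <- rowMeans(obsg != infg, na.rm = TRUE)

# individuals to retain when rebuilding the classifier
keep <- is.na(selfd) | selfd <= max.selfd
keep
```

Here only `ind4` exceeds `max.selfd`, so it would be left out when the classifier is rebuilt.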

Finally, we calculate the distance between the observed eQTL genotypes for each individual in cross and the inferred eQTL genotypes for each individual in pheno, as the proportion of mismatches between the observed and inferred eQTL genotypes.
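The distance calculation can be illustrated with a small base-R sketch. The matrices below are toy data, and the loop is only one straightforward way to compute the mismatch proportions; it is not taken from the package source.

```r
# observed eQTL genotypes for individuals in the cross (rows) at 4 eQTL
obsg <- rbind(a = c(1, 2, 3, 1),
              b = c(2, 2, 1, 3),
              c = c(3, 1, 2, NA))

# inferred eQTL genotypes for individuals with expression data
infg <- rbind(a = c(1, 2, 3, 1),
              b = c(3, 1, 2, 2),   # resembles cross individual "c"
              c = c(2, 2, 1, 3))   # resembles cross individual "b"

n_cross <- nrow(obsg)
n_expr <- nrow(infg)
d <- denom <- matrix(NA_real_, n_cross, n_expr,
                     dimnames = list(rownames(obsg), rownames(infg)))

for (i in seq_len(n_cross)) {
  for (j in seq_len(n_expr)) {
    # use only eQTL where both genotypes are available
    ok <- !is.na(obsg[i, ]) & !is.na(infg[j, ])
    denom[i, j] <- sum(ok)
    d[i, j] <- if (any(ok)) mean(obsg[i, ok] != infg[j, ok]) else NA_real_
  }
}

d      # rows: cross individuals; columns: expression individuals
denom  # number of eQTL compared for each pair
```

In this toy example, the small off-diagonal distances for the pairs (b, c) and (c, b), together with the large self-distances for b and c, are the pattern one would expect from a sample swap.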

If weightByLinkage is TRUE, we use weights on the mismatch proportions for the various eQTL, taking into account their linkage. Two tightly linked eQTL will each be given half the weight of a single isolated eQTL.
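One way to obtain weights with this property is sketched below. This is an assumption made for illustration, not necessarily the formula the package uses: each eQTL gets weight 1 / sum over eQTL of (1 - 2r), where r is the recombination fraction from the Haldane map function and unlinked eQTL have r = 0.5. The names `haldane`, `chr`, `pos`, and `wts` are hypothetical.

```r
# Haldane map function: genetic distance (cM) -> recombination fraction
haldane <- function(d_cM) 0.5 * (1 - exp(-2 * d_cM / 100))

# four eQTL: two tightly linked on chr 1, plus one each on chr 2 and chr 3
chr <- c("1", "1", "2", "3")
pos <- c(10.0, 10.1, 40.0, 25.0)   # positions in cM

rf <- haldane(abs(outer(pos, pos, "-")))
rf[outer(chr, chr, "!=")] <- 0.5   # different chromosomes: unlinked

# assumed weighting scheme (illustrative only)
wts <- 1 / rowSums(1 - 2 * rf)
round(wts, 3)

# weighted versus unweighted mismatch proportion for one pair of individuals
mismatch <- c(TRUE, TRUE, FALSE, FALSE)
unweighted <- mean(mismatch)
weighted <- sum(wts * mismatch) / sum(wts)
c(unweighted = unweighted, weighted = weighted)
```

Under this scheme the two linked eQTL each receive a weight close to 1/2 and the isolated ones receive weight 1, so two mismatches at the linked pair count roughly as one independent mismatch.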

Value

A matrix with nind(cross) rows and nrow(pheno) columns, containing the distances. The individual IDs are in the row and column names. The matrix is assigned class "lineupdist".

The names of the genes that were used to construct the classifier are saved in an attribute "retained".

The observed and inferred eQTL genotypes are saved as attributes "obsg" and "infg".

The denominators of the proportions that form the inter-individual distances are in the attribute "denom".

Author(s)

Karl W Broman, broman@wisc.edu

Examples

library(qtl)
library(lineup)

# load example data
data(f2cross, expr1, pmap, genepos)


# calculate QTL genotype probabilities
f2cross <- calc.genoprob(f2cross, step=1)

# find nearest pseudomarkers
pmark <- find.gene.pseudomarker(f2cross, pmap, genepos)

# line up individuals
id <- findCommonID(f2cross, expr1)

# calculate LOD score for local eQTL
locallod <- calc.locallod(f2cross[,id$first], expr1[id$second,], pmark)

# take those with LOD > 25
expr1s <- expr1[,locallod>25,drop=FALSE]

# calculate distance between individuals
#     (prop'n mismatches between obs and inferred eQTL geno)
d <- disteg(f2cross, expr1s, pmark)

# plot distances
plot(d)

# summary of apparent mix-ups
summary(d)

# plot of classifier for first and second eQTL
par(mfrow=c(2,1), las=1)
plotEGclass(d)
plotEGclass(d, 2)

[Package lineup version 0.44 Index]