seqVsInsitu {cellOrigins}    R Documentation

Determine the most likely source(s) of a tissue-specific RNAseq dataset

Description

Compares tissue-specific RNA sequencing coverage with high-throughput RNA in situ hybridisation patterns of gene expression. All pattern combinations are tested in an exhaustive search.

Usage

seqVsInsitu(seq_signature, depth = 2, insitu = cellOrigins::BDGP_insitu_dmel_embryo,
  insitu_discovery_function = discovery.log, saturate = 500,
  prior = prior.temporal_proximity_is_good)

Arguments

seq_signature

A named vector containing FPKM RNAseq data. Element names must match the transcript names (row names) used in the insitu argument. NAs are permitted.

depth

Number of RNA in situ expression patterns to combine to identify mixed populations. If 1, the expression patterns are used as given. Otherwise, all combinations of depth expression patterns are tried. Each term combined with itself is also tested, i.e. pure populations will still be identified if depth > 1. Defaults to 2. With seqVsInsitu, depths > 2 can be slow; iterating_seqVsInsitu is much faster in these cases.

insitu

Matrix with RNA in situ hybridisation results. Rows are transcript names (the same names as used for seq_signature) and columns are anatomical terms (possibly combined with developmental stages). 1 denotes staining of a particular transcript in a particular tissue, 0 denotes no staining. Defaults to BDGP_insitu_dmel_embryo, a staining dataset for Drosophila melanogaster embryos.

insitu_discovery_function

A function that converts FPKM values to the probability of discovery by RNA in situ hybridisation. Probabilities must lie in the open interval ]0, 1[; the values 0 and 1 themselves are not permitted. Defaults to discovery.log, an approximation of empirically determined discovery probabilities. Other available functions are discovery.linear and discovery.identic.

saturate

Passed on to the insitu_discovery_function. The dataset-dependent maximum value at which the discovery probability should saturate. Defaults to 500 (FPKM).

prior

A function that returns the log2 prior probability of each anatomical term or combination of terms. Defaults to prior.temporal_proximity_is_good, which works well with BDGP_insitu_dmel_embryo. The alternative prior.all_equal assumes that all terms are equally probable.
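To illustrate the constraint on insitu_discovery_function, here is a minimal sketch of a function that maps FPKM values into the open interval between 0 and 1 and saturates at a maximum. It is a hypothetical stand-in, not the package's discovery.log; the name toy_discovery, the argument names fpkm and eps, and the exact curve are assumptions made for illustration. Whether such a function can be passed to seqVsInsitu unchanged depends on the calling convention the package expects, which is not described here.

```r
# Hypothetical stand-in for an insitu_discovery_function.
# NOT the package's discovery.log; the shape of the curve is an assumption.
toy_discovery <- function(fpkm, saturate = 500, eps = 0.01) {
  # Log-scale the expression and cap it at the saturation value.
  scaled <- log2(pmin(fpkm, saturate) + 1) / log2(saturate + 1)
  # Squeeze into [eps, 1 - eps] so that exactly 0 and exactly 1 never occur.
  eps + (1 - 2 * eps) * scaled
}

# Values at or above `saturate` all map to the same probability.
p <- toy_discovery(c(0, 10, 500, 5000))
```

The eps margin is the design choice that matters here: without it, an FPKM of 0 would give a probability of exactly 0, which the argument description rules out.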

Details

First, the function calculates, for each sequenced transcript, how likely it is to produce an RNA in situ signal given its expression strength. Using these staining probabilities and Bayes' rule, the function then calculates, for each of the given RNA in situ hybridisation patterns, a probability score that the pattern was produced by the same gene expression pattern as the sequenced transcriptome.
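The scoring step can be sketched on toy data as follows. This is one plausible reading of the description, not the package's actual implementation: it assumes that genes contribute independently, that a stained gene contributes log2(p) and an unstained gene contributes log2(1 - p), and that the prior is flat. All gene names, tissue names and probabilities are made up.

```r
# Toy staining probabilities for four transcripts (hypothetical values),
# as they might come out of a discovery function applied to FPKM data.
p_stain <- c(geneA = 0.9, geneB = 0.8, geneC = 0.1, geneD = 0.2)

# Toy binary in situ matrix: rows are transcripts, columns are anatomical terms.
insitu <- cbind(
  tissueX = c(geneA = 1, geneB = 1, geneC = 0, geneD = 0),
  tissueY = c(geneA = 0, geneB = 0, geneC = 1, geneD = 1)
)

# log2 likelihood contributions, split into stained and unstained genes.
# The vector is recycled down each column, so row i is paired with gene i.
from_presence <- colSums(insitu * log2(p_stain))
from_absence  <- colSums((1 - insitu) * log2(1 - p_stain))

# Flat log2 prior (an assumption), so the ranking is driven by the likelihood.
prior <- rep(log2(1 / ncol(insitu)), ncol(insitu))

# Bayes' rule in log space: the posterior score is the sum of the three parts.
posterior <- prior + from_presence + from_absence
ranked <- sort(posterior, decreasing = TRUE)
```

In this toy case tissueX ranks first, because the strongly expressed genes are the ones that stain there.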

If depth > 1, the function also identifies the origins of sequenced material that is not pure. To do so, it merges multiple RNA in situ hybridisation patterns before comparing them with the sequenced data, which simulates the outcome of mixing cell populations.
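A minimal sketch of how patterns might be merged at depth 2, assuming that merging means an element-wise OR (a gene stains in the mixture if it stains in either term). The package's actual merging rule is not spelled out here, so treat this as an illustration only; all names and values are invented.

```r
# Toy binary in situ matrix with three anatomical terms.
insitu <- cbind(
  tissueX = c(geneA = 1, geneB = 1, geneC = 0, geneD = 0),
  tissueY = c(geneA = 0, geneB = 1, geneC = 1, geneD = 0),
  tissueZ = c(geneA = 0, geneB = 0, geneC = 0, geneD = 1)
)
terms <- colnames(insitu)

# All unordered pairs, including each term paired with itself,
# so that pure populations are still among the candidates.
pairs <- expand.grid(first = terms, second = terms, stringsAsFactors = FALSE)
pairs <- pairs[match(pairs$first, terms) <= match(pairs$second, terms), ]

# Merge each pair with an element-wise OR (assumed merging rule).
merged <- mapply(
  function(a, b) pmax(insitu[, a], insitu[, b]),
  pairs$first, pairs$second
)
colnames(merged) <- paste(pairs$first, pairs$second, sep = " + ")
```

With n terms this yields n(n + 1)/2 candidates at depth 2, and the count grows quickly with depth, which is consistent with the warning that larger depths can be slow.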

seq_signature is best generated by taking the mean coverage of the regions which are actually tested with the RNA in situ hybridisation probes. This circumvents problems from misannotation, overlapping transcripts and faulty quantitation of individual transcripts from sequencing data. A protocol for generating such datasets is given in the package reference.

Value

A matrix with a row for each anatomical term (or combination of terms) and at least four columns. The terms are sorted by the posterior value and the top term is the most likely source of the RNAseq transcriptome.

posterior

A log2 posterior probability score. The highest value is given to the most likely tissue of origin. The value is only meaningful in comparison with other values within the same result set.

prior

Prior probability of the anatomical term(s), as given by the function prior.

likelihood.from.absence.insitu

Probability score from all the genes where RNA in situ hybridisation did not report staining.

likelihood.from.presence.insitu

Probability score from all the genes where in situ hybridisation reported staining.

remaining columns

Number of additional expressed genes added to the in situ signature with each term in the tested combination. Sometimes additional terms add only very few or no new genes at all. Such tissue contributions are meaningless artefacts.

The posterior column is the sum of the other three named columns. The scores are proportional to the (unknown) probabilities of identity.
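As an illustration of the remaining columns, the following sketch counts how many genes each term in a two-term combination adds to the merged signature. The counting rule (genes stained in the second term but not already stained in the first) is an assumed interpretation of the description above, and the patterns are invented.

```r
# Two toy binary staining patterns over the same four transcripts.
first  <- c(geneA = 1, geneB = 1, geneC = 0, geneD = 0)
second <- c(geneA = 0, geneB = 1, geneC = 1, geneD = 0)

# Genes contributed by the first term.
added_by_first <- sum(first == 1)

# Genes newly contributed by the second term, i.e. not already stained.
added_by_second <- sum(second == 1 & first == 0)
```

Here the second term adds only one new gene, so a combination like this deserves scepticism before it is read as a genuine mixture.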

See Also

iterating_seqVsInsitu, BDGP_insitu_dmel_embryo, discovery.log, discovery.linear, discovery.identic, prior.temporal_proximity_is_good, prior.all_equal, diagnosticPlots.

Examples

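# Locate the example coverage table shipped with the package
fpath <- system.file("extdata", "vncMedianCoverage.tsv", package="cellOrigins")
vncExpression <- read.delim(file = fpath, header=FALSE, as.is=TRUE)

# Build a named vector: values from column 2, transcript names from column 1
expression <- vncExpression$V2
names(expression) <- vncExpression$V1

# Score each single in situ pattern (depth = 1, no mixtures)
result <- seqVsInsitu(expression, depth=1)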

[Package cellOrigins version 0.1.3 Index]