R: Determine the most likely source(s) of a tissue-specific...

seqVsInsitu {cellOrigins}

R Documentation

Determine the most likely source(s) of a tissue-specific RNAseq dataset

Description

Compares tissue-specific RNA sequencing coverage with high-throughput RNA in situ hybridisation patterns of gene expression. All pattern combinations are tested in an exhaustive search.

Usage

seqVsInsitu(seq_signature, depth = 2, insitu = cellOrigins::BDGP_insitu_dmel_embryo,
  insitu_discovery_function = discovery.log, saturate = 500,
  prior = prior.temporal_proximity_is_good)

Arguments

`seq_signature`	A named vector containing FPKM RNAseq data. Each element name must correspond to the names used in the `insitu` argument. NAs are permitted.
`depth`	Number of RNA in situ expression patterns to combine when identifying mixed populations. If 1, the expression patterns are used as given. Otherwise all combinations of `depth` expression patterns are tried. Each term is also combined with itself, i.e. pure populations will still be identified if depth > 1. Defaults to 2. `seqVsInsitu` can be slow at depths > 2; `iterating_seqVsInsitu` is much faster in these cases.
`insitu`	Matrix with RNA in situ hybridisation results. Rows are transcript names (the same names as used for `seq_signature`) and columns are anatomical terms (possibly combined with developmental stages). 1 denotes staining of a particular transcript in a particular tissue, 0 denotes no staining. Defaults to `BDGP_insitu_dmel_embryo`, a staining dataset for Drosophila melanogaster embryos.
`insitu_discovery_function`	A function that converts FPKM values to the probability of discovery by RNA in situ hybridisation. Probabilities must lie in the open interval ]0, 1[; the values 0 and 1 are not permitted. Defaults to `discovery.log`, an approximation of empirically determined discovery probabilities. Other available functions are `discovery.linear` and `discovery.identic`.
`saturate`	Passed on to the `insitu_discovery_function`: the data-set-dependent maximum value at which the discovery probability saturates. Defaults to 500 (FPKM).
`prior`	A function that returns the log2 prior probability of each anatomical term or combination of terms. Defaults to `prior.temporal_proximity_is_good`, which works well with `BDGP_insitu_dmel_embryo`. The alternative `prior.all_equal` assumes that all terms are equally probable.
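
The input shapes described above can be sketched with toy data. The gene names (`geneA`, `geneX`, ...) and anatomical terms (`termP`, `termQ`) below are hypothetical placeholders, not entries of `BDGP_insitu_dmel_embryo`, and the sketch only checks how names line up; it does not show how the package itself treats names without a counterpart.

```r
# Toy FPKM signature: a named numeric vector, NAs are permitted.
seq_signature <- c(geneA = 12.5, geneB = NA, geneC = 310, geneX = 4)

# Toy in situ matrix: rows are transcripts, columns are anatomical terms,
# 1 = staining reported, 0 = no staining reported.
insitu <- matrix(
  c(1, 0,
    0, 1,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("termP", "termQ"))
)

# Names present in both inputs, and signature names with no matrix row.
shared    <- intersect(names(seq_signature), rownames(insitu))
unmatched <- setdiff(names(seq_signature), rownames(insitu))

# The matrix should contain only 0 and 1.
stopifnot(all(insitu %in% c(0, 1)))
```

Running such a check before calling `seqVsInsitu` makes naming mismatches between the sequencing data and the staining matrix visible early.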

Details

First, the function calculates for each sequenced transcript how likely it is to produce an RNA in situ signal, given its expression strength. Using these staining probabilities and Bayes' rule, the function then calculates, for each of the given RNA in situ hybridisation patterns, a probability score that it was produced by the same gene expression pattern as the sequenced transcriptome.
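
The scoring idea in the paragraph above can be illustrated with a minimal base R sketch. It is not the package's implementation: `toy_discovery` is a hypothetical stand-in for a discovery function, the clamping bounds and the uniform prior are arbitrary choices, and the assumption is that stained genes contribute log2(p) and unstained genes contribute log2(1 - p) to the score.

```r
# Hypothetical discovery function (NOT the package's discovery.log):
# maps FPKM to a staining probability strictly inside ]0, 1[.
toy_discovery <- function(fpkm, saturate = 500) {
  p <- log2(pmin(fpkm, saturate) + 1) / log2(saturate + 1)
  pmin(pmax(p, 0.01), 0.99)  # clamp away from 0 and 1
}

# Assumed scoring rule: sum log2 probabilities over stained and unstained genes.
score_pattern <- function(pattern, p_stain, log2_prior) {
  presence <- sum(log2(p_stain[pattern == 1]))
  absence  <- sum(log2(1 - p_stain[pattern == 0]))
  c(posterior = log2_prior + absence + presence,
    prior = log2_prior,
    likelihood.from.absence.insitu = absence,
    likelihood.from.presence.insitu = presence)
}

# Toy data with hypothetical gene names.
fpkm    <- c(geneA = 400, geneB = 0, geneC = 50)
p_stain <- toy_discovery(fpkm)

matching    <- c(geneA = 1, geneB = 0, geneC = 1)  # stains where expression is high
mismatching <- c(geneA = 0, geneB = 1, geneC = 0)  # stains where expression is absent

log2_prior <- log2(1 / 2)  # two candidate patterns, assumed equally probable
score_match    <- score_pattern(matching, p_stain, log2_prior)
score_mismatch <- score_pattern(mismatching, p_stain, log2_prior)
```

In this toy case the pattern that stains the strongly expressed genes receives the higher log2 score, which mirrors how the top-ranked term is interpreted as the most likely source.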

If depth > 1, the function can also identify the origins of sequenced material that is not pure. To do so, it merges multiple RNA in situ hybridisation patterns before comparing them with the sequenced data, which simulates the outcome of mixing cell populations.
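
A rough sketch of the merging step for depth = 2, in base R. It assumes that two binary patterns are merged by an element-wise union (a gene counts as stained if either term stains it); the package may combine patterns differently. The gene and term names are hypothetical.

```r
# Toy in situ matrix with hypothetical genes (rows) and terms (columns).
insitu <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    1, 1, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("brain", "gut", "muscle"))
)
terms <- colnames(insitu)

# All unordered pairs of terms, including each term paired with itself,
# so that pure populations remain among the candidates.
pairs <- expand.grid(i = seq_along(terms), j = seq_along(terms))
pairs <- pairs[pairs$i <= pairs$j, ]

# Assumed merge rule: element-wise maximum, i.e. the union of stained genes.
merged <- mapply(function(i, j) pmax(insitu[, i], insitu[, j]),
                 pairs$i, pairs$j)
rownames(merged) <- rownames(insitu)
colnames(merged) <- paste(terms[pairs$i], terms[pairs$j], sep = "+")
```

With three terms this yields six candidate patterns. The number of candidates grows quickly with `depth`, which is consistent with the note that exhaustive searches at depths > 2 can be slow.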

`seq_signature` is best generated by taking the mean coverage of the regions which are actually tested with the RNA in situ hybridisation probes. This circumvents problems from misannotation, overlapping transcripts and faulty quantitation of individual transcripts from sequencing data. A protocol for generating such datasets is given in the package reference.

Value

A matrix with a row for each anatomical term (or combination of terms) and at least four columns. The terms are sorted by decreasing posterior value, so the top term is the most likely source of the RNAseq transcriptome.

`posterior`	A log2 posterior probability score. The highest value is given to the most likely tissue of origin. The value is only meaningful in comparison with other values within the same result set.
`prior`	Prior probability of the anatomical term(s), as given by the function `prior`.
`likelihood.from.absence.insitu`	Probability score from all the genes where RNA in situ hybridisation did not report staining.
`likelihood.from.presence.insitu`	Probability score from all the genes where in situ hybridisation reported staining.
`remaining columns`	Number of additional expressed genes that each term in the tested combination adds to the in situ signature. Sometimes an additional term adds very few or no new genes; such tissue contributions are meaningless artefacts.

The posterior column is the sum of the other three named columns. The scores are proportional to the (unknown) probabilities of identity.

Examples

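# Locate the example coverage table shipped with the package
fpath <- system.file("extdata", "vncMedianCoverage.tsv", package="cellOrigins")
vncExpression <- read.delim(file = fpath, header=FALSE, as.is=TRUE)

# Build the named expression vector: names from column V1, values from column V2
expression <- vncExpression$V2
names(expression) <- vncExpression$V1

# depth = 1 tests each in situ pattern on its own (no mixed populations)
result <- seqVsInsitu(expression, depth=1)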

[Package cellOrigins version 0.1.3 Index]