plotMarkerDistribution {SoupX}

R Documentation

Plots the distribution of the observed to expected expression for marker genes

Description

If each cell were made up purely of background reads, its expression fraction for any gene would equal that of the soup. This plot compares that pure-background expectation with the observed expression fraction in each cell, for each of the groups of genes in nonExpressedGeneList. For each group of genes, the distribution of the observed to expected ratio is plotted across all cells. A value significantly greater than 1 (0 on the log scale) can only be obtained if some of the genes in the group are genuinely expressed by the cell. That is, the assumption that the cell is pure background does not hold for that group of genes.
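
The ratio described above can be illustrated with a small base R sketch. It does not call SoupX, and it is not the package's internal code; the gene group names, soup fractions, and counts are all invented for the illustration.

```r
# Toy illustration in base R; every name and number here is made up.
soupFrac <- c(HB = 0.02, IG = 0.01)                     # share of soup expression per gene group
nUMIs    <- c(cellA = 1000, cellB = 2000, cellC = 500)  # total counts per cell

# Observed counts of each gene group in each cell (groups x cells).
obs <- rbind(HB = c(2, 400, 0),
             IG = c(1, 2, 50))
colnames(obs) <- names(nUMIs)

# Counts expected if every cell were pure background.
expected <- outer(soupFrac, nUMIs)

# Observed to expected ratio, and the same on a log10 scale.
ratio <- obs / expected
log10(ratio)
```

In this toy data, cellB has an HB ratio well above 1, so it cannot be pure background for HB, while cellC has no HB counts at all and therefore a log ratio of -Inf.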

Usage

plotMarkerDistribution(
  sc,
  nonExpressedGeneList,
  maxCells = 150,
  tfidfMin = 1,
  ...
)

Arguments

`sc`	A SoupChannel object.
`nonExpressedGeneList`	Which sets of genes to use to estimate soup (see `calculateContaminationFraction`).
`maxCells`	Randomly plot only this many cells to prevent over-crowding.
`tfidfMin`	Minimum specificity cut-off used if finding marker genes (see `quickMarkers`).
`...`	Passed to `estimateNonExpressingCells`.

Details

This plot is a useful diagnostic for the assumption that a list of genes is non-expressed in most cell types. For cells that do not express the genes, the ratio should cluster around the contamination fraction, while for cells that do express them it should be elevated. The most useful non-expressed gene sets are those whose genes are either strongly expressed or not expressed at all. Such groups of genes show up in this plot as a bimodal distribution: one mode, around the contamination fraction for this channel, contains the cells that do not express these genes, and another, at a log ratio equal to or greater than 0 (a ratio of 1 on the non-log scale), contains the cells that do.
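
As a rough illustration of this bimodality, the following base R simulation draws counts for a mix of non-expressing and expressing cells. The contamination fraction, cell numbers, count levels, and Poisson model are assumptions chosen for the sketch, not values or methods taken from SoupX.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

rho      <- 0.1                   # assumed contamination fraction
expected <- rep(50, 200)          # counts expected per cell under pure background
isExpr   <- rep(c(FALSE, TRUE), times = c(150, 50))

# Non-expressing cells only pick up a fraction rho of the pure-background
# expectation; expressing cells have far more counts than expected.
obs <- ifelse(isExpr,
              rpois(200, lambda = 500),
              rpois(200, lambda = rho * expected))

logRatio <- log10(obs / expected)

# Median log10 ratio for non-expressing (FALSE) and expressing (TRUE) cells.
tapply(logRatio, isExpr, median)
```

The non-expressing cells should centre near log10(rho), well below 0, and the expressing cells above 0, which is the two-mode pattern a good gene set produces.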

The red line shows the global estimate of the contamination for each group of markers. This is usually lower than the low mode of the distribution as there will typically be a non-negligible number of cells with 0 observed counts (and hence -infinity log ratio).
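
The effect of zero-count cells can be seen in a toy example. The sketch below assumes the global estimate behaves like a pooled ratio of total observed to total expected counts; that is a simplification made for illustration, not a description of how SoupX computes the red line, and the counts are invented.

```r
expected <- rep(2, 10)                        # counts expected per cell under pure background
obs      <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)   # most cells show no counts at all

perCell <- log10(obs / expected)              # -Inf for the six zero-count cells
pooled  <- log10(sum(obs) / sum(expected))    # pooled estimate over all ten cells

median(perCell[is.finite(perCell)])  # where the visible low mode sits
pooled                               # lower, because the zero-count cells still contribute
```

Only the four cells with a finite log ratio can be drawn, so the visible mode sits at log10(0.5), whereas the pooled value is log10(0.2).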

If nonExpressedGeneList is missing, this function will try to find genes that are very specific to different clusters, as these are often the most useful in estimating the contamination fraction. This is meant only as a heuristic, which can hopefully provide some inspiration as to a class of genes to use to estimate the contamination for your experiment. Please do **NOT** blindly use the top N genes found in this way to estimate the contamination. That is, do not feed this list of genes into calculateContaminationFraction without any manual consideration or filtering, as this *will over-estimate your contamination* (often by a large amount). For this reason, these gene names are not returned by the function.

Value

A ggplot2 object containing the plot.

Examples

library(SoupX)
gg = plotMarkerDistribution(scToy, list(CD7 = 'CD7', LTB = 'LTB'))

[Package SoupX version 1.6.2 Index]