autoEstCont {SoupX}    R Documentation

Automatically calculate the contamination fraction

Description

The idea of this method is that genes that are highly expressed in the soup and are marker genes for some population can be used to estimate the background contamination. Marker genes are identified using the tf-idf method (see quickMarkers). The contamination fraction is then calculated at the cluster level for each of these genes, and clusters are aggressively pruned to remove those that give implausible estimates.

Usage

autoEstCont(
  sc,
  topMarkers = NULL,
  tfidfMin = 1,
  soupQuantile = 0.9,
  maxMarkers = 100,
  contaminationRange = c(0.01, 0.8),
  rhoMaxFDR = 0.2,
  priorRho = 0.05,
  priorRhoStdDev = 0.1,
  doPlot = TRUE,
  forceAccept = FALSE,
  verbose = TRUE
)

Arguments

sc

The SoupChannel object.

topMarkers

A data.frame giving marker genes. Must be sorted by decreasing specificity of marker and include a column 'gene' that contains the gene name. If set to NULL, markers are estimated using quickMarkers.

tfidfMin

Minimum value of tfidf to accept for a marker gene.

soupQuantile

Only use genes that are at or above this expression quantile in the soup. This prevents inaccurate estimates due to using genes with poorly constrained contribution to the background.

maxMarkers

If we have heaps of good markers, keep only the best maxMarkers of them.

contaminationRange

Vector of length 2 that constrains the contamination fraction to lie within this range. Must be between 0 and 1. The high end of this range is passed to estimateNonExpressingCells as maximumContamination.

rhoMaxFDR

False discovery rate passed to estimateNonExpressingCells, to test if rho is less than maximumContamination.

priorRho

Mode of gamma distribution prior on contamination fraction.

priorRhoStdDev

Standard deviation of gamma distribution prior on contamination fraction.

doPlot

Create a plot showing the density of estimates?

forceAccept

Passed to setContaminationFraction. Should very high contamination fractions be allowed?

verbose

Be verbose?

Details

The set of marker genes is filtered to include only those with a tf-idf value greater than tfidfMin. A higher tf-idf value implies a more specific marker. Specifically, a cut-off t implies that a marker gene has the property that geneFreqGlobal < exp(-t/geneFreqInClust). See quickMarkers. It may be necessary to decrease tfidfMin for data sets with few good markers.
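
The cut-off can be checked numerically. The following is a minimal sketch, assuming tf-idf takes the form geneFreqInClust * log(1/geneFreqGlobal), which is the form implied by the inequality above. The frequencies are made-up illustrative values, not output from quickMarkers.

```r
# Hypothetical marker frequencies (illustrative values, not real data).
geneFreqInClust = c(0.95, 0.60, 0.90)  # fraction of cells in the cluster expressing the gene
geneFreqGlobal  = c(0.10, 0.05, 0.50)  # fraction of all cells expressing the gene

# Assumed tf-idf form, consistent with the inequality quoted above.
tfidf = geneFreqInClust * log(1 / geneFreqGlobal)

tfidfMin = 1
keep = tfidf > tfidfMin

# The same test rewritten as a bound on geneFreqGlobal.
keepAlt = geneFreqGlobal < exp(-tfidfMin / geneFreqInClust)

print(data.frame(geneFreqInClust, geneFreqGlobal, tfidf, keep, keepAlt))
```

Under this assumed form the two tests agree for every gene: a gene passes only if it is common in its cluster and rare overall.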

This set of marker genes is filtered down to include only the genes that are highly expressed in the soup, controlled by the soupQuantile parameter. Genes highly expressed in the soup provide a more precise estimate of the contamination fraction.
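
The soup-expression filter can be sketched in base R as follows. The soup profile and the gene names are hypothetical, and the code only illustrates the "at or above this quantile" rule; it is not the package's internal code.

```r
# Hypothetical soup profile: fraction of soup counts attributed to each gene.
soupEst = c(geneA = 0.30, geneB = 0.25, geneC = 0.20, geneD = 0.15,
            geneE = 0.05, geneF = 0.03, geneG = 0.01, geneH = 0.005,
            geneI = 0.003, geneJ = 0.002)

# Hypothetical candidate markers that survived the tf-idf filter.
candidateMarkers = c("geneA", "geneC", "geneF", "geneJ")

# Expression level at the requested quantile of the soup profile.
soupQuantile = 0.9
soupMin = unname(quantile(soupEst, soupQuantile))

# Keep only candidates expressed at or above that level in the soup.
keptMarkers = candidateMarkers[soupEst[candidateMarkers] >= soupMin]
print(keptMarkers)
```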

The pruning of implausible clusters is based on a call to estimateNonExpressingCells. The parameters maximumContamination=max(contaminationRange) and rhoMaxFDR are passed to this function. The defaults set here are calibrated to aggressively prune anything that has even the weakest of evidence that it is genuinely expressed.

For each cluster/gene pair, the posterior distribution of the contamination fraction is calculated (based on a gamma prior, controlled by priorRho and priorRhoStdDev). These posterior distributions are aggregated to produce a final estimate of the contamination fraction. The logic behind this is that estimates from clusters that truly estimate the contamination fraction will cluster around the true value, while erroneous estimates will be spread out across the range (0,1) without a 'preferred value'. The most probable value of the contamination fraction is then taken as the final global contamination fraction.
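
The aggregation logic can be illustrated with a small base R sketch. It rests on several assumptions that are not stated in this page: the prior is parameterised by shape k and scale theta, counts follow a Poisson model with mean rho times the expected soup counts, and posteriors are aggregated by summing their densities on a grid. The counts are invented, and the package's actual implementation may differ.

```r
priorRho = 0.05
priorRhoStdDev = 0.1

# Gamma(shape k, scale theta): mode = (k - 1) * theta, variance = k * theta^2.
# Solving these two equations for k and theta gives:
v2 = (priorRhoStdDev / priorRho)^2
k = 1 + (1 + sqrt(1 + 4 * v2)) / (2 * v2)
theta = priorRho / (k - 1)

# Hypothetical observed counts and expected counts at rho = 1, one per cluster/gene pair.
obsCnt = c(12, 9, 15, 40)
expCnt = c(110, 95, 160, 120)  # the last pair is a deliberately erroneous estimate

# Assumed model: obsCnt ~ Poisson(rho * expCnt). The gamma prior is conjugate, so each
# posterior is Gamma(shape = k + obsCnt, rate = 1/theta + expCnt).
rhoGrid = seq(0, 1, by = 0.001)
post = sapply(seq_along(obsCnt), function(i)
  dgamma(rhoGrid, shape = k + obsCnt[i], rate = 1 / theta + expCnt[i]))

# Aggregate by summing the densities and take the most probable value.
rhoEst = rhoGrid[which.max(rowSums(post))]
print(rhoEst)
```

In this toy example the three consistent pairs reinforce each other, so the peak of the summed density sits near their shared value, while the outlying pair contributes only a separate, lower bump.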

Value

A modified SoupChannel object where the global contamination rate has been set. Information about the estimation is also stored in the slot fit.

Note

This function assumes that the channel contains multiple distinct cell types with different marker genes. If you try to run it on a channel with very homogeneous cells (e.g. a cell line or flow-sorted cells), you will likely get a warning, an error, and/or an extremely high contamination estimate. In such circumstances your best option is usually to set the contamination fraction manually to something reasonable.
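
A manual setting can be applied with setContaminationFraction. The sketch below uses the scToy object from the Examples section, and the value 0.1 is an arbitrary illustration, not a recommendation.

```r
library(SoupX)
# Set a global contamination fraction by hand instead of estimating it.
# 0.1 is an illustrative value only; choose one appropriate for your data.
scToy = setContaminationFraction(scToy, 0.1)
```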

See Also

quickMarkers

Examples

library(SoupX)
# scToy is the toy SoupChannel used throughout these examples.
# Use less specific markers
scToy = autoEstCont(scToy, tfidfMin = 0.8)
# Allow large contamination fractions to be allocated
scToy = autoEstCont(scToy, forceAccept = TRUE)
# Be quiet
scToy = autoEstCont(scToy, verbose = FALSE, doPlot = FALSE)

[Package SoupX version 1.6.2 Index]