autoEstCont {SoupX} | R Documentation |
Automatically calculate the contamination fraction
Description
The idea of this method is that genes that are highly expressed in the soup and are marker genes for some population can be used to estimate the background contamination. Marker genes are identified using the tfidf method (see quickMarkers). The contamination fraction is then calculated at the cluster level for each of these genes and clusters are then aggressively pruned to remove those that give implausible estimates.
Usage
autoEstCont(
sc,
topMarkers = NULL,
tfidfMin = 1,
soupQuantile = 0.9,
maxMarkers = 100,
contaminationRange = c(0.01, 0.8),
rhoMaxFDR = 0.2,
priorRho = 0.05,
priorRhoStdDev = 0.1,
doPlot = TRUE,
forceAccept = FALSE,
verbose = TRUE
)
Arguments
sc |
The SoupChannel object. |
topMarkers |
A data.frame giving marker genes. Must be sorted by decreasing specificity of marker and include a column 'gene' that contains the gene name. If set to NULL, markers are estimated using quickMarkers. |
tfidfMin |
Minimum value of tfidf to accept for a marker gene. |
soupQuantile |
Only use genes that are at or above this expression quantile in the soup. This prevents inaccurate estimates due to using genes with poorly constrained contribution to the background. |
maxMarkers |
If we have heaps of good markers, keep only the best maxMarkers of them. |
contaminationRange |
Vector of length 2 that constrains the contamination fraction to lie within this range. Must be between 0 and 1. The high end of this range is passed to estimateNonExpressingCells as maximumContamination. |
rhoMaxFDR |
False discovery rate passed to estimateNonExpressingCells. |
priorRho |
Mode of gamma distribution prior on contamination fraction. |
priorRhoStdDev |
Standard deviation of gamma distribution prior on contamination fraction. |
doPlot |
Create a plot showing the density of estimates? |
forceAccept |
Passed to setContaminationFraction. Allows very high contamination fractions to be used. |
verbose |
Be verbose? |
Details
This set of marker genes is filtered to include only those with tf-idf value greater than tfidfMin. A higher tf-idf value implies a more specific marker. Specifically a cut-off t implies that a marker gene has the property that geneFreqGlobal < exp(-t/geneFreqInClust). See quickMarkers. It may be necessary to decrease this value for data sets with few good markers.
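To make the cut-off concrete, here is a minimal sketch in base R. It assumes tf-idf is computed as geneFreqInClust * log(1/geneFreqGlobal), which is the definition consistent with the inequality above; the gene names and frequencies are invented for illustration and this is not the quickMarkers implementation.

```r
# Hypothetical per-gene frequencies: fraction of cells expressing each gene
# inside one cluster and across all cells (values invented for illustration).
markers <- data.frame(
  gene            = c("geneA", "geneB", "geneC"),
  geneFreqInClust = c(0.90, 0.60, 0.30),
  geneFreqGlobal  = c(0.10, 0.30, 0.25)
)

# Assumed definition: term frequency times log inverse document frequency.
markers$tfidf <- markers$geneFreqInClust * log(1 / markers$geneFreqGlobal)

tfidfMin <- 1
keep <- markers$tfidf > tfidfMin

# Equivalent form of the cut-off, written as a bound on the global frequency.
keepAlt <- markers$geneFreqGlobal < exp(-tfidfMin / markers$geneFreqInClust)

# Both forms select the same genes.
stopifnot(identical(keep, keepAlt))
markers[keep, ]
```

Lowering tfidfMin in this sketch admits less specific genes, which mirrors the advice for data sets with few good markers.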
The remaining marker genes are then filtered down to include only the genes that are highly expressed in the soup, controlled by the soupQuantile parameter. Genes highly expressed in the soup provide a more precise estimate of the contamination fraction.
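The soup-expression filter can be sketched as a quantile cut on a soup profile. The vector soupProfile, the gene names, and the choice to take the quantile over all genes are assumptions made for illustration; the values are invented.

```r
# Hypothetical soup profile: fraction of background expression attributed to
# each gene (values invented for illustration).
soupProfile <- c(geneA = 0.200, geneB = 0.001, geneC = 0.050,
                 geneD = 0.0005, geneE = 0.300, geneF = 0.010)

# Hypothetical candidate markers that survived the tf-idf filter.
candidateMarkers <- c("geneA", "geneB", "geneC", "geneE")

soupQuantile <- 0.5
# Soup expression level at the requested quantile, taken over all genes.
cutOff <- quantile(soupProfile, soupQuantile)

# Keep only candidate markers at or above that level.
keptMarkers <- candidateMarkers[soupProfile[candidateMarkers] >= cutOff]
keptMarkers
```

A low quantile is used here only so that the toy data retains several genes; the default of 0.9 is far stricter.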
The pruning of implausible clusters is based on a call to estimateNonExpressingCells. The parameters maximumContamination=max(contaminationRange) and rhoMaxFDR are passed to this function. The defaults set here are calibrated to aggressively prune anything that has even the weakest of evidence that it is genuinely expressed.
For each cluster/gene pair the posterior distribution of the contamination fraction is calculated (based on a gamma prior, controlled by priorRho and priorRhoStdDev). These posterior distributions are aggregated to produce a final estimate of the contamination fraction. The logic behind this is that estimates from clusters that truly estimate the contamination fraction will cluster around the true value, while erroneous estimates will be spread out across the range (0,1) without a 'preferred value'. The most probable value of the contamination fraction is then taken as the final global contamination fraction.
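The aggregation step can be illustrated with a small base R sketch. It assumes a Poisson likelihood for the observed counts (which is conjugate to the gamma prior) and aggregates by averaging the posterior densities on a grid; the helper gammaFromModeSd, the count vectors, and the grid are hypothetical, and this is not presented as the exact computation performed by autoEstCont.

```r
# Hypothetical helper: convert a gamma prior given by its mode and standard
# deviation into shape and rate parameters.
gammaFromModeSd <- function(mode, sd) {
  v <- (sd / mode)^2
  shape <- 1 + (1 + sqrt(1 + 4 * v)) / (2 * v)
  rate <- (shape - 1) / mode
  list(shape = shape, rate = rate)
}

# Prior matching the defaults priorRho = 0.05 and priorRhoStdDev = 0.1.
prior <- gammaFromModeSd(mode = 0.05, sd = 0.1)

# Invented cluster/gene pairs. expCnt is the count expected if the cluster
# were pure soup (rho = 1); obsCnt is the observed count. The first four
# pairs are consistent with rho near 0.1; the last two mimic genes that are
# genuinely expressed, so their estimates are scattered.
expCnt <- c(2000, 1500, 3000, 1000, 800, 1200)
obsCnt <- c(205, 148, 310, 95, 400, 900)

# Assumed model: obsCnt ~ Poisson(rho * expCnt), giving a gamma posterior
# for each pair, evaluated on a grid of rho values.
rhoGrid <- seq(0.001, 1, by = 0.001)
postDens <- sapply(seq_along(obsCnt), function(i) {
  dgamma(rhoGrid, shape = prior$shape + obsCnt[i], rate = prior$rate + expCnt[i])
})

# Aggregate by averaging the densities, then take the most probable value.
aggDens <- rowMeans(postDens)
rhoEst <- rhoGrid[which.max(aggDens)]
rhoEst
```

In this toy data the four consistent pairs reinforce each other, while the two scattered pairs each contribute an isolated, lower peak, so the maximum of the aggregate lands near the shared value.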
Value
A modified SoupChannel object where the global contamination rate has been set. Information about the estimation is also stored in the slot fit.
Note
This function assumes that the channel contains multiple distinct cell types with different marker genes. If you try to run it on a channel with very homogeneous cells (e.g. a cell line, flow-sorted cells), you will likely get a warning, an error, and/or an extremely high contamination estimate. In such circumstances your best option is usually to manually set the contamination to something reasonable.
See Also
quickMarkers
Examples
#Use less specific markers
scToy = autoEstCont(scToy,tfidfMin=0.8)
#Allow large contamination fractions to be allocated
scToy = autoEstCont(scToy,forceAccept=TRUE)
#Be quiet
scToy = autoEstCont(scToy,verbose=FALSE,doPlot=FALSE)