R: Classify cells on previously defined rules

classify {cellpypes}

R Documentation

Classify cells on previously defined rules

Description

Classify cells on previously defined rules

Usage

classify(
  obj,
  classes = NULL,
  knn_refine = 0,
  replace_overlap_with = "Unassigned",
  return_logical_matrix = FALSE,
  overdispersion = 0.01
)

Arguments

`obj`	A cellpypes object, see section cellpypes Objects below.
`classes`	Character vector with one or more class names. If NULL (the default), the finest available cell types are classified (all classes that are not the parent of any other class).
`knn_refine`	Numeric between 0 and 1. If 0, do not refine labels obtained from UMI count pooling. If larger than 0 (recommended: 0.1), cellpypes will try to label unassigned cells by majority vote, see section knn_refine below.
`replace_overlap_with`	Character string, by default: `"Unassigned"`. See section Handling overlap.
`return_logical_matrix`	Logical. If TRUE, a logical matrix with classes in columns and cells in rows is returned instead of resolving overlaps with `replace_overlap_with`. If a single class is supplied, the matrix has exactly one column and can be piped into `drop` to convert it to a vector.
`overdispersion`	Defaults to 0.01; only change it if you know what you are doing. If set to 0, the negative binomial (NB) distribution simplifies to the Poisson distribution, and larger values give more variance. The 0.01 default follows the recommendation by Lause, Berens and Kobak (Genome Biology 2021) to use `size=100` in pnbinom for typical data sets.
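
The relation between `overdispersion` and the `size` argument of pnbinom can be illustrated with base R. This is only a sketch: it assumes that `overdispersion` corresponds to `1 / size` (which matches the 0.01 and `size=100` pairing above), and the mean `mu` and the count `2` are made-up values, not cellpypes internals.

```r
# Assumption: overdispersion = 1 / size in the NB parameterization of pnbinom.
overdispersion <- 0.01
size <- 1 / overdispersion            # 100, the value recommended for typical data

mu <- 5                               # hypothetical expected UMI count
nb_var <- mu + overdispersion * mu^2  # NB variance: mu + mu^2 / size
pois_var <- mu                        # Poisson variance equals the mean

# Probability of observing 2 or fewer UMIs under each model
p_nb <- pnbinom(2, size = size, mu = mu)
p_pois <- ppois(2, lambda = mu)

# With a very large size (overdispersion close to 0) the NB approaches the Poisson
p_nb_limit <- pnbinom(2, size = 1e8, mu = mu)
```

Here the NB variance (5.25) is slightly larger than the Poisson variance (5), so low counts are a little more probable under the NB model.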

Value

A factor with cell type labels or, if `return_logical_matrix = TRUE`, a logical matrix with classes in columns and cells in rows.

cellpypes Objects

A cellpypes object is a list with four slots:

raw

(sparse) matrix with genes in rows, cells in columns

totalUMI

the colSums of obj$raw

embed

two-dimensional embedding of the cells, provided as data.frame or tibble with two columns and one row per cell.

neighbors

index matrix with one row per cell and k columns, where k is the number of nearest neighbors (we recommend 15<k<100, e.g. k=50). Here are two ways to get the neighbors index matrix:

Use find_knn(featureMatrix)$idx, where featureMatrix could be principal components, latent variables or normalized genes (features in rows, cells in columns).
Use as(seurat@graphs[["RNA_nn"]], "dgCMatrix") > .1 to extract the kNN graph computed on RNA. The > .1 ensures this also works with RNA_snn, wknn/wsnn or any other available graph; check with names(seurat@graphs).
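
The four slots described above can be assembled by hand, as in the following base R sketch on toy data. The simulated counts, the random embedding and the brute-force neighbor search are all illustrative assumptions; in practice find_knn or a Seurat graph would supply the neighbors. Whether a cell should be listed among its own neighbors is also an assumption of this sketch.

```r
set.seed(1)
n_genes <- 5
n_cells <- 20
k <- 3  # tiny k for illustration only; the recommendation above is 15 < k < 100

# raw: genes in rows, cells in columns (dense here; a sparse matrix is typical)
raw <- matrix(
  rpois(n_genes * n_cells, lambda = 2),
  nrow = n_genes,
  dimnames = list(paste0("gene", 1:n_genes), paste0("cell", 1:n_cells))
)

# embed: two columns, one row per cell (random stand-in for UMAP or t-SNE)
embed <- data.frame(u1 = rnorm(n_cells), u2 = rnorm(n_cells))

# neighbors: brute-force kNN on the embedding, one row per cell and k columns.
# order() puts each cell first in its own row because its self-distance is 0.
d <- as.matrix(dist(embed))
neighbors <- unname(t(apply(d, 1, function(x) order(x)[1:k])))

obj <- list(
  raw = raw,
  totalUMI = colSums(raw),
  embed = embed,
  neighbors = neighbors
)
```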

Handling overlap

Overlap denotes all cells for which rules from multiple classes apply, and these cells will be labeled as Unassigned by default. If you are in fact interested in where the overlap is, set return_logical_matrix=TRUE and inspect the result. Note that it matters whether you call classify("Tcell") or classify(c("Tcell","Bcell")): any existing overlap between T and B cells is labeled as Unassigned in the second call, but not in the first.

Replacing overlap happens only between mutually exclusive labels (such as Tcell and Bcell), but not within a lineage. For example, overlap is NOT replaced between child (PD1+Ttox) and parent (Ttox) or any other ancestor (Tcell); instead, the most detailed cell type (PD1+Ttox) is returned.

All of the above is also true for plot_classes, as it wraps classify.
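
To make the overlap rule for mutually exclusive classes concrete, the following base R sketch resolves a small hand-made logical matrix of the kind returned with return_logical_matrix=TRUE. It is not the package's implementation: the matrix values are invented, and the sketch ignores lineages (parent and child classes).

```r
# Hypothetical logical matrix: cells in rows, classes in columns
m <- cbind(
  Tcell = c(TRUE, FALSE, TRUE, FALSE),
  Bcell = c(FALSE, TRUE, TRUE, FALSE)
)
rownames(m) <- paste0("cell", 1:4)

replace_overlap_with <- "Unassigned"
n_hits <- rowSums(m)  # number of classes whose rules apply to each cell

# Start with everything unassigned, then label cells matched by exactly one class
labels <- rep("Unassigned", nrow(m))
single <- n_hits == 1
labels[single] <- colnames(m)[
  max.col(m[single, , drop = FALSE] * 1, ties.method = "first")
]

# Cells matched by several mutually exclusive classes are the overlap
labels[n_hits > 1] <- replace_overlap_with

labels <- factor(labels, levels = c(colnames(m), "Unassigned"))
```

In this toy matrix, cell3 satisfies both the Tcell and the Bcell rules and is therefore set to "Unassigned", while cell4 matches no class at all.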

knn_refine

With knn_refine > 0, cellpypes refines cell type labels with a kNN classifier.

By default, cellpypes only assigns cells to a class if all relevant rules apply. In other words, all marker gene UMI counts in the cell's neighborhood have to be clearly above/below their threshold. Since UMI counts are sparse (even after neighbor pooling done by cellpypes), this can leave many cells unassigned.

It is reasonable to assume an unassigned cell is of the same cell type as the majority of its nearest neighbors. Therefore, cellpypes implements a kNN classifier to further refine labels obtained by manually thresholding UMI counts. knn_refine = 0.3 means a cell is assigned the class label held by most of its neighbors unless no class gets more than 30 %. If most neighbors are unassigned, the cell will also be set to "Unassigned". Choosing knn_refine = 0.3 gives results reminiscent of clustering (which assigns all cells), while knn_refine = 0.5 leaves cells 'in between' two similar cell types unassigned.

We recommend looking at knn_refine = 0 first as it's faster and more directly tied to marker gene expression. If assigning all cells is desired, we recommend knn_refine = 0.3 or lower, while knn_refine = 0.5 makes cell types more 'crisp' by setting cells 'in between' related subtypes to "Unassigned".
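
The majority vote described in this section can be sketched in a few lines of base R. This is an illustration of the idea under stated assumptions, not the cellpypes implementation: the function name refine_labels, the toy labels and the toy neighbor matrix are hypothetical, and the sketch assumes that only cells labeled "Unassigned" are re-labeled and that votes are counted over all neighbors.

```r
# Hypothetical helper: majority vote among the neighbors of unassigned cells
refine_labels <- function(labels, neighbors, knn_refine = 0.3) {
  refined <- labels
  for (i in which(labels == "Unassigned")) {
    votes <- table(labels[neighbors[i, ]])   # label counts among neighbors
    winner <- names(votes)[which.max(votes)]
    share <- max(votes) / ncol(neighbors)    # fraction held by the top label
    # Keep "Unassigned" unless the top label exceeds the threshold; if most
    # neighbors are unassigned, the winner is "Unassigned" anyway.
    if (share > knn_refine) refined[i] <- winner
  }
  refined
}

labels <- c("Tcell", "Tcell", "Unassigned", "Bcell", "Bcell", "Unassigned")
neighbors <- rbind(   # toy index matrix: one row per cell, k = 3 columns
  c(2, 3, 1),
  c(1, 3, 2),
  c(1, 2, 4),
  c(5, 6, 4),
  c(4, 6, 5),
  c(4, 5, 3)
)

loose <- refine_labels(labels, neighbors, knn_refine = 0.3)
strict <- refine_labels(labels, neighbors, knn_refine = 0.7)
```

With the 0.3 threshold, both unassigned cells take the label held by two of their three neighbors; with 0.7, a two-thirds share is not enough and they stay "Unassigned".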

Examples

classify(rule(simulated_umis, "Tcell", "CD3E", ">", 1))

[Package cellpypes version 0.3.0 Index]