findThreshold {shazam}R Documentation

Find distance threshold

Description

findThreshold automatically determines an optimal threshold for clonal assignment of Ig sequences using a vector of nearest neighbor distances. It provides two alternative methods using either a Gamma/Gaussian Mixture Model fit (method="gmm") or kernel density fit (method="density").

Usage

findThreshold(
  distances,
  method = c("density", "gmm"),
  edge = 0.9,
  cross = NULL,
  subsample = NULL,
  model = c("gamma-gamma", "gamma-norm", "norm-gamma", "norm-norm"),
  cutoff = c("optimal", "intersect", "user"),
  sen = NULL,
  spc = NULL,
  progress = FALSE
)

Arguments

distances

numeric vector containing nearest neighbor distances.

method

string defining the method to use for determining the optimal threshold. One of "gmm" or "density". See Details for methodological descriptions.

edge

upper range as a fraction of the data density to rule initialization of Gaussian fit parameters. Default value is 90 Applies only when method="density". .

cross

supplementary nearest neighbor distance vector output from distToNearest for initialization of the Gaussian fit parameters. Applies only when method="gmm".

subsample

maximum number of distances to subsample to before threshold detection.

model

allows the user to choose among four possible combinations of fitting curves: "norm-norm", "norm-gamma", "gamma-norm", and "gamma-gamma". Applies only when method="gmm".

cutoff

method to use for threshold selection: the optimal threshold "opt", the intersection point of the two fitted curves "intersect", or a value defined by user for one of the sensitivity or specificity "user". Applies only when method="gmm".

sen

sensitivity required. Applies only when method="gmm" and cutoff="user".

spc

specificity required. Applies only when method="gmm" and cutoff="user".

progress

if TRUE print a progress bar.

Details

Value

Note

Visually inspecting the resulting distribution fits is strongly recommended when using either fitting method. Empirical observations imply that the bimodality of the distance-to-nearest distribution is detectable for a minimum of 1,000 distances. Larger numbers of distances will improve the fitting procedure, although this can come at the expense of higher computational demands.

See Also

See distToNearest for generating the nearest neighbor distance vectors. See plotGmmThreshold and plotDensityThreshold for plotting output.

Examples


# Subset example data to 50 sequences, one sample and isotype as a demo
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, sample_id == "-1h" & c_call=="IGHG")[1:50,]

# Use nucleotide Hamming distance and normalize by junction length
db <- distToNearest(db, sequenceColumn="junction", vCallColumn="v_call",
                    jCallColumn="j_call", model="ham", normalize="len", nproc=1)
                            
# Find threshold using the "gmm" method with user defined specificity
output <- findThreshold(db$dist_nearest, method="gmm", model="gamma-gamma", 
                        cutoff="user", spc=0.99)
plot(output, binwidth=0.02, title=paste0(output@model, "   loglk=", output@loglk))
print(output)


[Package shazam version 1.2.0 Index]