collapseClones {shazam}    R Documentation

Constructs effective clonal sequences for all clones

Description

collapseClones creates effective input and germline sequences for each clonal group and appends columns containing the consensus sequences to the input data.frame.

Usage

collapseClones(
  db,
  cloneColumn = "clone_id",
  sequenceColumn = "sequence_alignment",
  germlineColumn = "germline_alignment_d_mask",
  muFreqColumn = NULL,
  regionDefinition = NULL,
  method = c("mostCommon", "thresholdedFreq", "catchAll", "mostMutated", "leastMutated"),
  minimumFrequency = NULL,
  includeAmbiguous = FALSE,
  breakTiesStochastic = FALSE,
  breakTiesByColumns = NULL,
  expandedDb = FALSE,
  nproc = 1,
  juncLengthColumn = "junction_length",
  fields = NULL
)

Arguments

db

data.frame containing sequence data. Required.

cloneColumn

character name of the column containing clonal identifiers. Required.

sequenceColumn

character name of the column containing input sequences. Required. The length of each input sequence should match that of its corresponding germline sequence.

germlineColumn

character name of the column containing germline sequences. Required. The length of each germline sequence should match that of its corresponding input sequence.

muFreqColumn

character name of the column containing mutation frequency. Optional. Applicable to the "mostMutated" and "leastMutated" methods. If not supplied, mutation frequency is computed by calling observedMutations. Default is NULL. See Cautions for note on usage.

regionDefinition

RegionDefinition object defining the regions and boundaries of the Ig sequences. Optional. Default is NULL.

method

method for calculating input consensus sequence. Required. One of "thresholdedFreq", "mostCommon", "catchAll", "mostMutated", or "leastMutated". See "Methods" for details.

minimumFrequency

frequency threshold for calculating input consensus sequence. Applicable to and required for the "thresholdedFreq" method. A canonical choice is 0.6. Default is NULL.

includeAmbiguous

whether to use ambiguous characters to represent positions at which there are multiple characters with frequencies that are at least minimumFrequency or that are maximal (i.e. ties). Applicable to and required for the "thresholdedFreq" and "mostCommon" methods. Default is FALSE. See "Choosing ambiguous characters" for rules on choosing ambiguous characters.

breakTiesStochastic

In case of ties, whether to randomly pick one of the sequences that fulfill the criteria to serve as the consensus. Applicable to and required for all methods except "catchAll". Default is FALSE. See "Methods" for details.

breakTiesByColumns

A list of the form list(c(col_1, col_2, ...), c(fun_1, fun_2, ...)), where col_i is the character name of a column in db, and fun_i is a function to be applied to that column. Currently, only max and min are supported. Note that the two c()'s in list() are essential (i.e. if there is only 1 column, the list should be of the form list(c(col_1), c(fun_1))). Applicable to and optional for the "mostMutated" and "leastMutated" methods. If supplied, the fun_i's are applied to the col_i's to help break ties. Default is NULL. See "Methods" for details.

expandedDb

logical indicating whether or not to return the expanded db, containing all the sequences (as opposed to returning just one sequence per clone).

nproc

Number of cores to distribute the operation over. If the cluster has already been set earlier, then pass the cluster. This will ensure that it is not reset.

juncLengthColumn

character name of the column containing the junction length. Needed when regionDefinition includes CDR3 and FWR4.

fields

additional fields used for grouping. For example, use sample_id to avoid combining sequences that share a clone_id but belong to different samples.
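
Two of the argument constraints above lend themselves to a quick pre-flight check before calling collapseClones: sequence and germline lengths should match, and breakTiesByColumns needs the list(c(...), c(...)) shape. The following base R sketch illustrates both on a toy data.frame. The data values, the name db_toy, and the helper checkTieBreakers are hypothetical and are not part of shazam.

```r
# Toy data.frame; column names follow the defaults in Usage, values are made up.
db_toy <- data.frame(
  clone_id = c("1", "1", "2"),
  sequence_alignment = c("ACGTAC", "ACGTAA", "ACG"),
  germline_alignment_d_mask = c("ACGTAC", "ACGTAC", "ACGTAC"),
  duplicate_count = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Rows whose input and germline sequences differ in length
length_mismatch <- nchar(db_toy$sequence_alignment) !=
  nchar(db_toy$germline_alignment_d_mask)
bad_rows <- which(length_mismatch)

# breakTiesByColumns: two parallel c() vectors inside a list,
# even when only one column is used
tie_breakers <- list(c("duplicate_count"), c(max))

# Hypothetical helper (not part of shazam): verify the list shape
checkTieBreakers <- function(x, db) {
  is.list(x) && length(x) == 2 &&
    length(x[[1]]) == length(x[[2]]) &&
    all(x[[1]] %in% colnames(db)) &&
    all(vapply(x[[2]], is.function, logical(1)))
}

bad_rows
checkTieBreakers(tie_breakers, db_toy)
```

Rows reported in bad_rows would need to be fixed or dropped before building consensus sequences.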

Value

A modified db with additional columns, including clonal_sequence and clonal_germline.

clonal_sequence is generated with the method of choice indicated by method, and clonal_germline is generated with the "mostCommon" method, along with, where applicable, user-defined parameters such as minimumFrequency, includeAmbiguous, breakTiesStochastic, and breakTiesByColumns.

Consensus lengths

For each clone, clonal_sequence and clonal_germline have the same length.

Methods

The descriptions below use "sequences" as a generalization of input sequences and germline sequences.
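
To make the idea of a position-wise consensus concrete, here is a minimal base R sketch of a most-common and a frequency-thresholded consensus. It is an illustration only and not shazam's implementation: the helper consensusSketch is hypothetical, ties are resolved by taking the alphabetically first character, and positions where no character reaches the threshold are filled with "N". Both of those behaviors are assumptions of this sketch.

```r
# Hypothetical helper, not shazam's implementation.
consensusSketch <- function(seqs, minimumFrequency = NULL) {
  # All sequences must have the same length
  stopifnot(length(unique(nchar(seqs))) == 1)
  # One row per sequence, one column per position
  chars <- do.call(rbind, strsplit(seqs, ""))
  consensus <- apply(chars, 2, function(column) {
    freq <- table(column) / length(column)
    top <- names(freq)[freq == max(freq)]
    # Assumption: ties go to the alphabetically first character
    pick <- sort(top)[1]
    # Assumption: below-threshold positions become "N"
    if (!is.null(minimumFrequency) && max(freq) < minimumFrequency) {
      "N"
    } else {
      pick
    }
  })
  paste(consensus, collapse = "")
}

seqs <- c("ACGT", "ACGA", "ACTA", "AGTC")
consensusSketch(seqs)                          # most common character per position
consensusSketch(seqs, minimumFrequency = 0.6)  # thresholded variant
```

With these four toy sequences, positions 3 and 4 have a top frequency of 0.5, so the thresholded call masks them while the most-common call still picks a character.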

Choosing ambiguous characters

Ambiguous characters may be present in the returned consensuses when using the "catchAll" method and when using the "thresholdedFreq" or "mostCommon" methods with includeAmbiguous=TRUE.
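
As general background, the sketch below maps a set of observed nucleotides to its standard IUPAC ambiguity code in base R. The helper toAmbiguous is hypothetical and covers only A, C, G, and T; how shazam treats gaps, ".", or existing "N" characters is not modeled here and may differ.

```r
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases
iupac_codes <- c(
  A = "A", C = "C", G = "G", T = "T",
  AG = "R", CT = "Y", CG = "S", AT = "W", GT = "K", AC = "M",
  CGT = "B", AGT = "D", ACT = "H", ACG = "V", ACGT = "N"
)

# Hypothetical helper (not part of shazam): returns NA for unsupported characters
toAmbiguous <- function(bases) {
  key <- paste(sort(unique(toupper(bases))), collapse = "")
  unname(iupac_codes[key])
}

toAmbiguous(c("G", "A"))       # purines
toAmbiguous(c("T", "C", "T"))  # pyrimidines; duplicates are ignored
```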

The rules on choosing ambiguous characters are as follows:

Cautions

See Also

See IMGT_SCHEMES for a set of predefined RegionDefinition objects.

Examples

# Subset example data
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, c_call %in% c("IGHA", "IGHG") & sample_id == "+7d" &
                        clone_id %in% c("3100", "3141", "3184"))

# thresholdedFreq method, resolving ties deterministically without using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="thresholdedFreq", minimumFrequency=0.6,
                         includeAmbiguous=FALSE, breakTiesStochastic=FALSE)

# mostCommon method, resolving ties deterministically using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="mostCommon", 
                         includeAmbiguous=TRUE, breakTiesStochastic=FALSE)

# Make a copy of db that has a mutation frequency column
db2 <- observedMutations(db, frequency=TRUE, combine=TRUE)

# mostMutated method, resolving ties stochastically
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="mostMutated", muFreqColumn="mu_freq", 
                         breakTiesStochastic=TRUE, breakTiesByColumns=NULL)
                         
# mostMutated method, resolving ties deterministically using additional columns
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="mostMutated", muFreqColumn="mu_freq", 
                         breakTiesStochastic=FALSE, 
                         breakTiesByColumns=list(c("duplicate_count"), c(max)))

# Build consensus for V segment only
# Capture all nucleotide variations using ambiguous characters 
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="catchAll", regionDefinition=IMGT_V)

# Return the same number of rows as the input
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment", 
                         germlineColumn="germline_alignment_d_mask",
                         method="mostCommon", expandedDb=TRUE)


[Package shazam version 1.2.0 Index]