observedMutations {shazam}    R Documentation

Calculate observed numbers of mutations

Description

observedMutations calculates the observed number of mutations for each sequence in the input data.frame.

Usage

observedMutations(
  db,
  sequenceColumn = "sequence_alignment",
  germlineColumn = "germline_alignment_d_mask",
  regionDefinition = NULL,
  mutationDefinition = NULL,
  ambiguousMode = c("eitherOr", "and"),
  frequency = FALSE,
  combine = FALSE,
  nproc = 1,
  cloneColumn = "clone_id",
  juncLengthColumn = "junction_length"
)

Arguments

db

data.frame containing sequence data.

sequenceColumn

character name of the column containing input sequences. IUPAC ambiguous characters for DNA are supported.

germlineColumn

character name of the column containing the germline or reference sequence. IUPAC ambiguous characters for DNA are supported.

regionDefinition

RegionDefinition object defining the regions and boundaries of the Ig sequences. If NULL, mutations are counted over the entire sequence. To use region definitions, the sequences in sequenceColumn and germlineColumn must be aligned following the IMGT schema.

mutationDefinition

MutationDefinition object defining replacement and silent mutation criteria. If NULL, replacement and silent mutations are determined by exact amino acid identity.

ambiguousMode

whether to consider ambiguous characters as "either or" or "and" when determining and counting the type(s) of mutations. Applicable only if sequenceColumn and/or germlineColumn contain(s) ambiguous characters. One of c("eitherOr", "and"). Default is "eitherOr".

frequency

logical indicating whether or not to calculate mutation frequencies. Default is FALSE.

combine

logical indicating whether the mutation counts for the different regions (CDR, FWR) and mutation types should be combined for each sequence, returning a single count or frequency value per sequence instead of multiple values. Default is FALSE.

nproc

number of cores to distribute the operation over. If the cluster has already been set up, call the function with nproc = 0 so that it is not reset or reinitialized. Default is nproc = 1.

cloneColumn

character name of the column in db containing clone identifiers.

juncLengthColumn

character name of the column in db containing junction lengths.

Details

Mutation counts are determined by comparing a reference sequence to the input sequences in the column specified by sequenceColumn. See calcObservedMutations for more technical details, including criteria for which sequence differences are included in the mutation counts and which are not.

The mutations are binned as either replacement (R) or silent (S) across the different regions of the sequences as defined by regionDefinition. Typically, this would be the framework (FWR) and complementarity determining (CDR) regions of IMGT-gapped nucleotide sequences. Mutation counts are appended to the input db as additional columns.

If db includes lineage information, such as the parent_sequence column created by makeGraphDf, that column can be used as the reference sequence by passing its name to the germlineColumn argument.

Value

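A modified db data.frame with observed mutation counts for each sequence listed. The column names are dynamically created based on the regions in the regionDefinition. For example, when using the IMGT_V definition, which defines positions for CDR and FWR, the following columns are added: mu_count_cdr_r, mu_count_cdr_s, mu_count_fwr_r and mu_count_fwr_s, holding the number of replacement (r) and silent (s) mutations in the CDR and FWR regions.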

If frequency=TRUE, R and S mutation frequencies are calculated over the number of non-N positions in the specified regions and returned in the corresponding mu_freq_* columns (for example, mu_freq_cdr_r).

If frequency=TRUE and combine=TRUE, the mutations and non-N positions are aggregated and a single mu_freq value is returned per sequence.

See Also

calcObservedMutations is called by this function to get the number of mutations in each sequence grouped by the RegionDefinition. See IMGT_SCHEMES for a set of predefined RegionDefinition objects. See expectedMutations for calculating expected mutation frequencies. See makeGraphDf for creating the field parent_sequence.

Examples

# Subset example data
data(ExampleDb, package="alakazam")
db <- ExampleDb[1:10, ]

# Calculate mutation frequency over the entire sequence
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            frequency=TRUE,
                            nproc=1)
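
# A minimal sketch of the combined output described under Value:
# with frequency=TRUE and combine=TRUE, a single mu_freq column is added.
# db_comb is an illustrative variable name only.
db_comb <- observedMutations(db, sequenceColumn="sequence_alignment",
                             germlineColumn="germline_alignment_d_mask",
                             frequency=TRUE,
                             combine=TRUE,
                             nproc=1)
summary(db_comb$mu_freq)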

# Count of V-region mutations split by FWR and CDR
# With mutations only considered replacement if charge changes
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            regionDefinition=IMGT_V,
                            mutationDefinition=CHARGE_MUTATIONS,
                            nproc=1)
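
# A quick check, as a sketch: list the columns appended to the input
# by the call above (the names depend on the regionDefinition used)
setdiff(names(db_obs), names(db))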
                            
# Count of VDJ-region mutations, split by FWR and CDR
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            regionDefinition=IMGT_VDJ,
                            nproc=1)
                            
# Extend data with lineage information
data(ExampleTrees, package="alakazam")
graph <- ExampleTrees[[17]]
clone <- alakazam::makeChangeoClone(subset(ExampleDb, clone_id == graph$clone))
gdf <- makeGraphDf(graph, clone)

# Count of mutations between observed sequence and immediate ancestor
db_obs <- observedMutations(gdf, sequenceColumn="sequence",
                            germlineColumn="parent_sequence",
                            regionDefinition=IMGT_VDJ,
                            nproc=1)    
    

[Package shazam version 1.2.0 Index]