collapseClones {shazam} | R Documentation |
Constructs effective clonal sequences for all clones
Description
collapseClones
creates effective input and germline sequences for each clonal
group and appends columns containing the consensus sequences to the input
data.frame
.
Usage
collapseClones(
db,
cloneColumn = "clone_id",
sequenceColumn = "sequence_alignment",
germlineColumn = "germline_alignment_d_mask",
muFreqColumn = NULL,
regionDefinition = NULL,
method = c("mostCommon", "thresholdedFreq", "catchAll", "mostMutated", "leastMutated"),
minimumFrequency = NULL,
includeAmbiguous = FALSE,
breakTiesStochastic = FALSE,
breakTiesByColumns = NULL,
expandedDb = FALSE,
nproc = 1,
juncLengthColumn = "junction_length",
fields = NULL
)
Arguments
db |
|
cloneColumn |
|
sequenceColumn |
|
germlineColumn |
|
muFreqColumn |
|
regionDefinition |
RegionDefinition object defining the regions
and boundaries of the Ig sequences. Optional. Default is
|
method |
method for calculating input consensus sequence. Required.
One of |
minimumFrequency |
frequency threshold for calculating input consensus sequence.
Applicable to and required for the |
includeAmbiguous |
whether to use ambiguous characters to represent positions
at which there are multiple characters with frequencies that
are at least |
breakTiesStochastic |
In case of ties, whether to randomly pick a sequence from
sequences that fulfill the criteria as consensus. Applicable
to and required for all methods except for |
breakTiesByColumns |
A list of the form
|
expandedDb |
|
nproc |
Number of cores to distribute the operation over. If the
|
juncLengthColumn |
|
fields |
additional fields used for grouping. Use sample_id, to avoid combining sequences with the same clone_id that belong to different sample_id. |
Value
A modified db
with the following additional columns:
-
clonal_sequence
: effective sequence for the clone. -
clonal_germline
: germline sequence for the clone. -
clonal_sequence_mufreq
: mutation frequency ofclonal_sequence
; only added for the"mostMutated"
and"leastMutated"
methods.
clonal_sequence
is generated with the method of choice indicated
by method
, and clonal_germline
is generated with the
"mostCommon"
method, along with, where applicable, user-defined
parameters such as minimumFrequency
, includeAmbiguous
,
breakTiesStochastic
, and breakTiesByColumns
.
Consensus lengths
For each clone, clonal_sequence
and
clonal_germline
have the same length.
For the
"thresholdedFreq"
,"mostCommon"
, and"catchAll"
methods:The length of the consensus sequences is determined by the longest possible consensus sequence (baesd on
inputSeq
andgermlineSeq
) andregionDefinition@seqLength
(if supplied), whichever is shorter.Given a set of sequences of potentially varying lengths, the longest possible length of their consensus sequence is taken to be the longest length along which there is information contained at every nucleotide position across majority of the sequences. Majority is defined to be greater than
floor(n/2)
, wheren
is the number of sequences. If the longest possible consensus length is 0, there will be a warning and an empty string (""
) will be returned.If a length limit is defined by supplying a
regionDefinition
viaregionDefinition@seqLength
, the consensus length will be further restricted to the shorter of the longest possible length andregionDefinition@seqLength
.For the
"mostMutated"
and"leastMutated"
methods:The length of the consensus sequences depends on that of the most/least mutated input sequence, and, if supplied, the length limit defined by
regionDefinition@seqLength
, whichever is shorter. If the germline consensus computed using the"mostCommon"
method is longer than the most/least mutated input sequence, the germline consensus is trimmed to be of the same length as the input consensus.
Methods
The descriptions below use "sequences" as a generalization of input sequences and germline sequences.
-
method="thresholdedFreq"
A threshold must be supplied to the argument
minimumFrequency
. At each position along the length of the consensus sequence, the frequency of each nucleotide/character across sequences is tabulated. The nucleotide/character whose frequency is at least (i.e.>=
)minimumFrequency
becomes the consensus; if there is none, the consensus nucleotide will be"N"
.When there are ties (frequencies of multiple nucleotides/characters are at least
minimumFrequency
), this method can be deterministic or stochastic, depending on additional parameters.With
includeAmbiguous=TRUE
, ties are resolved deterministically by representing ties using ambiguous characters. See "Choosing ambiguous characters" for how ambiguous characters are chosen.With
breakTiesStochastic=TRUE
, ties are resolved stochastically by randomly picking a character amongst the ties.When both
TRUE
,includeAmbiguous
takes precedence overbreakTiesStochastic
.When both
FALSE
, the first character from the ties is taken to be the consensus following the order of"A"
,"T"
,"G"
,"C"
,"N"
,"."
, and"-"
.
Below are some examples looking at a single position based on 5 sequences with
minimumFrequency=0.6
,includeAmbiguous=FALSE
, andbreakTiesStochastic=FALSE
:If the sequences have
"A"
,"A"
,"A"
,"T"
,"C"
, the consensus will be"A"
, because"A"
has frequency 0.6, which is at leastminimumFrequency
.If the sequences have
"A"
,"A"
,"T"
,"T"
,"C"
, the consensus will be"N"
, because none of"A"
,"T"
, or"C"
has frequency that is at leastminimumFrequency
.
-
method="mostCommon"
The most frequent nucleotide/character across sequences at each position along the length of the consensus sequence makes up the consensus.
When there are ties (multiple nucleotides/characters with equally maximal frequencies), this method can be deterministic or stochastic, depending on additional parameters. The same rules for breaking ties for
method="thresholdedFreq"
apply.Below are some examples looking at a single position based on 5 sequences with
includeAmbiguous=FALSE
, andbreakTiesStochastic=FALSE
:If the sequences have
"A"
,"A"
,"T"
,"A"
,"C"
, the consensus will be"A"
.If the sequences have
"T"
,"T"
,"C"
,"C"
,"G"
, the consensus will be"T"
, because"T"
is before"C"
in the order of"A"
,"T"
,"G"
,"C"
,"N"
,"."
, and"-"
.
-
method="catchAll"
This method returns a consensus sequence capturing most of the information contained in the sequences. Ambiguous characters are used where applicable. See "Choosing ambiguous characters" for how ambiguous characters are chosen. This method is deterministic and does not involve breaking ties.
Below are some examples for
method="catchAll"
looking at a single position based on 5 sequences:If the sequences have
"N"
,"N"
,"N"
,"N"
,"N"
, the consensus will be"N"
.If the sequences have
"N"
,"A"
,"A"
,"A"
,"A"
, the consensus will be"A"
.If the sequences have
"N"
,"A"
,"G"
,"A"
,"A"
, the consensus will be"R"
.If the sequences have
"-"
,"-"
,"."
,"."
,"."
, the consensus will be"-"
.If the sequences have
"-"
,"-"
,"-"
,"-"
,"-"
, the consensus will be"-"
.If the sequences have
"."
,"."
,"."
,"."
,"."
, the consensus will be"."
.
-
method="mostMutated"
andmethod="leastMutated"
These methods return the most/least mutated sequence as the consensus sequence.
When there are ties (multple sequences have the maximal/minimal mutation frequency), this method can be deterministic or stochastic, depending on additional parameters.
With
breakTiesStochastic=TRUE
, ties are resolved stochastically by randomly picking a sequence out of sequences with the maximal/minimal mutation frequency.When
breakTiesByColumns
is supplied, ties are resolved deterministically. Column by column, a function is applied on the column and sequences with column value matching the functional value are retained, until ties are resolved or columns run out. In the latter case, the first remaining sequence is taken as the consensus.When
breakTiesStochastic=TRUE
andbreakTiesByColumns
is also supplied,breakTiesStochastic
takes precedence overbreakTiesByColumns
.When
breakTiesStochastic=FALSE
andbreakTiesByColumns
is not supplied (i.e.NULL
), the sequence that appears first amongst the ties is taken as the consensus.
Choosing ambiguous characters
Ambiguous characters may be present in the returned consensuses when using the
"catchAll"
method and when using the "thresholdedFreq"
or
"mostCommon"
methods with includeAmbiguous=TRUE
.
The rules on choosing ambiguous characters are as follows:
If a position contains only
"N"
across sequences, the consensus at that position is"N"
.If a position contains one or more of
"A"
,"T"
,"G"
, or"C"
, the consensus will be an IUPAC character representing all of the characters present, regardless of whether"N"
,"-"
, or"."
is present.If a position contains only
"-"
and"."
across sequences, the consensus at that position is taken to be"-"
.If a position contains only one of
"-"
or"."
across sequences, the consensus at that position is taken to be the character present.
Cautions
Note that this function does not perform multiple sequence alignment. As a prerequisite, it is assumed that the sequences in
sequenceColumn
andgermlineColumn
have been aligned somehow. In the case of immunoglobulin repertoire analysis, this usually means that the sequences are IMGT-gapped.When using the
"mostMutated"
and"leastMutated"
methods, if you supply bothmuFreqColumn
andregionDefinition
, it is your responsibility to ensure that the mutation frequency inmuFreqColumn
was calculated with sequence lengths restricted to the sameregionDefinition
you are supplying. Otherwise, the "most/least mutated" sequence you obtain might not be the most/least mutated given theregionDefinition
supplied, because your mutation frequency was based on aregionDefinition
different from the one supplied.If you intend to run
collapseClones
before building a 5-mer targeting model, you must choose parameters such that your collapsed clonal consensuses do not include ambiguous characters. This is because the targeting model functions do NOT support ambiguous characters in their inputs.
See Also
See IMGT_SCHEMES for a set of predefined RegionDefinition objects.
Examples
# Subset example data
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, c_call %in% c("IGHA", "IGHG") & sample_id == "+7d" &
clone_id %in% c("3100", "3141", "3184"))
# thresholdedFreq method, resolving ties deterministically without using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="thresholdedFreq", minimumFrequency=0.6,
includeAmbiguous=FALSE, breakTiesStochastic=FALSE)
# mostCommon method, resolving ties deterministically using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon",
includeAmbiguous=TRUE, breakTiesStochastic=FALSE)
# Make a copy of db that has a mutation frequency column
db2 <- observedMutations(db, frequency=TRUE, combine=TRUE)
# mostMutated method, resolving ties stochastically
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=TRUE, breakTiesByColumns=NULL)
# mostMutated method, resolving ties deterministically using additional columns
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=FALSE,
breakTiesByColumns=list(c("duplicate_count"), c(max)))
# Build consensus for V segment only
# Capture all nucleotide variations using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="catchAll", regionDefinition=IMGT_V)
# Return the same number of rows as the input
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon", expandedDb=TRUE)