workflowPartialMatch {distantia} | R Documentation |
Finds the section in a long sequence that best matches a short sequence.
Description
This workflow addresses the following scenario: the user has a short sequence and a long sequence, and wants to find the segment of the long sequence that best matches the short sequence. The function automatically identifies which sequence is the short one and which is the long one, and throws an error if more than two sequences are provided. The lengths of the segments of the long sequence to be compared with the short sequence are defined through the arguments min.length and max.length. If left empty, min.length and max.length equal 0, meaning that the segment to be searched for will have the same number of cases as the short sequence. Note that this is a brute-force algorithm, and it can have a large memory footprint if the interval between min.length and max.length is too wide. It may be convenient to pre-check the number of iterations to be performed by computing sum(nrow(long.sequence) - min.length:max.length) + 1. The algorithm is parallelized and optimized as far as possible, so large searches are still feasible.
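As a rough sketch of that pre-check, the expression given above can be evaluated with plain numbers before any sequence is prepared. The sizes below are hypothetical and chosen only for illustration; n.long stands in for nrow(long.sequence).

```r
# Hypothetical sizes, for illustration only
n.long <- 30       # stands in for nrow(long.sequence)
min.length <- 10   # shortest candidate segment, in rows
max.length <- 12   # longest candidate segment, in rows

# Expression from the Description. Note that ":" binds tighter than "-",
# so this subtracts each length in 10:12 from n.long before summing.
iterations <- sum(n.long - min.length:max.length) + 1
iterations
```

If the result is much larger than expected, narrowing the interval between min.length and max.length is the simplest way to reduce the size of the search.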
Usage
workflowPartialMatch(
sequences = NULL,
grouping.column = NULL,
time.column = NULL,
exclude.columns = NULL,
method = "manhattan",
diagonal = FALSE,
paired.samples = FALSE,
min.length = NULL,
max.length = NULL,
ignore.blocks = FALSE,
parallel.execution = TRUE
)
Arguments
sequences |
dataframe with multiple sequences identified by a grouping column generated by |
grouping.column |
character string, name of the column in |
time.column |
character string, name of the column with time/depth/rank data. |
exclude.columns |
character string or character vector with column names in |
method |
character string naming a distance metric. Valid entries are: "manhattan", "euclidean", "chi", and "hellinger". Invalid entries will throw an error. |
diagonal |
boolean, if |
paired.samples |
boolean, if |
min.length |
integer, minimum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If |
max.length |
integer, maximum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If |
ignore.blocks |
boolean. If |
parallel.execution |
boolean, if |
Value
A dataframe with three columns:
- first.row: first row of the segment in the long sequence matched against the short one.
- last.row: last row of the segment in the long sequence matched against the short one.
- psi: psi values, ordered from lower (maximum similarity / minimum dissimilarity) to higher.
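A minimal sketch of how such an output might be used to pull the best-matching segment out of the long sequence. The data frames psi.mock and long.mock are invented stand-ins with made-up values; only the column names first.row, last.row, and psi come from the description above.

```r
# Mock output with the three documented columns (values are invented)
psi.mock <- data.frame(
  first.row = c(1, 5, 12),
  last.row = c(10, 14, 21),
  psi = c(0.00, 0.35, 0.80)
)

# Mock long sequence with 30 rows (hypothetical data)
long.mock <- data.frame(row.id = 1:30, value = (1:30) / 10)

# The lowest psi marks the most similar segment. The output is described
# as already sorted, but which.min() does not depend on that ordering.
best <- psi.mock[which.min(psi.mock$psi), ]

# Rows of the long sequence covered by the best match
best.segment <- long.mock[best$first.row:best$last.row, ]
nrow(best.segment)
```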
Author(s)
Blas Benito <blasbenito@gmail.com>
Examples
#loading the package and the data
library(distantia)
data(sequencesMIS)
#removing grouping column
sequencesMIS$MIS <- NULL
#mock-up short sequence
MIS.short <- sequencesMIS[1:10, ]
#mock-up long sequence
MIS.long <- sequencesMIS[1:30, ]
#preparing sequences
MIS.sequences <- prepareSequences(
  sequence.A = MIS.short,
  sequence.A.name = "short",
  sequence.B = MIS.long,
  sequence.B.name = "long",
  grouping.column = "id",
  transformation = "hellinger"
)
#matching sequences
#min.length and max.length are left at
#their defaults (NULL) to speed up execution
MIS.psi <- workflowPartialMatch(
  sequences = MIS.sequences,
  grouping.column = "id",
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  parallel.execution = FALSE
)
#output dataframe
MIS.psi
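
The brute-force search outlined in the Description can be pictured with a small base R sketch. This is not the distantia implementation: it scores each window with a plain summed Manhattan distance between aligned rows instead of psi, it only tries windows as long as the short sequence, and all object names (long.mock, short.mock, result) are hypothetical.

```r
# Illustration only: NOT the distantia implementation.
# Each window is scored with a summed Manhattan distance between
# aligned rows, used here as a simple stand-in for psi.
set.seed(1)
long.mock <- matrix(runif(60), nrow = 30, ncol = 2)
short.mock <- long.mock[11:15, , drop = FALSE]  # true position: rows 11 to 15

# Every contiguous window with the same number of rows as the short sequence
window.length <- nrow(short.mock)
first.rows <- 1:(nrow(long.mock) - window.length + 1)

scores <- vapply(
  first.rows,
  function(i) {
    window <- long.mock[i:(i + window.length - 1), , drop = FALSE]
    sum(abs(window - short.mock))
  },
  numeric(1)
)

# Output shaped like the documented result, sorted from best to worst match
result <- data.frame(
  first.row = first.rows,
  last.row = first.rows + window.length - 1,
  score = scores
)
result <- result[order(result$score), ]
head(result, 3)
```

Because short.mock is copied from rows 11 to 15 of long.mock, the top row of result points back at that window with a score of zero, which makes the sketch easy to check by eye.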