similarity-methods {arulesSequences} | R Documentation
Compute Similarities
Description
Provides the generic function similarity and the S4 method to compute similarities among a collection of sequences.
is.subset and is.superset find subsequence or supersequence relationships among a collection of sequences.
Usage
similarity(x, y = NULL, ...)
## S4 method for signature 'sequences'
similarity(x, y = NULL,
method = c("jaccard", "dice", "cosine", "subset"),
strict = FALSE)
## S4 method for signature 'sequences'
is.subset(x, y = NULL, proper = FALSE)
## S4 method for signature 'sequences'
is.superset(x, y = NULL, proper = FALSE)
Arguments
x, y      an object.
...       further (unused) arguments.
method    a string specifying the similarity measure to use (see Details).
strict    a logical value specifying whether strict itemset matching should be used.
proper    a logical value specifying whether only strict relationships (omitting equality) should be indicated.
Details
Let the number of common elements of two sequences refer to those that occur in a longest common subsequence. The following similarity measures are implemented:
jaccard: The number of common elements divided by the total number of elements (the sum of the lengths of the sequences minus the length of the longest common subsequence).
dice: Two times the number of common elements divided by the sum of the lengths of the sequences.
cosine: The number of common elements divided by the square root of the product of the sequence lengths.
subset: Zero if the first sequence is not a subsequence of the second. Otherwise the number of common elements divided by the number of elements in the first sequence.
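The measures above can be sketched in base R. This is an illustrative sketch, not the package's implementation: the helper names lcs_length and seq_similarity are hypothetical, sequence elements are treated as atomic values compared for exact equality, and empty sequences are not handled.

```r
# Length of a longest common subsequence of two vectors (dynamic programming).
# Assumption: elements are atomic and matched by exact equality.
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # tab[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j]
  tab <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (identical(a[[i]], b[[j]])) {
        tab[i, j] + 1
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# Hypothetical helper: the four measures, computed from the LCS length.
seq_similarity <- function(a, b,
                           method = c("jaccard", "dice", "cosine", "subset")) {
  method <- match.arg(method)
  n <- length(a)
  m <- length(b)
  common <- lcs_length(a, b)
  switch(method,
    jaccard = common / (n + m - common),
    dice    = 2 * common / (n + m),
    cosine  = common / sqrt(n * m),
    # a is a subsequence of b exactly when the LCS covers all of a
    subset  = if (common == n) common / n else 0
  )
}

a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")
seq_similarity(a, b, "jaccard")          # common = 3, so 3 / (4 + 4 - 3)
seq_similarity(c("A", "C"), b, "subset") # c("A", "C") is a subsequence of b
```

The two nested loops fill an (n + 1) by (m + 1) table, which is where the O(n*m) cost of the longest common subsequence comes from.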
If strict = TRUE, the elements (itemsets) of the sequences must be equal to be matched. Otherwise matches are quantified by the similarity of the itemsets (as specified by method) thresholded at 0.5, and the common sequence by the sum of the similarities.
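One possible reading of the non-strict case is a weighted longest common subsequence, sketched below in base R. The helper names itemset_jaccard and weighted_common are hypothetical, the itemset measure is fixed to Jaccard for brevity, and treating the 0.5 threshold as inclusive is an assumption; the package may differ in these details.

```r
# Jaccard similarity of two itemsets given as character vectors.
itemset_jaccard <- function(s, t) {
  length(intersect(s, t)) / length(union(s, t))
}

# Weighted common part of two sequences (lists of itemsets): a match
# contributes its itemset similarity, and similarities below the
# threshold count as no match (assumption: threshold is inclusive).
weighted_common <- function(a, b, threshold = 0.5) {
  n <- length(a)
  m <- length(b)
  tab <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      w <- itemset_jaccard(a[[i]], b[[j]])
      if (w < threshold) w <- 0
      tab[i + 1, j + 1] <- max(tab[i, j] + w, tab[i, j + 1], tab[i + 1, j])
    }
  }
  tab[n + 1, m + 1]
}

a <- list(c("A", "B", "C"), "D", c("E", "F"))
b <- list(c("A", "B"), c("D", "G", "H"), c("E", "F"))

common <- weighted_common(a, b)
# Sequence-level Jaccard using the weighted common part
common / (length(a) + length(b) - common)
```

Here the first pair of itemsets matches with similarity 2/3, the second pair falls below the threshold (1/3), and the third pair matches exactly, so the weighted common part is 5/3. Under strict matching only the third pair would count.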
Value
For similarity, returns an object of class dsCMatrix if the result is symmetric (or method = "subset") and an object of class dgCMatrix otherwise.
For is.subset and is.superset, returns an object of class lgCMatrix.
Note
Computation of the longest common subsequence of two sequences of length n, m takes O(n*m) time.
The supported set of operations for the above matrix classes depends on package Matrix. In case of problems, expand to full storage representation using as(x, "matrix") or as.matrix(x).
For efficiency use as(x, "dist") to convert a symmetric result matrix for clustering.
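As a rough illustration of the fallback to full storage, the sketch below builds a small symmetric sparse matrix with package Matrix as a stand-in for a similarity result, expands it, and clusters it with base R. The values are made up for illustration, and turning similarities into dissimilarities via 1 - s is an assumption of this sketch, not something the package prescribes.

```r
library(Matrix)

# Stand-in for a symmetric result of similarity(); values are illustrative.
labs <- c("s1", "s2", "s3")
s <- Matrix(c(1.0, 0.6, 0.2,
              0.6, 1.0, 0.5,
              0.2, 0.5, 1.0),
            nrow = 3, sparse = TRUE,
            dimnames = list(labs, labs))

# Expand to full storage representation
dense <- as.matrix(s)

# Assumption: dissimilarity is taken as 1 - similarity
d <- as.dist(1 - dense)

# Hierarchical clustering on the resulting dist object
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
```

With these values, s1 and s2 are the closest pair and end up in the same group.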
Author(s)
Christian Buchta
See Also
Class sequences, method dissimilarity.
Examples
library(arulesSequences)
## use example data
data(zaki)
z <- as(zaki, "timedsequences")
similarity(z)
# require equality
similarity(z, strict = TRUE)
## emphasize common
similarity(z, method = "dice")
## subsequence relationships, with and without equality
is.subset(z)
is.subset(z, proper = TRUE)