similarity-methods {arulesSequences} R Documentation

## Compute Similarities

### Description

Provides the generic function `similarity` and the S4 method to compute similarities among a collection of sequences.

`is.subset` and `is.superset` find subsequence or supersequence relationships among a collection of sequences.

### Usage

```
similarity(x, y = NULL, ...)

## S4 method for signature 'sequences'
similarity(x, y = NULL,
    method = c("jaccard", "dice", "cosine", "subset"),
    strict = FALSE)

## S4 method for signature 'sequences'
is.subset(x, y = NULL, proper = FALSE)

## S4 method for signature 'sequences'
is.superset(x, y = NULL, proper = FALSE)
```

### Arguments

`x, y`:

an object.

`...`:

further (unused) arguments.

`method`:

a string specifying the similarity measure to use (see details).

`strict`:

a logical value specifying if strict itemset matching should be used.

`proper`:

a logical value specifying if only strict relationships (omitting equality) should be indicated.

### Details

Let the number of common elements of two sequences refer to those that occur in a longest common subsequence. The following similarity measures are implemented:

`jaccard`:

The number of common elements divided by the total number of elements (the sum of the lengths of the sequences minus the length of the longest common subsequence).

`dice`:

Two times the number of common elements divided by the sum of the lengths of the sequences.

`cosine`:

The number of common elements divided by the square root of the product of the sequence lengths.

`subset`:

Zero if the first sequence is not a subsequence of the second. Otherwise the number of common elements divided by the number of elements in the first sequence.
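As a rough illustration of the measures above, the following base R sketch computes them for two sequences whose elements are compared for exact equality. The helper names `lcs_length` and `seq_similarity` are hypothetical and not part of the package, the sequences are simplified to non-empty character vectors, and the denominator for `dice` follows the standard Dice coefficient, which is an assumption here.

```r
# Length of a longest common subsequence of two vectors (dynamic programming).
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # tab[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j]
  tab <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (a[i] == b[j]) {
        tab[i, j] + 1
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# Hypothetical helper: similarity of two sequences under exact matching.
seq_similarity <- function(a, b,
                           method = c("jaccard", "dice", "cosine", "subset")) {
  method <- match.arg(method)
  common <- lcs_length(a, b)
  n <- length(a)
  m <- length(b)
  switch(method,
    jaccard = common / (n + m - common),
    dice    = 2 * common / (n + m),      # assumed: standard Dice denominator
    cosine  = common / sqrt(n * m),
    subset  = if (common == n) common / n else 0
  )
}

a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E", "F")
seq_similarity(a, b, "jaccard")  # 3 common elements: 3 / (4 + 5 - 3) = 0.5
seq_similarity(a, b, "dice")     # 2 * 3 / (4 + 5)
seq_similarity(a, b, "cosine")   # 3 / sqrt(4 * 5)
seq_similarity(a, b, "subset")   # 0, because a is not a subsequence of b
```

With exact matching, `subset` can only be zero or one in this sketch, since the first sequence is a subsequence of the second exactly when all of its elements are common.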

If `strict = TRUE`, the elements (itemsets) of the sequences must be equal to be matched. Otherwise, matches are quantified by the similarity of the itemsets (as specified by `method`), thresholded at 0.5, and the common subsequence is quantified by the sum of these similarities.
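One plausible reading of the non-strict case can be sketched in base R as a weighted longest common subsequence. This is not the package's implementation: the helper names `itemset_jaccard` and `weighted_lcs` are hypothetical, sequences are simplified to lists of character vectors, only the Jaccard measure is shown, and treating the 0.5 threshold as inclusive is an assumption.

```r
# Jaccard similarity of two itemsets given as character vectors.
itemset_jaccard <- function(s, t) {
  length(intersect(s, t)) / length(union(s, t))
}

# Weighted longest common subsequence: a matched pair of itemsets
# contributes its similarity if that reaches the threshold, else nothing.
# ASSUMPTION: the threshold is inclusive (similarity >= 0.5 counts).
weighted_lcs <- function(x, y, threshold = 0.5) {
  n <- length(x)
  m <- length(y)
  # tab[i + 1, j + 1] holds the best total weight for x[1:i] and y[1:j]
  tab <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      w <- itemset_jaccard(x[[i]], y[[j]])
      if (w < threshold) w <- 0
      tab[i + 1, j + 1] <- max(tab[i, j] + w, tab[i, j + 1], tab[i + 1, j])
    }
  }
  tab[n + 1, m + 1]
}

x <- list(c("A", "B"), "C", c("D", "E"))
y <- list("A", c("C", "F"), c("D", "E"))

common <- weighted_lcs(x, y)
common                                      # 0.5 + 0.5 + 1 = 2
common / (length(x) + length(y) - common)   # sequence-level Jaccard: 2 / 4
```

Under exact matching only the last pair of itemsets would count, so the thresholded weights let partially overlapping itemsets contribute to the common subsequence.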

### Value

For `similarity`, returns an object of class `dsCMatrix` if the result is symmetric (or `method = "subset"`) and an object of class `dgCMatrix` otherwise.

For `is.subset` and `is.superset`, returns an object of class `lgCMatrix`.
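To illustrate the kind of relationship a single entry of the `is.subset` result indicates, here is a minimal base R sketch for one pair of sequences. It assumes the usual containment definition from sequential pattern mining (each itemset of the first sequence is a subset of a distinct itemset of the second, in order). The function `is_subsequence` is hypothetical, and sequences are simplified to lists of character vectors.

```r
# TRUE if sequence x (a list of itemsets) is contained in sequence y.
is_subsequence <- function(x, y, proper = FALSE) {
  j <- 1
  for (element in x) {
    # advance in y until an itemset containing `element` is found
    while (j <= length(y) && !all(element %in% y[[j]])) j <- j + 1
    if (j > length(y)) return(FALSE)
    j <- j + 1
  }
  if (!proper) return(TRUE)
  # proper = TRUE: report the relationship only if the sequences differ
  same <- length(x) == length(y) &&
    isTRUE(all(unlist(Map(setequal, x, y))))
  !same
}

x <- list("A", c("B", "C"))
y <- list(c("A", "D"), "E", c("B", "C", "F"))

is_subsequence(x, y)                 # TRUE
is_subsequence(y, x)                 # FALSE
is_subsequence(x, x)                 # TRUE
is_subsequence(x, x, proper = TRUE)  # FALSE, equality is omitted
```

Matching each itemset greedily to the earliest possible position is sufficient here, because an earlier match never rules out a later one.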

### Note

Computation of the longest common subsequence of two sequences of lengths `n` and `m` takes `O(n*m)` time.

The supported set of operations for the above matrix classes depends on package Matrix. In case of problems, expand to full storage representation using `as(x, "matrix")` or `as.matrix(x)`.

For efficiency, use `as(x, "dist")` to convert a symmetric result matrix for clustering.
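The expansion to full storage mentioned above can be sketched as follows. To stay self-contained, the example builds a small symmetric sparse matrix with package Matrix as a stand-in for a `similarity` result. The values and labels are made up for illustration, and turning similarities into dissimilarities as `1 - similarity` before clustering is an assumption of this sketch, which takes the base R route via `as.dist` rather than the `as(x, "dist")` coercion.

```r
library(Matrix)

# Stand-in for a symmetric similarity result (illustrative values only).
# With symmetric = TRUE only one triangle has to be supplied.
labels <- c("s1", "s2", "s3")
s <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 2, 3, 3),
  x = c(1, 0.5, 1, 0.25, 1),
  dims = c(3, 3),
  dimnames = list(labels, labels),
  symmetric = TRUE
)

# Expand to full storage if an operation is not supported on the sparse class.
dense <- as.matrix(s)

# ASSUMPTION: use 1 - similarity as the dissimilarity for clustering.
d <- as.dist(1 - dense)
hc <- hclust(d)
hc$merge
```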

### Author(s)

Christian Buchta

### See Also

Class `sequences`, method `dissimilarity`.

### Examples

```
library(arulesSequences)

## use example data
data(zaki)
z <- as(zaki, "timedsequences")
similarity(z)

## require equality
similarity(z, strict = TRUE)

## emphasize common
similarity(z, method = "dice")

## subsequence relationships
is.subset(z)
## omit equality
is.subset(z, proper = TRUE)
```

