similarity {sets}    R Documentation

Similarity and Dissimilarity Functions

Description

Similarities and dissimilarities for (generalized) sets.

Usage

set_similarity(x, y, method = "Jaccard")
gset_similarity(x, y, method = "Jaccard")
cset_similarity(x, y, method = "Jaccard")

set_dissimilarity(x, y,
                  method = c("Jaccard", "Manhattan", "Euclidean",
                             "L1", "L2"))
gset_dissimilarity(x, y,
                   method = c("Jaccard", "Manhattan", "Euclidean",
                              "L1", "L2"))
cset_dissimilarity(x, y,
                   method = c("Jaccard", "Manhattan", "Euclidean",
                              "L1", "L2"))

Arguments

x, y

Two (generalized/customizable) sets.

method

Character string specifying the proximity method (see below).

Details

For two generalized sets $X$ and $Y$, the Jaccard similarity is $|X \cap Y| / |X \cup Y|$, where $|\cdot|$ denotes the cardinality for generalized sets (sum of memberships). The Jaccard dissimilarity is 1 minus the similarity.
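The Jaccard formula can be checked by hand on small fuzzy sets. The following base R sketch does not call the sets package. It assumes that intersection and union take the pointwise minimum and maximum of the memberships (the usual Zadeh convention); the helper `membership()` and all variable names are hypothetical and used for illustration only.

```r
# Memberships stored as named numeric vectors (base R only; this is an
# illustration of the formula, not the sets package implementation).
A <- c(a = 0.3, b = 0.7, c = 0.9)
B <- c(c = 0.2, d = 0.4, e = 0.5)

# Hypothetical helper: membership of each element of the universe,
# with 0 for elements that are absent from the set.
membership <- function(s, universe) {
  m <- s[universe]      # absent names yield NA
  m[is.na(m)] <- 0
  unname(m)
}

universe <- union(names(A), names(B))
mA <- membership(A, universe)
mB <- membership(B, universe)

# Assumption: |X intersect Y| sums pointwise minima, |X union Y| sums
# pointwise maxima of the memberships.
jaccard_sim <- sum(pmin(mA, mB)) / sum(pmax(mA, mB))
jaccard_dis <- 1 - jaccard_sim
```

Under these assumptions the only shared element is "c", so the similarity is 0.2 / 2.8 (about 0.071) and the dissimilarity is about 0.929. If a different fuzzy logic is in effect, the package results may differ from this sketch.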

The L1 (or Manhattan) and L2 (or Euclidean) dissimilarities are defined as follows. For two fuzzy multisets $A$ and $B$ on a given universe $X$ with elements $x$, let $M_A(x)$ and $M_B(x)$ be functions returning the memberships of an element $x$ in sets $A$ and $B$, respectively. The memberships are returned in standard form, i.e., as an infinite vector of decreasing membership values, e.g., $(1, 0.3, 0, 0, \dots)$. Let $M_A(x)_i$ and $M_B(x)_i$ denote the $i$th components of these membership vectors. Then the L1 distance is defined as:

$$d_1(A, B) = \sum_{x \in X}\sum_{i=1}^{\infty}|M_A(x)_i - M_B(x)_i|$$

and the L2 distance as:

$$d_2(A, B) = \sqrt{\sum_{x \in X}\sum_{i=1}^{\infty}|M_A(x)_i - M_B(x)_i|^2}$$
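The two distance definitions can be sketched directly in base R. The function below is a hypothetical helper, not part of the sets package: it assumes each fuzzy multiset is given as a named list of membership vectors, sorts each vector into decreasing (standard) form, and pads the shorter vector with zeros, which stands in for the infinite trailing zeros of the definition.

```r
# Hypothetical helper: L1 and L2 distances between two fuzzy multisets,
# each represented as a named list of membership vectors.
fuzzy_multiset_dist <- function(A, B) {
  universe <- union(names(A), names(B))
  diffs <- unlist(lapply(universe, function(x) {
    # Absent elements have an empty membership vector.
    a <- if (is.null(A[[x]])) numeric(0) else A[[x]]
    b <- if (is.null(B[[x]])) numeric(0) else B[[x]]
    # Standard form: decreasing membership values.
    a <- sort(a, decreasing = TRUE)
    b <- sort(b, decreasing = TRUE)
    # Pad with zeros to a common length (finite stand-in for the
    # infinite vectors in the definition).
    n <- max(length(a), length(b))
    a <- c(a, numeric(n - length(a)))
    b <- c(b, numeric(n - length(b)))
    abs(a - b)
  }))
  c(L1 = sum(diffs), L2 = sqrt(sum(diffs^2)))
}

# Fuzzy multisets with the same memberships as the last example on this page.
A <- list(a = c(0.3, 0.7), b = 0.1, c = 0.9)
B <- list(c = 0.2, d = c(0.4, 0.5), e = 0.8)
d <- fuzzy_multiset_dist(A, B)
```

For these inputs the componentwise absolute differences are 0.7, 0.3, 0.1, 0.7, 0.5, 0.4 and 0.8, so the sketch gives an L1 distance of 3.5 and an L2 distance of sqrt(2.13), about 1.459.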

Value

A numeric value (similarity or dissimilarity, as specified).

Source

T. Matthe, R. De Caluwe, G. de Tre, A. Hallez, J. Verstraete, M. Leman, O. Cornelis, D. Moelants, and J. Gansemans (2006), Similarity Between Multi-valued Thesaurus Attributes: Theory and Application in Multimedia Systems, Flexible Query Answering Systems, Lecture Notes in Computer Science, Springer, 331–342.

K. Mizutani, R. Inokuchi, and S. Miyamoto (2008), Algorithms of Nonlinear Document Clustering Based on Fuzzy Multiset Model, International Journal of Intelligent Systems, 23, 176–198.

See Also

set.

Examples

A <- set("a", "b", "c")
B <- set("c", "d", "e")
set_similarity(A, B)
set_dissimilarity(A, B)

A <- gset(c("a", "b", "c"), c(0.3, 0.7, 0.9))
B <- gset(c("c", "d", "e"), c(0.2, 0.4, 0.5))
gset_similarity(A, B, "Jaccard")
gset_dissimilarity(A, B, "Jaccard")
gset_dissimilarity(A, B, "L1")
gset_dissimilarity(A, B, "L2")

A <- gset(c("a", "b", "c"), list(c(0.3, 0.7), 0.1, 0.9))
B <- gset(c("c", "d", "e"), list(0.2, c(0.4, 0.5), 0.8))
gset_similarity(A, B, "Jaccard")
gset_dissimilarity(A, B, "Jaccard")
gset_dissimilarity(A, B, "L1")
gset_dissimilarity(A, B, "L2")

[Package sets version 1.0-25 Index]