textcat_xdist {textcat} | R Documentation |
Cross-Distances Between N-Gram Profiles
Description
Compute cross-distances between collections of n-gram profiles.
Usage
textcat_xdist(x, p = NULL, method = "CT", ..., options = list())
Arguments
x |
a textcat profile db (see |
p |
|
method |
a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. |
... |
options to be passed to the method for computing distances. |
options |
a list of such options. |
Details
If x (or p) is not a profile db, the n-gram profiles of the individual text documents extracted from it are computed using the profile method and options in p if this is a profile db, and using the current textcat profile method and options otherwise.
Currently, the following distance methods for n-gram profiles are available.
"CT": the out-of-place measure of Cavnar and Trenkle.
"ranks": a variant of the Cavnar/Trenkle measure based on the aggregated absolute difference of the ranks of the combined n-grams in the two profiles.
"ALPD": the sum of the absolute differences in n-gram log frequencies.
"KLI": the Kullback-Leibler I-divergence I(p, q) = \sum_i p_i \log(p_i/q_i) of the n-gram frequency distributions p and q of the two profiles.
"KLJ": the Kullback-Leibler J-divergence J(p, q) = \sum_i (p_i - q_i) \log(p_i/q_i), the symmetrized variant I(p, q) + I(q, p) of the I-divergences.
"JS": the Jensen-Shannon divergence between the n-gram frequency distributions.
"cosine": the cosine dissimilarity between the profiles, i.e., one minus the inner product of the frequency vectors normalized to Euclidean length one (and filled with zeros for entries missing in one of the vectors).
"Dice": the Dice dissimilarity, i.e., the fraction of n-grams present in one of the profiles only.
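The two set- and vector-based measures above can be illustrated with a minimal base R sketch on toy data. This is not the package's implementation: the profiles are assumed to be named numeric vectors of n-gram frequencies, the helper names (fill_zero, cosine_dissim, dice_dissim) are hypothetical, and the Dice dissimilarity is assumed to follow the standard definition 1 - 2|A ∩ B| / (|A| + |B|).

```r
# Toy profiles (hypothetical data): named vectors of n-gram frequencies.
x <- c(th = 4, he = 3, an = 1)
y <- c(th = 2, an = 2, de = 5)

# Expand a profile to a common set of n-grams, filling missing entries with 0.
fill_zero <- function(v, grams) {
  out <- setNames(rep(0, length(grams)), grams)
  out[names(v)] <- v
  out
}

# Cosine dissimilarity: one minus the inner product of the frequency
# vectors after scaling each to Euclidean length one.
cosine_dissim <- function(x, y) {
  grams <- union(names(x), names(y))
  fx <- fill_zero(x, grams)
  fy <- fill_zero(y, grams)
  1 - sum((fx / sqrt(sum(fx^2))) * (fy / sqrt(sum(fy^2))))
}

# Dice dissimilarity (assumed standard form): share of n-grams, counted
# over both profiles, that occur in only one of them.
dice_dissim <- function(x, y) {
  common <- length(intersect(names(x), names(y)))
  1 - 2 * common / (length(x) + length(y))
}

cosine_dissim(x, y)
dice_dissim(x, y)
```

Both helpers ignore n-gram ranks entirely, which is what distinguishes them from the "CT" and "ranks" measures.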
For the measures based on distances of frequency distributions, n-grams of the two profiles are combined, and missing n-grams are given a small positive absolute frequency which can be controlled by option eps, and defaults to 1e-6.
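The role of eps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: profiles are taken to be named vectors of absolute n-gram frequencies, the combined vectors are assumed to be normalized to sum to one after filling, and the helper names (combine_profiles, kl_i, kl_j, js) are hypothetical. The Jensen-Shannon divergence is written in its common form, the average of the I-divergences to the midpoint distribution; the package may scale it differently.

```r
# Toy profiles (hypothetical data): absolute n-gram frequencies.
x <- c(th = 4, he = 3, an = 1)
y <- c(th = 2, an = 2, de = 5)

# Combine the n-grams of both profiles, give missing n-grams the small
# positive frequency eps, and normalize to relative frequencies (assumption).
combine_profiles <- function(x, y, eps = 1e-6) {
  grams <- union(names(x), names(y))
  fill <- function(v) {
    out <- setNames(rep(eps, length(grams)), grams)
    out[names(v)] <- v
    out / sum(out)
  }
  list(p = fill(x), q = fill(y))
}

# I(p, q) = sum_i p_i log(p_i / q_i)
kl_i <- function(p, q) sum(p * log(p / q))

# J(p, q) = sum_i (p_i - q_i) log(p_i / q_i) = I(p, q) + I(q, p)
kl_j <- function(p, q) sum((p - q) * log(p / q))

# Jensen-Shannon divergence via the midpoint distribution m = (p + q) / 2.
js <- function(p, q) {
  m <- (p + q) / 2
  (kl_i(p, m) + kl_i(q, m)) / 2
}

pq <- combine_profiles(x, y)
kl_i(pq$p, pq$q)
kl_j(pq$p, pq$q)
js(pq$p, pq$q)
```

Without the eps fill, any n-gram missing from one profile would give a zero frequency and make the logarithms in I and J infinite, which is why a strictly positive value is substituted.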
Options given in ... and options are combined, and merged with the default xdist options specified by the textcat option xdist_options using exact name matching.
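What merging by exact name means can be sketched with base R's modifyList. The helper merge_xdist_options and the default list below are hypothetical and only mimic the described behaviour; they are not taken from the package, and the example avoids giving the same option in both places because the precedence in that case is not stated here.

```r
# Hypothetical stand-in for the defaults stored in the textcat option
# xdist_options.
defaults <- list(eps = 1e-6)

# Combine options given via ... and via `options`, then overlay them on the
# defaults. modifyList() matches list entries by their exact names.
merge_xdist_options <- function(defaults, ..., options = list()) {
  given <- c(list(...), options)
  modifyList(defaults, given)
}

# An exactly matching name replaces the default value.
merged <- merge_xdist_options(defaults, eps = 1e-8)
merged[["eps"]]

# An abbreviated name does not match: "ep" is stored as a separate entry
# and the default eps is left unchanged.
abbreviated <- merge_xdist_options(defaults, options = list(ep = 1e-8))
abbreviated[["eps"]]
names(abbreviated)
```

The sketch reads results with [[ ]] rather than $, since $ on a list performs partial matching and would hide the difference between eps and ep.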
Examples
## Compute cross-distances between the TextCat byte profiles using the
## CT out-of-place measure.
d <- textcat_xdist(TC_byte_profiles)
## Visualize results of hierarchical cluster analysis on the distances.
plot(hclust(as.dist(d)), cex = 0.7)