textcat {textcat} | R Documentation |
N-Gram Based Text Categorization
Description
Categorize texts by computing their n-gram profiles and finding
the closest category n-gram profile.
Usage
textcat(x, p = textcat::TC_char_profiles, method = "CT", ...,
options = list())
Arguments
x |
a character vector of texts, or an R object which can be
coerced to a character vector. |
p |
a textcat profile db. By default, the TextCat character
profiles are used (see TC_char_profiles). |
method |
a character string specifying a built-in method, or a
user-defined function for computing distances between profiles. |
... |
options to be passed to the method for computing distances between profiles. |
options |
a list of such options. |
Details
For each given text, its n-gram profile is computed using the
options in the category profile db. Then the distance between this
profile and each category profile is computed, and the text is
assigned to the category of the closest profile (if the closest
profile is not unique, NA is returned).
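The steps above can be illustrated with a minimal base-R sketch. It is not the textcat implementation: the helpers ngram_profile(), out_of_place() and categorize() are hypothetical names introduced here, the toy profiles are built from single sentences, and the distance is a simplified version of the rank-order ("out-of-place") measure described by Cavnar and Trenkle (1994).

```r
# Illustrative sketch only; the package's own profile and distance
# computations differ in detail.

# Hypothetical helper: the most frequent character n-grams of a text,
# ordered from most to least frequent.
ngram_profile <- function(text, n = 1:3, size = 300L) {
  text <- tolower(text)
  grams <- unlist(lapply(n, function(k) {
    if (nchar(text) < k) return(character())
    starts <- seq_len(nchar(text) - k + 1L)
    substring(text, starts, starts + k - 1L)
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  head(names(freq), size)
}

# Hypothetical helper: sum of rank displacements between two profiles.
# N-grams absent from the category profile get a fixed penalty.
out_of_place <- function(doc, cat) {
  pos <- match(doc, cat)
  penalty <- length(cat)
  sum(ifelse(is.na(pos), penalty, abs(pos - seq_along(doc))))
}

# Hypothetical helper: name of the closest profile, or NA on a tie.
categorize <- function(text, profiles) {
  doc <- ngram_profile(text)
  d <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1))
  best <- names(d)[d == min(d)]
  if (length(best) == 1L) best else NA_character_
}

# Toy category profiles (assumed training text, far too small for real use).
profiles <- list(
  english = ngram_profile("the quick brown fox jumps over the lazy dog"),
  german  = ngram_profile("der schnelle braune fuchs springt ueber den faulen hund")
)
categorize("the dog and the fox", profiles)
```

The tie rule in categorize() mirrors the behaviour described above: when two categories are equally close, no category is chosen and NA is returned.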
Unless the profile db uses bytes rather than characters, the texts in
x should be encoded in UTF-8.
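As a hedged illustration of the encoding requirement, the following base-R sketch re-encodes a Latin-1 string to UTF-8 before it would be passed to textcat(). The input string is a made-up example, and the sketch assumes the source encoding is known to be Latin-1.

```r
# Hypothetical input: a word stored as Latin-1 bytes.
x <- "Stra\xdfe"
Encoding(x) <- "latin1"

# Convert to UTF-8 using the declared encoding.
x_utf8 <- enc2utf8(x)
Encoding(x_utf8)    # "UTF-8"
validUTF8(x_utf8)   # TRUE

# Alternative when the encoding is not declared on the string but is
# known from the data source: convert explicitly with iconv().
y <- iconv("Stra\xdfe", from = "latin1", to = "UTF-8")
```

If the encoding of the input cannot be determined, a byte-based profile db avoids the issue, as noted above.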
References
W. B. Cavnar and J. M. Trenkle (1994).
N-Gram-Based Text Categorization.
In “Proceedings of SDAIR-94, 3rd Annual Symposium on Document
Analysis and Information Retrieval”, 161–175.
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367
K. Hornik, P. Mair, J. Rauch, W. Geiger, C. Buchta and I. Feinerer
(2013).
The textcat Package for n-Gram Based Text Categorization in R.
Journal of Statistical Software, 52/6, 1–17.
doi:10.18637/jss.v052.i06.
Examples
textcat(c("This is an english sentence.",
"Das ist ein deutscher satz."))