textcat_profile_db {textcat}	R Documentation

Textcat Profile Dbs
Description

Create n-gram profile dbs for text categorization.
Usage

textcat_profile_db(x, id = NULL, method = NULL, ...,
                   options = list(), profiles = NULL)
Arguments
x |
a character vector of text documents, or an R object of text
documents extractable via |
id |
a character vector giving the categories of the texts to be
recycled to the length of |
method |
a character string specifying a built-in method, or a
user-defined function for computing distances between |
... |
options to be passed to the method for creating profiles. |
options |
a list of such options. |
profiles |
a textcat profile db object. |
Details

The text documents are split according to the given categories, and
n-gram profiles are computed using the specified method, with options
either those used for creating profiles (if profiles is not NULL), or
obtained by combining the options given in ... and options and merging
these with the default profile options specified by the textcat option
profile_options, using exact name matching. The method and options
employed for building the db are stored in the db as attributes
"method" and "options", respectively.
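The exact-name-matching merge described above can be sketched in base R. This is an illustrative reconstruction using modifyList(), not the package's actual code; the default option values shown are a subset of the real profile_options defaults.

```r
## Hypothetical sketch of merging user-supplied options with defaults
## by exact name matching (illustration only, not the package's code).
default_profile_options <- list(n = 1:5, tolower = TRUE, size = 1000L)
user_options <- list(n = 2:3, size = 500L)

## modifyList() replaces exactly those defaults whose names match
## user-supplied options, keeping the rest.
merged <- modifyList(default_profile_options, user_options)
str(merged)
```

Options not mentioned by the user (here tolower) keep their default values.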
There is a c method for combining profile dbs, provided that these
have identical options. There is also a [ method for subscripting, and
as.matrix and as.simple_triplet_matrix methods to "export" the
profiles to a dense matrix or to the sparse simple triplet matrix
representation provided by package slam, respectively.
Currently, the only available built-in method is "textcnt", which has
the following options:

n:	A numeric vector giving the numbers of characters or bytes in
	the n-gram profiles. Default: 1:5.

split:	The regular expression pattern to be used in word splitting.
	Default: "[[:space:][:punct:][:digit:]]+".

perl:	A logical indicating whether to use Perl-compatible regular
	expressions in word splitting. Default: FALSE.

tolower:	A logical indicating whether to transform texts to
	lower case (after word splitting). Default: TRUE.

reduce:	A logical indicating whether to employ a representation of
	n-grams more efficient than the one used by Cavnar and Trenkle.
	Default: TRUE.

useBytes:	A logical indicating whether to use byte n-grams
	rather than character n-grams. Default: FALSE.

ignore:	A character vector of n-grams to be ignored when computing
	n-gram profiles. Default: "_" (corresponding to a word
	boundary).

size:	The maximal number of n-grams used for a profile.
	Default: 1000L.
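The effect of these options can be sketched with a small self-contained base-R function. toy_profile() below is a hypothetical illustration of the general Cavnar-Trenkle-style computation (word splitting, lower-casing, "_" boundary markers, n-gram counting, truncation to size); the package itself delegates the real work to textcnt() in package tau.

```r
## Toy sketch of an n-gram profile computation with the options above
## (illustration only; not how the package actually computes profiles).
toy_profile <- function(x, n = 1:5,
                        split = "[[:space:][:punct:][:digit:]]+",
                        tolower = TRUE, ignore = "_", size = 1000L) {
  words <- unlist(strsplit(x, split))
  words <- words[nzchar(words)]
  if (tolower) words <- base::tolower(words)
  ## Mark word boundaries with "_", as in Cavnar & Trenkle (1994).
  words <- paste0("_", words, "_")
  ## All character n-grams of each requested length k.
  grams <- unlist(lapply(words, function(w)
    unlist(lapply(n[n <= nchar(w)], function(k)
      substring(w, seq_len(nchar(w) - k + 1L), seq(k, nchar(w)))))))
  grams <- grams[!grams %in% ignore]
  ## Keep at most `size` of the most frequent n-grams.
  head(sort(table(grams), decreasing = TRUE), size)
}

head(toy_profile("Hello world hello", n = 1:3))
```

Note how the default ignore = "_" drops the bare boundary marker, which would otherwise dominate the counts.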
This method uses textcnt in package tau for computing n-gram profiles,
with n, split, perl and useBytes corresponding to the respective
textcnt arguments, and option reduce setting argument marker as
needed. N-grams listed in option ignore are removed, and only the most
frequent remaining ones are retained, with the maximal number given by
option size.
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), text documents in x containing non-ASCII characters
must declare their encoding (see Encoding), and will be re-encoded to
UTF-8.
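Declaring an encoding is a one-liner with Encoding<-. The snippet below shows the declaration and the subsequent conversion to UTF-8 using base R's enc2utf8(); the conversion step is only meant to illustrate the re-encoding the package performs internally.

```r
## A Latin-1 encoded text ("façade" with byte 0xE7 for the c-cedilla).
x <- "fa\xe7ade"
## Declare the encoding so that re-encoding to UTF-8 is possible.
Encoding(x) <- "latin1"
## Base R's enc2utf8() illustrates the re-encoding step.
u <- enc2utf8(x)
Encoding(u)   # now marked as UTF-8
```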
Note that option n
specifies all numbers of characters
or bytes to be used in the profiles, and not just the maximal number:
e.g., taking n = 3
will create profiles only containing
tri-grams.
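The point that n lists all n-gram sizes, rather than a maximum, can be seen with a tiny self-contained helper (char_ngrams() is a hypothetical illustration, not a package function):

```r
## All character n-grams of a marked word, for each length in `n`
## (toy helper for illustration; not part of the textcat package).
char_ngrams <- function(w, n) {
  unlist(lapply(n, function(k)
    substring(w, seq_len(nchar(w) - k + 1L), seq(k, nchar(w)))))
}

table(nchar(char_ngrams("_text_", 3)))    # only 3-character n-grams
table(nchar(char_ngrams("_text_", 1:3)))  # 1-, 2- and 3-character n-grams
```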
Examples
## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files,
                function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1 : 10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)