dictionary {kgrams} | R Documentation |
Word dictionaries
Description
Construct or coerce to and from a dictionary.
Usage
dictionary(object, ...)
## S3 method for class 'kgram_freqs'
dictionary(object, size = NULL, cov = NULL, thresh = NULL, ...)
## S3 method for class 'character'
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  ...
)
## S3 method for class 'connection'
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)
as_dictionary(object)
## S3 method for class 'kgrams_dictionary'
as_dictionary(object)
## S3 method for class 'character'
as_dictionary(object)
## S3 method for class 'kgrams_dictionary'
as.character(x, ...)
Arguments
object |
object from which to extract a dictionary, or to be coerced to dictionary. |
... |
further arguments passed to or from other methods. |
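size |
either NULL or a positive integer. Fixed size of the dictionary. |
cov |
either NULL or a number between 0 and 1. Fixed text covering fraction of the dictionary. |
thresh |
either NULL or a positive integer. Minimum word count threshold for including a word in the dictionary. |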
.preprocess |
a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before creating the dictionary. |
max_lines |
a length one positive integer or Inf. |
batch_size |
a length one positive integer less than or equal to max_lines. |
x |
a dictionary. |
Details
These generic functions are used to build dictionary objects, or to coerce from other formats to dictionary, and from a dictionary to a character vector. At present, the only non-trivial type coercible to dictionary is character, in which case each entry of the input vector is treated as a single word. Coercion from dictionary to character returns the words included in the dictionary as a regular character vector.
Dictionaries can be extracted from kgram_freqs objects, or built from text coming either directly from a character vector or from a connection.
A single preprocessing transformation can be applied before the text is scanned for unique words. After preprocessing, anything delimited by one or more white space characters in the transformed text is counted as a word and may be added to the dictionary, subject to the constraints described below.
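The word-splitting rule described above can be approximated in base R. What follows is a minimal sketch, not the package's implementation: the helper split_words() is hypothetical, lowercasing is only one possible choice of preprocessing, and the regular expression [[:space:]]+ is an assumption about what counts as white space.

```r
# Hypothetical helper (not part of kgrams): apply a preprocessing function,
# then split on runs of white space and keep the unique non-empty tokens.
split_words <- function(text, .preprocess = identity) {
  text <- .preprocess(text)
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  unique(tokens[nzchar(tokens)])
}

txt <- c("The cat  sat", "the mat")
split_words(txt)                         # "The" "cat" "sat" "the" "mat"
split_words(txt, .preprocess = tolower)  # "the" "cat" "sat" "mat"
```

Without preprocessing, "The" and "the" are distinct words. Passing a function such as tolower as .preprocess merges them before the unique words are collected.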
The possible constraints for including a word in the dictionary can be of three types: (i) fixed size of dictionary, implemented by the size argument; (ii) fixed text covering fraction, as specified by the cov argument; or (iii) minimum word count threshold, set by the thresh argument. Only one of these constraints can be applied at a time, so specifying more than one of size, cov or thresh results in an error.
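To make the three constraints concrete, here is a base R sketch of one plausible reading of them. It is an illustration under stated assumptions, not the package's code: select_words() is a hypothetical helper, size is assumed to keep the most frequent words, cov is assumed to keep the smallest set of most frequent words whose cumulative share of all tokens reaches the requested fraction, and thresh is assumed to keep words occurring at least thresh times. The input is assumed to be a non-empty vector of tokens.

```r
# Hypothetical helper (not part of kgrams): choose dictionary words from a
# character vector of tokens under at most one of three constraints.
select_words <- function(tokens, size = NULL, cov = NULL, thresh = NULL) {
  n_set <- sum(!is.null(size), !is.null(cov), !is.null(thresh))
  if (n_set > 1) stop("only one of 'size', 'cov' or 'thresh' can be specified")

  # Word counts as a named integer vector, most frequent first
  # (ties broken alphabetically).
  tab <- table(tokens)
  counts <- setNames(as.integer(tab), names(tab))
  counts <- counts[order(-counts, names(counts))]

  if (!is.null(size)) {
    # (i) keep the 'size' most frequent words
    keep <- seq_len(min(size, length(counts)))
  } else if (!is.null(cov)) {
    # (ii) keep top words until their cumulative share of tokens reaches 'cov'
    share <- cumsum(counts) / sum(counts)
    keep <- seq_len(which(share >= cov)[1])
  } else if (!is.null(thresh)) {
    # (iii) keep words seen at least 'thresh' times
    keep <- which(counts >= thresh)
  } else {
    # no constraint: keep every word
    keep <- seq_along(counts)
  }
  names(counts)[keep]
}

# Toy input with counts a = 4, b = 3, c = 2, d = 1
tokens <- rep(c("a", "b", "c", "d"), times = c(4, 3, 2, 1))
select_words(tokens, size = 2)    # "a" "b"
select_words(tokens, cov = 0.5)   # "a" "b" (shares: 0.4, 0.7, ...)
select_words(tokens, thresh = 2)  # "a" "b" "c"
```

Calling select_words(tokens, size = 2, thresh = 2) stops with an error, mirroring the rule that only one constraint can be applied at a time.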
Value
A dictionary for dictionary() and as_dictionary(), a character vector for the as.character() method.
Author(s)
Valerio Gherardi
Examples
# Building a dictionary from Shakespeare's "Much Ado About Nothing"
dict <- dictionary(much_ado)
length(dict)
query(dict, "leonato") # TRUE
query(dict, c("thy", "thou")) # c(TRUE, TRUE)
query(dict, "smartphones") # FALSE
# Getting list of words as regular character vector
words <- as.character(dict)
head(words)
# Building a dictionary from a list of words
dict <- as_dictionary(c("i", "the", "a"))