tokens {mclm}        R Documentation
Create or coerce an object into class tokens
Description
tokenize() splits a text into a sequence of tokens, using regular expressions to identify them, and returns an object of the class tokens.
Usage
tokenize(
  x,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]"
)
Arguments
x: Either a character vector or an object of class NLP::TextDocument that contains the text to be tokenized.
re_drop_line: NULL or character vector of length 1. If NULL, the argument is ignored. Otherwise, lines in x that contain a match for re_drop_line are removed from x before the tokenization.
line_glue: NULL or character vector of length 1. If NULL, the argument is ignored. Otherwise, all lines in x (or all lines that survive the 'drop line' operation) are glued together into one character vector of length 1, with the string line_glue pasted in between consecutive lines. The value of line_glue can also be the empty string "". The 'line glue' operation is conducted immediately after the 'drop line' operation.
re_cut_area: NULL or character vector of length 1. If NULL, the argument is ignored. Otherwise, all matches for re_cut_area are removed from the text. The 'cut area' operation is conducted immediately after the 'line glue' operation (see the sketch after this argument list).
re_token_splitter: Regular expression or NULL. If not NULL, this regular expression identifies the locations where the text is split into tokens; its matches identify the areas between the tokens. If NULL, the argument re_token_extractor is used instead. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_token_extractor: Regular expression that identifies the locations of the actual tokens. This argument is only used if re_token_splitter is NULL. Whereas matches for re_token_splitter identify the areas between the tokens, matches for re_token_extractor identify the tokens themselves. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_drop_token: Regular expression or NULL. If NULL, the argument is ignored. Otherwise, tokens that contain a match for re_drop_token are removed from the results. The 'drop token' operation is conducted immediately after the 'token identification' operation.
re_token_transf_in: Regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument token_transf_out. If either of the two is NULL, both are ignored. Otherwise, all matches in the tokens for re_token_transf_in are replaced with the replacement string token_transf_out. The 'token transformation' operation is conducted immediately after the 'drop token' operation.
token_transf_out: Replacement string. This argument works together with re_token_transf_in and is ignored if either of the two arguments is NULL.
token_to_lower: Logical. Whether tokens must be converted to lowercase before returning the result. The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
perl: Logical. Whether the PCRE regular expression flavor is being used in the arguments that contain regular expressions.
ngram_size: Argument in support of ngrams/skipgrams (see also max_skip). If one wants to identify individual tokens, the value of ngram_size should be NULL, which also is its default value. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams, e.g. 2 for bigrams or 3 for trigrams.
max_skip: Argument in support of skipgrams. This argument is ignored if ngram_size is NULL. If ngram_size is not NULL and max_skip is 0, then regular ngrams are retrieved (although these may still contain open slots; see ngram_n_open). If ngram_size is not NULL and max_skip is greater than 0, then skipgrams are retrieved. For instance, if ngram_size is 3 and max_skip is 2, then 2-skip trigrams are retrieved, i.e. trigrams in which, between consecutive items, up to two tokens of the original text may have been skipped.
ngram_sep: Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
ngram_n_open: If ngram_size is some number higher than 1, and moreover ngram_n_open is a number higher than 0, then ngrams with 'open slots' are retrieved. Such ngrams with open slots are generalizations of fully lexically specific ngrams, in which one or more of the middle items is replaced by a notation that merely indicates the presence of some unspecified token. For instance, if ngram_size is 4 and ngram_n_open is 1, then, next to regular 4-grams, the output will also contain 4-grams with one open slot in a middle position. As a second example, if ngram_size is 5 and ngram_n_open is 2, then 5-grams with two open slots in middle positions are also retrieved (see Details for a concrete example).
ngram_open: Character string used to represent open slots in ngrams in the output of this function.
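The following is a minimal sketch of how the preprocessing arguments above interact; the input text and the regular expressions are made up for illustration:

# hypothetical input: a vector of lines with a header line and SGML-like tags
txt <- c("<header>sample</header>",
         "A first line of <b>text</b>.",
         "A second line of text.")
tokenize(txt,
         re_drop_line = "<header>",  # first, drop lines matching this pattern
         line_glue    = " ",         # then, glue the remaining lines together
         re_cut_area  = "<[^>]+>")   # then, cut the SGML-like tags out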
Details
If the output contains ngrams with open slots, then the order of the items in the output is no longer meaningful. For instance, let's imagine a case where ngram_size is 5 and ngram_n_open is 2. If the input contains a 5-gram "it_is_widely_accepted_that", then the output will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that" and "it_is_[]_[]_that". The relative order of these three items in the output must be considered arbitrary.
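The sketch below reproduces this example; the input sentence is made up for illustration:

tks <- tokenize("It is widely accepted that examples help.",
                ngram_size = 5, ngram_n_open = 2)
print(tks, n = 1000)
# the output contains, in an arbitrary relative order, items such as
# "it_[]_[]_accepted_that", "it_[]_widely_[]_that" and "it_is_[]_[]_that"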
Value
An object of class tokens, i.e. a sequence of tokens. It has a number of attributes and methods such as:

- base print(), as_data_frame(), summary() (which returns the number of items), sort() and rev(),
- an interactive explore() method,
- some getters, namely n_tokens() and n_types(),
- subsetting methods such as keep_types(), keep_pos(), etc., including [] subsetting (see brackets).
Additional manipulation functions include the trunc_at() method to truncate the sequence of tokens at a given match, tokens_merge() and tokens_merge_all() to combine token lists, and an as_character() method to convert to a character vector.
Objects of class tokens can be saved to file with write_tokens(); these files can be read with read_tokens().
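A brief sketch of some of these methods; the file name is hypothetical:

tks <- tokenize("The tiny corpus lives happily ever after.")
n_tokens(tks)                      # number of tokens
n_types(tks)                       # number of distinct types
head(as_character(tks))            # convert to a plain character vector
write_tokens(tks, "tokens.txt")    # save to file (hypothetical file name)
tks2 <- read_tokens("tokens.txt")  # read the file back in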
See Also
Examples
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."
tks <- tokenize(toy_corpus)
print(tks, n = 1000)
tks <- tokenize(toy_corpus, re_token_splitter = "\\W+")
print(tks, n = 1000)
sort(tks)
summary(tks)
tokenize(toy_corpus, ngram_size = 3)
tokenize(toy_corpus, ngram_size = 3, max_skip = 2)
tokenize(toy_corpus, ngram_size = 3, ngram_n_open = 1)
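## Additional sketches, not part of the original example set, illustrating
## re_drop_token, re_token_transf_in/token_transf_out, token_to_lower and
## re_token_extractor; the patterns are made up for illustration.
tokenize("room 101 is on floor 3", re_drop_token = "\\d")
tokenize(toy_corpus, token_to_lower = FALSE)
tokenize("it's a toy corpus", re_token_transf_in = "'", token_transf_out = "")
tokenize(toy_corpus, re_token_splitter = NULL, re_token_extractor = "[a-zA-Z]+")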