types {mclm} | R Documentation
Build a 'types' object
Description
This function builds an object of the class types.
Usage
types(
x,
re_drop_line = NULL,
line_glue = NULL,
re_cut_area = NULL,
re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
re_drop_token = NULL,
re_token_transf_in = NULL,
token_transf_out = NULL,
token_to_lower = TRUE,
perl = TRUE,
blocksize = 300,
verbose = FALSE,
show_dots = FALSE,
dot_blocksize = 10,
file_encoding = "UTF-8",
ngram_size = NULL,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
as_text = FALSE
)
Arguments
x |
Either a list of filenames of the corpus files (if as_text is FALSE)
or the actual text of the corpus (if as_text is TRUE). |
re_drop_line |
NULL or a regular expression. If not NULL, lines in the corpus that
match this expression are dropped from the analysis. |
line_glue |
NULL or a character vector. If not NULL, all lines of a corpus file are
glued together into a single string, with line_glue pasted between
consecutive lines. The 'line glue' operation is conducted immediately
after the 'drop line' operation. |
re_cut_area |
NULL or a regular expression. If not NULL, all matches for this
expression are removed ('cut') from the corpus before tokenization. |
re_token_splitter |
Regular expression or NULL. If not NULL, it identifies the areas
between the tokens, i.e. the locations where the corpus is split into
tokens. The 'token identification' operation is conducted immediately
after the 'cut area' operation. |
re_token_extractor |
Regular expression that identifies the locations of the actual tokens.
This argument is only used if re_token_splitter is NULL. The 'token
identification' operation is conducted immediately after the 'cut area'
operation. |
re_drop_token |
Regular expression or NULL. If not NULL, tokens that match this
expression are dropped. The 'drop token' operation is conducted
immediately after the 'token identification' operation. |
re_token_transf_in |
Regular expression that identifies areas in the tokens that are to be
transformed. This argument works together with the argument
token_transf_out. If both are NULL, no transformation is applied;
otherwise, all matches in the tokens for re_token_transf_in are
replaced by token_transf_out. The 'token transformation' operation is
conducted immediately after the 'drop token' operation. |
token_transf_out |
Replacement string. This argument works together with
re_token_transf_in and is ignored if re_token_transf_in is NULL. |
token_to_lower |
Logical. Whether tokens must be converted to lowercase before returning the result. The 'token to lower' operation is conducted immediately after the 'token transformation' operation. |
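The 'token transformation' and 'token to lower' steps can be pictured with a minimal base-R sketch. This is an illustration of the concept only, not mclm's actual implementation, and the transformation pattern shown is a made-up example:

```r
# Conceptual sketch of the 'token transformation' and 'token to lower'
# steps; not mclm's internal code. The pattern below is a made-up example.
tokens <- c("Walked", "talked", "jumped")

re_token_transf_in <- "ed$"  # hypothetical: strip a final "ed"
token_transf_out <- ""
tokens <- gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)

# token_to_lower = TRUE then lowercases the transformed tokens:
tolower(tokens)  # "walk" "talk" "jump"
```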
perl |
Logical. Whether the PCRE regular expression flavor is being used in the arguments that contain regular expressions. |
blocksize |
Number that indicates how many corpus files are read into memory at
each iteration. |
verbose |
If TRUE, progress information is printed to the console during the
operation. |
show_dots , dot_blocksize |
If show_dots is TRUE, dots are printed to the console to indicate
progress; dot_blocksize determines how many units of work are
represented by a single dot. |
file_encoding |
File encoding that is assumed in the corpus files. |
ngram_size |
Argument in support of ngrams/skipgrams. If one wants to identify
individual tokens, the value of ngram_size should be NULL; otherwise
it should be a single number indicating the size of the
ngrams/skipgrams. |
ngram_sep |
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function. |
ngram_n_open |
If ngram_size is some number n higher than 1, and ngram_n_open is a
number m lower than n, then skipgrams are retrieved instead of regular
ngrams: of the n positions in each item, m are 'open slots' that can be
filled by any token. Open slots are rendered in the output by means of
ngram_open. |
ngram_open |
Character string used to represent open slots in ngrams in the output of this function. |
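How ngram_size and ngram_sep shape the output can be sketched in base R. This is a conceptual illustration only, not mclm's implementation:

```r
# Conceptual sketch: building regular ngrams from a token sequence,
# linking tokens with ngram_sep. Not mclm's internal code.
tokens <- c("it", "lived", "happily", "ever", "after")
ngram_size <- 3
ngram_sep <- "_"
starts <- seq_len(length(tokens) - ngram_size + 1)
sapply(starts, function(i) {
  paste(tokens[i:(i + ngram_size - 1)], collapse = ngram_sep)
})
# yields "it_lived_happily", "lived_happily_ever", "happily_ever_after"
```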
as_text |
Logical. Whether x is to be interpreted as the actual text of the
corpus (TRUE) or as a list of filenames of the corpus files (FALSE). |
Details
The actual token identification is either based on the
re_token_splitter argument, a regular expression that identifies the
areas between the tokens, or on re_token_extractor, a regular
expression that identifies the areas that are the tokens themselves.
The first mechanism is the default: the argument re_token_extractor is
only used if re_token_splitter is NULL.
Currently the implementation of re_token_extractor is considerably less
time-efficient than that of re_token_splitter.
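The difference between the two mechanisms can be illustrated with base R, using the default patterns from the usage section. This is a conceptual sketch only, not mclm's implementation:

```r
# Conceptual sketch of the two token identification mechanisms,
# using the default patterns from the usage section. Not mclm's code.
txt <- "Once upon a time"

# Mechanism 1 (default): re_token_splitter matches the areas BETWEEN tokens.
splitter <- "[^_\\p{L}\\p{N}\\p{M}'-]+"
unlist(strsplit(txt, splitter, perl = TRUE))

# Mechanism 2 (used only when re_token_splitter is NULL):
# re_token_extractor matches the tokens themselves.
extractor <- "[_\\p{L}\\p{N}\\p{M}'-]+"
regmatches(txt, gregexpr(extractor, txt, perl = TRUE))[[1]]

# Both calls yield the same tokens here: "Once", "upon", "a", "time".
```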
Value
An object of the class types, which is based on a character vector.
It has additional attributes and methods such as:
- base methods print(), as_data_frame(), sort() and base::summary()
  (which returns the number of items and of unique items),
- subsetting methods such as keep_types(), keep_pos(), etc., including
  [] subsetting (see brackets).
An object of class types can be merged with another by means of
types_merge(), written to file with write_types() and read from file
with read_types().
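The merging and file operations listed above can be combined as follows. This is a sketch: the function names are those mentioned in this section, and "my_types.txt" is a hypothetical file name:

```r
# Sketch of merging and file I/O for types objects;
# "my_types.txt" is a hypothetical file name.
tps_a <- types("one tiny corpus", as_text = TRUE)
tps_b <- types("another tiny corpus", as_text = TRUE)
merged <- types_merge(tps_a, tps_b)

write_types(merged, "my_types.txt")
tps_back <- read_types("my_types.txt")
```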
See Also
Examples
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."
(tps <- types(toy_corpus, as_text = TRUE))
print(tps)
as.data.frame(tps)
as_tibble(tps)
sort(tps)
sort(tps, decreasing = TRUE)