freqlist {mclm} | R Documentation |
Build the frequency list of a corpus
Description
This function builds the word frequency list from a corpus.
Usage
freqlist(
  x,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  show_dots = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8",
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]",
  as_text = FALSE
)
Arguments
x |
Either a list of filenames of the corpus files (if as_text is FALSE) or the actual text of the corpus (if as_text is TRUE). |
re_drop_line |
|
line_glue |
|
re_cut_area |
|
re_token_splitter |
Regular expression or NULL. It identifies the areas between the tokens (see Details). The 'token identification' operation is conducted immediately after the 'cut area' operation. |
re_token_extractor |
Regular expression that identifies the locations of the actual tokens. This argument is only used if re_token_splitter is NULL (see Details). The 'token identification' operation is conducted immediately after the 'cut area' operation. |
re_drop_token |
Regular expression or NULL. |
re_token_transf_in |
Regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument token_transf_out. The 'token transformation' operation is conducted immediately after the 'drop token' operation. |
token_transf_out |
Replacement string. This argument works together with re_token_transf_in. |
token_to_lower |
Logical. Whether tokens must be converted to lowercase before returning the result. The 'token to lower' operation is conducted immediately after the 'token transformation' operation. |
perl |
Logical. Whether the PCRE regular expression flavor is being used in the arguments that contain regular expressions. |
blocksize |
Number that indicates how many corpus files are read to memory
|
verbose |
If |
show_dots , dot_blocksize |
If |
file_encoding |
File encoding that is assumed in the corpus files. |
ngram_size |
Argument in support of ngrams/skipgrams (see also max_skip). If one wants to identify individual tokens, the value of ngram_size can be left at NULL, the default. |
max_skip |
Argument in support of skipgrams. This argument is ignored if ngram_size is NULL. |
ngram_sep |
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function. |
ngram_n_open |
If For instance, if As a second example, if |
ngram_open |
Character string used to represent open slots in ngrams in the output of this function. |
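as_text |
Logical. Whether x is to be interpreted as the actual text of the corpus (if as_text is TRUE) or as the names of the corpus files (if as_text is FALSE). |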
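The argument descriptions above place three per-token operations in a fixed order: 'drop token', then 'token transformation', then 'token to lower'. The following is a minimal base-R sketch of that order, not mclm's implementation. The token vector and the three pattern values are hypothetical, and it assumes that re_drop_token removes the tokens it matches.

```r
# Base-R illustration of the documented order of operations.
# All values below are hypothetical stand-ins for freqlist() arguments.
tokens <- c("The", "colour", "of", "the", "3", "COLOURS")

re_drop_token      <- "^[0-9]+$"  # assumed meaning: drop tokens that match
re_token_transf_in <- "colour"    # area inside a token to transform
token_transf_out   <- "color"     # replacement string

# 1. 'drop token': remove tokens matching re_drop_token
tokens <- tokens[!grepl(re_drop_token, tokens, perl = TRUE)]

# 2. 'token transformation': replace matched areas inside each token
tokens <- gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)

# 3. 'token to lower': lowercase whatever is left
tokens <- tolower(tokens)

print(tokens)
```

In this sketch "COLOURS" is not transformed, because the case-sensitive pattern is applied before lowercasing. A pattern meant to apply regardless of case would need to say so itself, for example with an inline (?i) flag.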
Details
The actual token identification is based either on the re_token_splitter argument, a regular expression that identifies the areas between the tokens, or on re_token_extractor, a regular expression that identifies the areas that are the tokens.
The first mechanism is the default: the argument re_token_extractor is only used if re_token_splitter is NULL.
Currently the implementation of re_token_extractor is a lot less time-efficient than that of re_token_splitter.
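The two mechanisms can be compared in plain base R. This is a sketch for illustration only and says nothing about how mclm implements them. The two patterns are the defaults from the Usage section, and the sample line is taken from the toy corpus in the Examples.

```r
# Base-R illustration only; this is not mclm's implementation.
text <- "Once upon a time there was a tiny toy corpus."

# Default patterns copied from the Usage section.
splitter  <- "[^_\\p{L}\\p{N}\\p{M}'-]+"
extractor <- "[_\\p{L}\\p{N}\\p{M}'-]+"

# Mechanism 1: split the line on the areas *between* the tokens.
by_split <- strsplit(text, splitter, perl = TRUE)[[1]]
by_split <- by_split[nzchar(by_split)]  # discard any empty strings

# Mechanism 2: extract the areas that *are* the tokens.
by_extract <- regmatches(text, gregexpr(extractor, text, perl = TRUE))[[1]]

# With these complementary patterns both routes yield the same tokens.
print(identical(by_split, by_extract))

# Counting the lowercased tokens gives a small frequency table.
freqs <- sort(table(tolower(by_split)), decreasing = TRUE)
print(freqs)
```

The two default patterns are complements of each other, which is why the results agree here. A custom extractor can also keep material that a splitter would treat as a separator, such as punctuation marks.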
Value
An object of class freqlist, which is based on the class table.
It has additional attributes and methods such as:
• base print(), as_data_frame(), summary() and sort,
• an interactive explore() method,
• various getters, including tot_n_tokens(), n_types(), n_tokens(), values that are also returned by summary(), and more,
• subsetting methods such as keep_types(), keep_pos(), etc. including [] subsetting (see brackets).
Additional manipulation functions include type_freqs() to extract the frequencies of different items, freqlist_merge() to combine frequency lists, and freqlist_diff() to subtract a frequency list from another.
Objects of class freqlist can be saved to file with write_freqlist(); these files can be read with read_freqlist().
Examples
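# A toy corpus, passed as text rather than as filenames
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."

# Build the frequency list with the default settings and print it
(flist <- freqlist(toy_corpus, as_text = TRUE))
print(flist, n = 20)

# Convert to a data frame or a tibble
as.data.frame(flist)
as_tibble(flist)

# Summarize the frequency list
summary(flist)
print(summary(flist))

# Use a custom token splitter
t_splitter <- "(?xi) [:\\s.;,?!\"]+"
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         as_text = TRUE)

# Same splitter, but keep the original case of the tokens
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         token_to_lower = FALSE,
         as_text = TRUE)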
# Use a token extractor; it is only used when the splitter is NULL
t_extractor <- "(?xi) ( [:;?!] | [.]+ | [\\w'-]+ )"
freqlist(toy_corpus,
         re_token_splitter = NULL,
         re_token_extractor = t_extractor,
         as_text = TRUE)
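# Ngrams over the built-in letters vector: trigrams, then bigrams
# joined with a space instead of the default "_"
freqlist(letters, ngram_size = 3, as_text = TRUE)
freqlist(letters, ngram_size = 2, ngram_sep = " ", as_text = TRUE)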
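To make the role of ngram_size and ngram_sep in the last two calls more concrete, here is a rough base-R sketch of how contiguous ngrams can be assembled from a token sequence. The helper make_ngrams() is hypothetical and is not part of mclm. It treats its input as one uninterrupted token sequence, which may differ from how freqlist() handles separate items of x, and it does not cover skipgrams or open slots.

```r
# Base-R sketch of contiguous ngrams; not mclm's implementation.
# make_ngrams() is a hypothetical helper written for this illustration.
make_ngrams <- function(tokens, n, sep = "_") {
  # Fewer than n tokens means no ngram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Join each window of n consecutive tokens with the separator
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

trigrams <- make_ngrams(letters, 3)            # joined with "_"
bigrams  <- make_ngrams(letters, 2, sep = " ") # joined with a space

print(head(trigrams, 3))
print(head(bigrams, 3))
```

Each ngram is stored as a single string, so the separator should be a string that does not occur inside the tokens themselves.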