R: Build the frequency list of a corpus

freqlist {mclm}

R Documentation

Build the frequency list of a corpus

Description

This function builds the word frequency list from a corpus.

Usage

freqlist(
  x,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  show_dots = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8",
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]",
  as_text = FALSE
)

Arguments

`x`	Either a list of filenames of the corpus files (if `as_text` is `FALSE`) or the actual text of the corpus (if `as_text` is `TRUE`). If `as_text` is `TRUE` and the length of the vector `x` is greater than one, then each item in `x` is treated as a separate line (or a separate series of lines) in the corpus text. Within each item of `x`, the character `"\\n"` is also treated as a line separator.
`re_drop_line`	`NULL` or character vector. If `NULL`, it is ignored. Otherwise, a character vector (assumed to be of length 1) containing a regular expression. Lines in `x` that contain a match for `re_drop_line` are treated as not belonging to the corpus and are excluded from the results.
`line_glue`	`NULL` or character vector. If `NULL`, it is ignored. Otherwise, all lines in a corpus file (or in `x`, if `as_text` is `TRUE`) are glued together in one character vector of length 1, with the string `line_glue` pasted in between consecutive lines. The value of `line_glue` can also be equal to the empty string `""`. The 'line glue' operation is conducted immediately after the 'drop line' operation.
`re_cut_area`	`NULL` or character vector. If `NULL`, it is ignored. Otherwise, all matches for the regular expression `re_cut_area` in a corpus file (or in `x`, if `as_text` is `TRUE`) are 'cut out' of the text prior to the identification of the tokens in the text (and are therefore not taken into account when identifying the tokens). The 'cut area' operation is conducted immediately after the 'line glue' operation.
`re_token_splitter`	Regular expression or `NULL`. Regular expression that identifies the locations where lines in the corpus files are split into tokens. (See Details.) The 'token identification' operation is conducted immediately after the 'cut area' operation.
`re_token_extractor`	Regular expression that identifies the locations of the actual tokens. This argument is only used if `re_token_splitter` is `NULL`. (See Details.) The 'token identification' operation is conducted immediately after the 'cut area' operation.
`re_drop_token`	Regular expression or `NULL`. If `NULL`, it is ignored. Otherwise, it identifies tokens that are to be excluded from the results. Any token that contains a match for `re_drop_token` is removed from the results. The 'drop token' operation is conducted immediately after the 'token identification' operation.
`re_token_transf_in`	Regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument `token_transf_out`. If both `re_token_transf_in` and `token_transf_out` are neither `NULL` nor `NA`, then all matches, in the tokens, for the regular expression `re_token_transf_in` are replaced with the replacement string `token_transf_out`. The 'token transformation' operation is conducted immediately after the 'drop token' operation.
`token_transf_out`	Replacement string. This argument works together with `re_token_transf_in` and is ignored if `re_token_transf_in` is `NULL` or `NA`.
`token_to_lower`	Logical. Whether tokens must be converted to lowercase before returning the result. The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
`perl`	Logical. Whether the PCRE regular expression flavor is being used in the arguments that contain regular expressions.
`blocksize`	Number that indicates how many corpus files are read to memory 'at each individual step' during the steps in the procedure. Normally the default value of `300` should not be changed, but when one works with exceptionally small corpus files, it may be worthwhile to use a higher number, and when one works with exceptionally large corpus files, it may be worthwhile to use a lower number.
`verbose`	If `TRUE`, messages are printed to the console to indicate progress.
`show_dots`, `dot_blocksize`	If `show_dots` is `TRUE`, dots are printed to the console to indicate progress.
`file_encoding`	File encoding that is assumed in the corpus files.
`ngram_size`	Argument in support of ngrams/skipgrams (see also `max_skip`). If one wants to identify individual tokens, the value of `ngram_size` should be `NULL` or `1`. If one wants to retrieve token ngrams/skipgrams, `ngram_size` should be an integer indicating the size of the ngrams/skipgrams. E.g. `2` for bigrams, or `3` for trigrams, etc.
`max_skip`	Argument in support of skipgrams. This argument is ignored if `ngram_size` is `NULL` or is `1`. If `ngram_size` is `2` or higher, and `max_skip` is `0`, then regular ngrams are being retrieved (albeit that they may contain open slots; see `ngram_n_open`). If `ngram_size` is `2` or higher, and `max_skip` is `1` or higher, then skipgrams are being retrieved (which in the current implementation cannot contain open slots; see `ngram_n_open`). For instance, if `ngram_size` is `3` and `max_skip` is `2`, then 2-skip trigrams are being retrieved. Or if `ngram_size` is `5` and `max_skip` is `3`, then 3-skip 5-grams are being retrieved.
`ngram_sep`	Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
`ngram_n_open`	If `ngram_size` is `2` or higher, and moreover `ngram_n_open` is a number higher than `0`, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams (with the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token'). For instance, if `ngram_size` is `4` and `ngram_n_open` is `1`, and if moreover the input contains a 4-gram `"it_is_widely_accepted"`, then the output will contain all modifications of `"it_is_widely_accepted"` in which one (since `ngram_n_open` is `1`) of the items in this n-gram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain `"it_[]_widely_accepted"` and `"it_is_[]_accepted"`. As a second example, if `ngram_size` is `5` and `ngram_n_open` is `2`, and if moreover the input contains a 5-gram `"it_is_widely_accepted_that"`, then the output will contain `"it_[]_[]_accepted_that"`, `"it_[]_widely_[]_that"`, and `"it_is_[]_[]_that"`.
`ngram_open`	Character string used to represent open slots in ngrams in the output of this function.
`as_text`	Logical. Whether `x` is to be interpreted as a character vector containing the actual contents of the corpus (if `as_text` is `TRUE`) or as a character vector containing the names of the corpus files (if `as_text` is `FALSE`). If `as_text` is `TRUE`, then the arguments `blocksize`, `verbose`, `show_dots`, `dot_blocksize`, and `file_encoding` are ignored.
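
The argument descriptions above imply a fixed order of operations: drop line, line glue, cut area, token identification, drop token, token transformation, token to lower. The base R sketch below mimics that order on a small character vector. It is only an illustration under that assumption and not the package's implementation; `sketch_pipeline()`, `demo_lines` and `demo_counts` are hypothetical names, and only the splitter mechanism is covered.

```r
# Hypothetical helper that mimics the documented order of operations
# with base R only; it is not how mclm implements freqlist().
sketch_pipeline <- function(lines,
                            re_drop_line = NULL,
                            line_glue = NULL,
                            re_cut_area = NULL,
                            re_token_splitter = "[^_\\p{L}\\p{N}\\p{M}'-]+",
                            re_drop_token = NULL,
                            re_token_transf_in = NULL,
                            token_transf_out = NULL,
                            token_to_lower = TRUE) {
  # 1. drop line: remove lines that contain a match
  if (!is.null(re_drop_line)) {
    lines <- lines[!grepl(re_drop_line, lines, perl = TRUE)]
  }
  # 2. line glue: paste the remaining lines into one string
  if (!is.null(line_glue)) {
    lines <- paste(lines, collapse = line_glue)
  }
  # 3. cut area: delete every match from the text
  if (!is.null(re_cut_area)) {
    lines <- gsub(re_cut_area, "", lines, perl = TRUE)
  }
  # 4. token identification (splitter mechanism only)
  tokens <- unlist(strsplit(lines, re_token_splitter, perl = TRUE))
  tokens <- tokens[nzchar(tokens)]
  # 5. drop token: remove tokens that contain a match
  if (!is.null(re_drop_token)) {
    tokens <- tokens[!grepl(re_drop_token, tokens, perl = TRUE)]
  }
  # 6. token transformation: replace matches inside the tokens
  if (!is.null(re_token_transf_in) && !is.null(token_transf_out)) {
    tokens <- gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)
  }
  # 7. token to lower
  if (token_to_lower) tokens <- tolower(tokens)
  sort(table(tokens), decreasing = TRUE)
}

demo_lines <- c("<doc id='1'>", "The cat saw the <b>other</b> cat.", "</doc>")
demo_counts <- sketch_pipeline(demo_lines,
                               re_drop_line = "^</?doc",
                               re_cut_area = "<[^>]+>",
                               re_drop_token = "^saw$")
demo_counts
```

In this sketch the two `doc` lines are dropped first, the inline tags are cut next, and only then is the remaining text split into tokens, so neither the tag names nor the dropped token `saw` are counted.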

Details

The actual token identification is based either on the re_token_splitter argument, a regular expression that identifies the areas between the tokens, or on re_token_extractor, a regular expression that identifies the areas that are the tokens. The first mechanism is the default mechanism: the argument re_token_extractor is only used if re_token_splitter is NULL. Currently the implementation of re_token_extractor is a lot less time-efficient than that of re_token_splitter.
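
To make the difference between the two mechanisms concrete, the sketch below reproduces them with base R string functions rather than with freqlist() itself. The variable names and the sample sentence are illustrative assumptions; the two regular expressions are the documented defaults.

```r
txt <- "It's a well-known fact, isn't it?"

# Splitter mechanism: the regex matches the material BETWEEN tokens.
splitter <- "[^_\\p{L}\\p{N}\\p{M}'-]+"
by_split <- strsplit(txt, splitter, perl = TRUE)[[1]]
by_split <- by_split[nzchar(by_split)]  # drop empty strings at the edges

# Extractor mechanism: the regex matches the tokens themselves.
extractor <- "[_\\p{L}\\p{N}\\p{M}'-]+"
by_extract <- regmatches(txt, gregexpr(extractor, txt, perl = TRUE))[[1]]

# With complementary character classes both routes give the same tokens.
identical(by_split, by_extract)
```

The extractor route becomes useful when the tokens are not simply the complement of the separators, for instance when punctuation marks should be kept as tokens in their own right.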

Value

An object of class freqlist, which is based on the class table. It has additional attributes and methods such as:

base print(), as_data_frame(), summary() and sort(),
tibble::as_tibble(),
an interactive explore() method,
various getters, including tot_n_tokens(), n_types(), n_tokens(), values that are also returned by summary(), and more,
subsetting methods such as keep_types(), keep_pos(), etc. including [] subsetting (see brackets).

Additional manipulation functions include type_freqs() to extract the frequencies of different items, freqlist_merge() to combine frequency lists, and freqlist_diff() to subtract a frequency list from another.

Objects of class freqlist can be saved to file with write_freqlist(); these files can be read with read_freqlist().

Examples

toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."

(flist <- freqlist(toy_corpus, as_text = TRUE))
print(flist, n = 20)
as.data.frame(flist)
as_tibble(flist)
summary(flist) 
print(summary(flist))

t_splitter <- "(?xi) [:\\s.;,?!\"]+"
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         as_text = TRUE)
         
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         token_to_lower = FALSE,
         as_text = TRUE)

t_extractor <- "(?xi) ( [:;?!] | [.]+ | [\\w'-]+ )"
freqlist(toy_corpus,
         re_token_splitter = NULL,
         re_token_extractor = t_extractor,
         as_text = TRUE)

freqlist(letters, ngram_size = 3, as_text = TRUE)

freqlist(letters, ngram_size = 2, ngram_sep = " ", as_text = TRUE)
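
The open-slot behaviour described for `ngram_n_open` can be illustrated without the package. The base R sketch below enumerates, for a single ngram, every variant in which a given number of inner items is replaced by the open-slot notation. `open_slot_variants()` is a hypothetical helper, not part of mclm, and it makes no claim about how `freqlist()` computes these variants internally.

```r
# Hypothetical helper: all variants of one ngram in which exactly `n_open`
# of the inner items (never the first or last) are replaced by `open`.
open_slot_variants <- function(tokens, n_open, sep = "_", open = "[]") {
  inner <- seq_along(tokens)[-c(1, length(tokens))]
  if (n_open < 1 || n_open > length(inner)) return(character(0))
  # combn() enumerates every choice of n_open inner positions;
  # inner[idx] maps the choice back to positions in `tokens`.
  slots <- combn(length(inner), n_open, simplify = FALSE)
  vapply(slots, function(idx) {
    out <- tokens
    out[inner[idx]] <- open
    paste(out, collapse = sep)
  }, character(1))
}

four_gram <- c("it", "is", "widely", "accepted")
open_slot_variants(four_gram, n_open = 1)

five_gram <- c("it", "is", "widely", "accepted", "that")
open_slot_variants(five_gram, n_open = 2)
```

The two calls reproduce the variants listed in the description of `ngram_n_open`: two variants for the 4-gram with one open slot, and three for the 5-gram with two open slots.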

[Package mclm version 0.2.7 Index]