create_cooc {mclm} | R Documentation
Build collocation frequencies.
Description
These functions build surface or textual collocation frequencies for a specific node.
Usage
surf_cooc(
  x,
  re_node,
  w_left = 3,
  w_right = 3,
  re_boundary = NULL,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8"
)
text_cooc(
  x,
  re_node,
  re_boundary = NULL,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8"
)
Arguments
x
List of filenames of the corpus files.
re_node
Regular expression used for identifying instances of the 'node', i.e. the target item for which collocation information is collected.
w_left
Number of tokens to the left of the 'node' that are treated as belonging to the co-text of the 'node'. (But also see re_boundary.)
w_right
Number of tokens to the right of the 'node' that are treated as belonging to the co-text of the 'node'. (But also see re_boundary.)
re_boundary
Regular expression. For text_cooc(), a match for this expression marks the boundary between two consecutive 'texts'. For surf_cooc(), a match marks a position that the co-text windows (see w_left and w_right) are not allowed to cross.
re_drop_line
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, lines in the corpus that contain a match for this expression are dropped before any further processing.
line_glue
Character vector or NULL. If NULL, this argument is ignored. Otherwise, all lines of a corpus file are glued together into a single character string, with line_glue pasted in between consecutive lines. This value can also be equal to an empty string. The 'line glue' operation is conducted immediately after the 'drop line' operation.
re_cut_area
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, all matches for this expression are removed ('cut') from the corpus before the tokens are identified. The 'cut area' operation is conducted immediately after the 'line glue' operation.
re_token_splitter
Regular expression or NULL. If not NULL, this expression identifies the areas between the tokens, and the corpus is split on its matches in order to obtain the tokens. If NULL, re_token_extractor is used instead. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_token_extractor
Regular expression that identifies the locations of the actual tokens. It is only used if re_token_splitter is NULL. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_drop_token
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, tokens that contain a match for this expression are dropped. The 'drop token' operation is conducted immediately after the 'token identification' operation.
re_token_transf_in
A regular expression that identifies areas in the tokens that are to be transformed. This argument works together with token_transf_out; if either of them is NULL, no token transformation is conducted. Otherwise, all matches in the tokens for re_token_transf_in are replaced with token_transf_out. The 'token transformation' operation is conducted immediately after the 'drop token' operation.
token_transf_out
A 'replacement string'. This argument works together with re_token_transf_in and is ignored if that argument is NULL.
token_to_lower
Logical. Whether tokens should be converted to lowercase before returning the results. The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
perl
Logical. Whether the PCRE flavor of regular expressions should be used in the arguments that contain regular expressions.
blocksize
Number indicating how many corpus files are read into memory 'at each individual step' of the procedure. Normally the default value of 300 should not need to be changed, but lowering it can help when memory is limited.
verbose
Logical. If TRUE, messages reporting on progress are printed to the console during processing.
dot_blocksize
Number. If verbose is TRUE, a progress dot is printed to the console for every dot_blocksize corpus files that are processed.
file_encoding
Encoding of the input files. Either a character vector of length 1, in which case all files are assumed to use the same encoding, or a character vector of the same length as x, specifying the encoding of each individual file.
Details
Two major steps can be distinguished in the procedure conducted by these functions. The first major step is the identification of the (sequence of) tokens that, for the purpose of this analysis, will be considered to be the content of the corpus.
The function arguments that jointly determine the details of this step are re_drop_line, line_glue, re_cut_area, re_token_splitter, re_token_extractor, re_drop_token, re_token_transf_in, token_transf_out, and token_to_lower.
The sequence of tokens that is the ultimate outcome of this step is then
handed over to the second major step of the procedure.
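As an illustration, the token identification step can be approximated on a toy corpus with base R alone. The input lines, the drop-line pattern, and the XML-like cut-area pattern below are hypothetical; the token splitter is the default one shown in the usage section:

```r
# Toy sketch of the token identification step (base R only; made-up data).
lines <- c("<s>The PLANT absorbs water.</s>", "# a comment line")

lines  <- lines[!grepl("^#", lines, perl = TRUE)]             # 'drop line'
txt    <- paste(lines, collapse = "\n")                       # 'line glue'
txt    <- gsub("<[^>]*>", "", txt, perl = TRUE)               # 'cut area'
tokens <- unlist(strsplit(txt, "[^_\\p{L}\\p{N}\\p{M}'-]+",   # 'token identification'
                          perl = TRUE))
tokens <- tolower(tokens[tokens != ""])                       # 'token to lower'
tokens
# "the" "plant" "absorbs" "water"
```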
The second major step is the establishment of the co-occurrence frequencies.
The function arguments that jointly determine the details of this step are re_node and re_boundary for both functions, and w_left and w_right for surf_cooc() only.
It is important to know that this second step is conducted after the tokens of the corpus have been identified, and that it is applied to the sequence of tokens, not to the original text. More specifically, the regular expressions re_node and re_boundary are tested against individual tokens, as they are identified by the token identification procedure. Moreover, in surf_cooc(), the numbers w_left and w_right also apply to tokens as they are identified by the token identification procedure.
Value
An object of class cooc_info, containing information on co-occurrence frequencies.
Functions
- surf_cooc(): Build surface collocation frequencies.
- text_cooc(): Build textual collocation frequencies.
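For reference, a typical call could look as follows. The corpus file paths, the node expression, and the "<eot>" boundary marker here are hypothetical:

```r
library(mclm)

corpus_files <- c("corpus/novel_01.txt", "corpus/novel_02.txt")  # hypothetical

# Surface collocations: tokens within 3 positions left/right of the node
surf <- surf_cooc(corpus_files, re_node = "^walks?$",
                  w_left = 3, w_right = 3)

# Textual collocations: co-occurrence within the same text fragment,
# with fragments separated by a (hypothetical) "<eot>" boundary token
txts <- text_cooc(corpus_files, re_node = "^walks?$",
                  re_boundary = "^<eot>$")
```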