create_cooc {mclm} | R Documentation
Build collocation frequencies.
Description
These functions build surface or textual collocation frequencies for a specific node.
Usage
surf_cooc(
  x,
  re_node,
  w_left = 3,
  w_right = 3,
  re_boundary = NULL,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8"
)
text_cooc(
  x,
  re_node,
  re_boundary = NULL,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
  re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  blocksize = 300,
  verbose = FALSE,
  dot_blocksize = 10,
  file_encoding = "UTF-8"
)
Arguments
x
List of filenames of the corpus files.
re_node
Regular expression used for identifying instances of the 'node', i.e. the target item for which collocation information is collected.
w_left
Number of tokens to the left of the 'node' that are treated as belonging to the co-text of the 'node'. (But also see re_boundary.)
w_right
Number of tokens to the right of the 'node' that are treated as belonging to the co-text of the 'node'. (But also see re_boundary.)
re_boundary
Regular expression. For text_cooc(), a match for this expression marks the boundary between two consecutive 'texts'. For surf_cooc(), a match marks a position that the co-text windows (see w_left and w_right) are not allowed to cross.
re_drop_line
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, lines in the corpus that contain a match for this expression are dropped before any further processing.
line_glue
Character vector or NULL. If NULL, this argument is ignored. Otherwise, all lines of a corpus file are glued together into a single character string, with line_glue pasted in between consecutive lines. This value can also be equal to an empty string. The 'line glue' operation is conducted immediately after the 'drop line' operation.
re_cut_area
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, all matches for this expression are removed ('cut') from the corpus before the tokens are identified. The 'cut area' operation is conducted immediately after the 'line glue' operation.
re_token_splitter
Regular expression or NULL. If not NULL, this expression identifies the areas between the tokens, and the corpus is split on its matches in order to obtain the tokens. If NULL, re_token_extractor is used instead. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_token_extractor
Regular expression that identifies the locations of the actual tokens. It is only used if re_token_splitter is NULL. The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_drop_token
Regular expression or NULL. If NULL, this argument is ignored. Otherwise, tokens that contain a match for this expression are dropped. The 'drop token' operation is conducted immediately after the 'token identification' operation.
re_token_transf_in
A regular expression that identifies areas in the tokens that are to be transformed. This argument works together with token_transf_out; if either of them is NULL, no token transformation is conducted. Otherwise, all matches in the tokens for re_token_transf_in are replaced with token_transf_out. The 'token transformation' operation is conducted immediately after the 'drop token' operation.
token_transf_out
A 'replacement string'. This argument works together with re_token_transf_in and is ignored if that argument is NULL.
token_to_lower
Logical. Whether tokens should be converted to lowercase before returning the results. The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
perl
Logical. Whether the PCRE flavor of regular expressions should be used in the arguments that contain regular expressions.
blocksize
Number indicating how many corpus files are read into memory 'at each individual step' of the procedure. Normally the default value of 300 should not need to be changed, but lowering it can help when memory is limited.
verbose
Logical. If TRUE, messages reporting on progress are printed to the console during processing.
dot_blocksize
Number. If verbose is TRUE, a progress dot is printed to the console for every dot_blocksize corpus files that are processed.
file_encoding
Encoding of the input files. Either a character vector of length 1, in which case all files are assumed to use the same encoding, or a character vector of the same length as x, specifying the encoding of each individual file.
Details
Two major steps can be distinguished in the procedure conducted by these functions. The first major step is the identification of the (sequence of) tokens that, for the purpose of this analysis, will be considered to be the content of the corpus.
The function arguments that jointly determine the details of this step are re_drop_line, line_glue, re_cut_area, re_token_splitter, re_token_extractor, re_drop_token, re_token_transf_in, token_transf_out, and token_to_lower.
The sequence of tokens that is the ultimate outcome of this step is then
handed over to the second major step of the procedure.
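As an illustration, the token identification step can be approximated on a toy corpus with base R alone. The input lines, the drop-line pattern, and the XML-like cut-area pattern below are hypothetical; the token splitter is the default one shown in the usage section:

```r
# Toy sketch of the token identification step (base R only; made-up data).
lines <- c("<s>The PLANT absorbs water.</s>", "# a comment line")

lines  <- lines[!grepl("^#", lines, perl = TRUE)]             # 'drop line'
txt    <- paste(lines, collapse = "\n")                       # 'line glue'
txt    <- gsub("<[^>]*>", "", txt, perl = TRUE)               # 'cut area'
tokens <- unlist(strsplit(txt, "[^_\\p{L}\\p{N}\\p{M}'-]+",   # 'token identification'
                          perl = TRUE))
tokens <- tolower(tokens[tokens != ""])                       # 'token to lower'
tokens
# "the" "plant" "absorbs" "water"
```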
The second major step is the establishment of the co-occurrence frequencies.
The function arguments that jointly determine the details of this step are re_node and re_boundary for both functions, and w_left and w_right for surf_cooc() only.
It is important to know that this second step is conducted after the tokens of the corpus have been identified, and that it is applied to the sequence of tokens, not to the original text. More specifically, the regular expressions re_node and re_boundary are tested against individual tokens, as they are identified by the token identification procedure. Moreover, in surf_cooc(), the numbers w_left and w_right also apply to tokens as they are identified by the token identification procedure.
Value
An object of class cooc_info, containing information on co-occurrence frequencies.
Functions
- surf_cooc(): Build surface collocation frequencies.
- text_cooc(): Build textual collocation frequencies.
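For reference, a typical call could look as follows. The corpus file paths, the node expression, and the "<eot>" boundary marker here are hypothetical:

```r
library(mclm)

corpus_files <- c("corpus/novel_01.txt", "corpus/novel_02.txt")  # hypothetical

# Surface collocations: tokens within 3 positions left/right of the node
surf <- surf_cooc(corpus_files, re_node = "^walks?$",
                  w_left = 3, w_right = 3)

# Textual collocations: co-occurrence within the same text fragment,
# with fragments separated by a (hypothetical) "<eot>" boundary token
txts <- text_cooc(corpus_files, re_node = "^walks?$",
                  re_boundary = "^<eot>$")
```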