slma {mclm}    R Documentation
Stable lexical marker analysis
Description
This function conducts a stable lexical marker analysis.
Usage
slma(
x,
y,
file_encoding = "UTF-8",
sig_cutoff = qchisq(0.95, df = 1),
small_pos = 1e-05,
keep_intermediate = FALSE,
verbose = TRUE,
min_rank = 1,
max_rank = 5000,
keeplist = NULL,
stoplist = NULL,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
...
)
Arguments
x, y
Character vector of file names, or an object of class fnames, identifying the corpus files of the A-documents (x) and of the B-documents (y).

file_encoding
Encoding of all the files to read.

sig_cutoff
Numeric value indicating the cutoff value for 'significance' in the stable lexical marker analysis. The default value is qchisq(0.95, df = 1), i.e. roughly 3.84.

small_pos
Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to the haldane correction in assoc_scores(). If a contingency table contains zero frequencies, small_pos is the small positive value that is added to those cells.

keep_intermediate
Logical. If TRUE, the results of the intermediate keyword analyses are stored in the output object (see the intermediate element in the Value section).

verbose
Logical. Whether progress should be printed to the console during analysis.

min_rank, max_rank
Minimum and maximum frequency rank in the first corpus (x) of the items that are taken into consideration as candidate stable markers.

keeplist
List of types that must certainly be included in the list of candidate markers, regardless of their frequency rank and of whether they appear in stoplist.

stoplist
List of types that must not be included in the list of candidate markers; if a type is included in keeplist, however, keeplist takes precedence over stoplist.

ngram_size
Argument in support of ngrams/skipgrams (see also max_skip). If one wants to identify individual tokens, the value of ngram_size should be NULL or 1. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams (see the sketch after this argument list).

max_skip
Argument in support of skipgrams; it is only relevant when ngram_size is larger than 1. It indicates the maximum number of tokens that may be skipped between the tokens that make up an ngram. If max_skip is 0, regular (contiguous) ngrams are retrieved.

ngram_sep
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.

ngram_n_open
If ngram_size is larger than 1 and ngram_n_open is larger than 0, then ngrams with 'open slots' are retrieved; ngram_n_open indicates how many positions in each ngram are such open slots, i.e. positions at which any type may occur.

ngram_open
Character string used to represent open slots in ngrams in the output of this function.

...
Additional arguments.
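As a quick sketch of how some of these arguments combine (the corpora are those used in the Examples section below; the specific settings are merely illustrative):

library(mclm)

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))

# restrict the candidate markers to the 2000 most frequent items in a_corp
slma_top <- slma(a_corp, b_corp, min_rank = 1, max_rank = 2000)

# stable lexical markers at the level of bigrams instead of single tokens
slma_bigr <- slma(a_corp, b_corp, ngram_size = 2)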
Details
A stable lexical marker analysis of the A-documents in x versus the B-documents in y starts from a separate keyword analysis for all possible document couples (a, b), with a an A-document and b a B-document. If there are n A-documents and m B-documents, then n * m keyword analyses are conducted. The 'stability' of a linguistic item x, as a marker for the collection of A-documents (when compared to the B-documents), corresponds to the frequency and consistency with which x is found to be a keyword for the A-documents across all aforementioned keyword analyses.

In any specific keyword analysis, x is considered a keyword for the A-document if G_signed is positive and moreover p_G is less than sig_cutoff (see assoc_scores() for more information on the measures). Item x is considered a keyword for the B-document if G_signed is negative and moreover p_G is less than sig_cutoff.
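The following is a minimal sketch (not part of slma() itself) of the kind of keyword analysis that is run for one single document couple (a, b). It assumes that freqlist() accepts a one-file subset of an fnames object, and that the assoc_scores() result can be converted with as.data.frame() and contains the G_signed and p_G columns referenced above.

library(mclm)

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))

# frequency lists for one A-document and one B-document
flist_a <- freqlist(a_corp[1])
flist_b <- freqlist(b_corp[1])

# keyword analysis for this single (a, b) couple
scores <- as.data.frame(assoc_scores(flist_a, flist_b))

# keyword criterion as described above: G_signed positive (attraction to the
# A-document) and p_G below sig_cutoff
sig_cutoff <- qchisq(0.95, df = 1)
kw_a <- scores[scores$G_signed > 0 & scores$p_G < sig_cutoff, ]
head(kw_a)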
Value
An object of class slma, which is a named list with at least the following elements:

- A scores dataframe with information about the stability of the chosen lexical items (see below).
- An intermediate list with a register of intermediate values, if keep_intermediate was TRUE.
- Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x and y.
The slma object has as_data_frame() and print methods, as well as an ad-hoc details() method. Note that the print method simply prints the main dataframe.
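A brief usage sketch, assuming the slma_ex object created in the Examples section below:

slma_ex                            # print method: prints the main scores dataframe
scores <- as_data_frame(slma_ex)   # as_data_frame() method
head(scores)                       # stability measures, sorted by decreasing S_lor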
Contents of the scores element

The scores element is a dataframe of which the rows are the linguistic items for which the stable lexical marker analysis was conducted, and the columns are different 'stability measures' and related statistics. By default, the linguistic items are sorted by decreasing 'stability' according to the S_lor measure.
Column | Name | Computation | Range of values |
S_abs | Absolute stability | S_att - S_rep | -(n*m) -- (n*m) |
S_nrm | Normalized stability | S_abs / n*m | -1 -- 1 |
S_att | Stability of attraction | Number of (a,b) couples in which the linguistic item is a keyword for the A-documents | 0 -- n*m |
S_rep | Stability of repulsion | Number of (a,b) couples in which the linguistic item is a keyword for the B-documents | 0 -- n*m |
S_lor | Log of odds ratio stability | Mean of log_OR across all (a,b) couples but setting to 0 the value when p_G is larger than sig_cutoff | |
S_lor is thus computed as a fraction: its numerator is the sum of the log_OR values across all (a, b) couples for which p_G is lower than sig_cutoff, and its denominator is n * m.
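As a small worked illustration of that fraction, with made-up numbers (the vectors log_or and p_g below are hypothetical, with one entry per (a, b) couple, and 0.05 is merely an illustrative cutoff):

n <- 2; m <- 2                          # 2 A-documents and 2 B-documents
log_or <- c(1.2, 0.8, -0.3, 2.1)        # hypothetical log_OR per (a, b) couple
p_g    <- c(0.001, 0.20, 0.03, 0.0005)  # hypothetical p_G per (a, b) couple
cutoff <- 0.05                          # illustrative significance cutoff

# non-significant couples contribute 0; divide by the number of couples n * m
S_lor <- sum(log_or[p_g < cutoff]) / (n * m)
S_lor                                   # (1.2 - 0.3 + 2.1) / 4 = 0.75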
For more on log_OR, see the Value section of assoc_scores(). The final three columns of the output are meant as a tool in support of the interpretation of the log_OR column. Considering all (a, b) couples for which p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd are, for each item, the minimum, maximum and standard deviation of those log_OR values.
Examples
library(mclm)

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)