collocationAnalysis,KorAPConnection-method {RKorAPClient} | R Documentation |
Collocation analysis
Description
Performs a collocation analysis for the given node (or query) in the given virtual corpus.
Usage
## S4 method for signature 'KorAPConnection'
collocationAnalysis(
kco,
node,
vc = "",
lemmatizeNodeQuery = FALSE,
minOccur = 5,
leftContextSize = 5,
rightContextSize = 5,
topCollocatesLimit = 200,
searchHitsSampleLimit = 20000,
ignoreCollocateCase = FALSE,
withinSpan = ifelse(exactFrequencies, "base/s=s", ""),
exactFrequencies = TRUE,
stopwords = append(RKorAPClient::synsemanticStopwords(), node),
seed = 7,
expand = length(vc) != length(node),
maxRecurse = 0,
addExamples = FALSE,
thresholdScore = "logDice",
threshold = 2,
localStopwords = c(),
collocateFilterRegex = "^[:alnum:]+-?[:alnum:]*$",
...
)
Arguments
kco |
KorAPConnection object (obtained e.g. from new("KorAPConnection")) |
node |
target word |
vc |
string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is license-wise accessible. |
lemmatizeNodeQuery |
if TRUE, the node query will be lemmatized, i.e. searched as a lemma rather than as a surface form |
minOccur |
minimum absolute number of observed co-occurrences to consider a collocate candidate |
leftContextSize |
size of the left context window |
rightContextSize |
size of the right context window |
topCollocatesLimit |
limit analysis to the n most frequent collocates in the search hits sample |
searchHitsSampleLimit |
limit the size of the search hits sample |
ignoreCollocateCase |
logical, set to TRUE if collocate case should be ignored |
withinSpan |
KorAP span specification for collocations to be searched within |
exactFrequencies |
if FALSE, extrapolate observed co-occurrence frequencies from frequencies in search hits sample, otherwise retrieve exact co-occurrence frequencies |
stopwords |
vector of stopwords not to be considered as collocates |
seed |
seed for random page collecting order |
expand |
if TRUE, node and vc parameters are expanded to all of their combinations |
maxRecurse |
apply collocation analysis recursively maxRecurse times |
addExamples |
If TRUE, examples for instances of collocations will be added in a column |
thresholdScore |
association score function (see association-score-functions) whose value is compared against threshold |
threshold |
minimum value of the thresholdScore association score |
localStopwords |
vector of stopwords that will not be considered as collocates in the current function call, but that will not be passed to recursive calls |
collocateFilterRegex |
allow only collocates matching the regular expression |
... |
more arguments will be passed to collocationScoreQuery() |
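To make the interplay of leftContextSize, rightContextSize, minOccur, stopwords and ignoreCollocateCase more concrete, here is a minimal toy sketch of window-based collocate counting on a few in-memory sentences. It is not the RKorAPClient implementation: the helper name count_window_collocates, the whitespace tokenization and the example sentences are all assumptions made for illustration.

```r
# Toy illustration only; NOT the RKorAPClient implementation.
# count_window_collocates() is a hypothetical helper name.
count_window_collocates <- function(sentences, node,
                                    leftContextSize = 5, rightContextSize = 5,
                                    minOccur = 1, stopwords = character(0),
                                    ignoreCollocateCase = FALSE) {
  collocates <- character(0)
  for (s in sentences) {
    tokens <- strsplit(s, "\\s+")[[1]]          # naive whitespace tokenization
    for (i in which(tokens == node)) {
      # tokens inside the left context window
      left <- if (leftContextSize > 0 && i > 1) {
        tokens[max(1, i - leftContextSize):(i - 1)]
      } else character(0)
      # tokens inside the right context window
      right <- if (rightContextSize > 0 && i < length(tokens)) {
        tokens[(i + 1):min(length(tokens), i + rightContextSize)]
      } else character(0)
      collocates <- c(collocates, left, right)
    }
  }
  if (ignoreCollocateCase) collocates <- tolower(collocates)
  collocates <- collocates[!collocates %in% stopwords]  # drop stopwords
  counts <- sort(table(collocates), decreasing = TRUE)
  counts[counts >= minOccur]                    # keep sufficiently frequent candidates
}

sentences <- c("eine Packung Zigaretten kaufen",
               "die Packung Zigaretten ist leer",
               "eine Packung Kekse essen")
result <- count_window_collocates(sentences, "Packung",
                                  leftContextSize = 1, rightContextSize = 1,
                                  minOccur = 2, stopwords = c("die"))
print(result)
```

With a window of one token on each side, "Kekse" is dropped by minOccur = 2 and "die" by the stopword list, so only "eine" and "Zigaretten" remain as candidates.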
Details
Because some of the required functionality is not yet provided by the KorAP backend, the collocation analysis is currently implemented on the client side. This makes it very slow (from several minutes up to hours), but also very flexible. You can, for example, perform the analysis in arbitrary virtual corpora, use complex node queries, and look for expression-internal collocates using the focus function (see examples and demo).
To increase speed at the cost of accuracy and possible false negatives, you can decrease searchHitsSampleLimit and/or topCollocatesLimit and/or set exactFrequencies to FALSE.
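As a rough sketch of what extrapolating from the search hits sample can mean when exactFrequencies is FALSE, the following assumes simple linear scaling from the sample to the total number of hits. The helper name extrapolate_frequency and the linear scaling itself are assumptions for illustration, not the documented behavior of the package.

```r
# Hypothetical helper; linear scaling is an assumption, not the package's documented method.
extrapolate_frequency <- function(sampleCount, sampleSize, totalHits) {
  stopifnot(sampleSize > 0, sampleSize <= totalHits)
  # scale the count seen in the sample up to the full set of hits
  round(sampleCount * totalHits / sampleSize)
}

# 150 co-occurrences observed in a sample of 20,000 out of 100,000 total hits
estimate <- extrapolate_frequency(150, 20000, 100000)
print(estimate)  # 750
```

An estimate of this kind avoids one exact frequency query per collocate candidate, which is where the speed gain would come from, but rare collocates that happen to be missing from the sample are lost.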
Note that the analysis currently does not use the tokenization provided by the backend, i.e. by the corpus itself, but an improvised client-side one. This can also lead to false negatives and to frequencies that differ from the corresponding ones obtained via the web user interface.
Value
Tibble with top collocates, association scores, corresponding URLs for web user interface queries, etc.
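The returned table can be filtered on its association score columns, as the first example does with logDice. The sketch below works on a mock data frame instead of a real query result: the frequency values and the columns O, f_node and f_collocate are invented for illustration, and logDiceSketch implements the textbook logDice definition, which may differ in detail from the package's own score function.

```r
# Textbook logDice: 14 + log2(2 * O / (f_node + f_collocate)).
# logDiceSketch is a hypothetical name; the package's own function may differ.
logDiceSketch <- function(O, f_node, f_collocate) {
  14 + log2(2 * O / (f_node + f_collocate))
}

# Mock candidates with invented frequencies (not real corpus data)
candidates <- data.frame(
  collocate   = c("Zigaretten", "Kekse", "und"),
  O           = c(50, 8, 40),          # observed co-occurrences
  f_node      = c(100, 100, 100),      # frequency of the node
  f_collocate = c(100, 412, 163740),   # frequency of the collocate
  stringsAsFactors = FALSE
)
candidates$logDice <- with(candidates, logDiceSketch(O, f_node, f_collocate))

# Keep only collocates at or above a chosen score threshold
top <- candidates[candidates$logDice >= 5, ]
print(top[, c("collocate", "logDice")])
```

The very frequent word "und" co-occurs often in absolute terms but scores low, which is the effect a score threshold is meant to exploit.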
See Also
Other collocation analysis functions: association-score-functions, collocationScoreQuery,KorAPConnection-method, synsemanticStopwords()
Examples
## Not run:
# Find top collocates of "Packung" inside and outside the sports domain.
new("KorAPConnection", verbose = TRUE) %>%
  collocationAnalysis("Packung", vc = c("textClass=sport", "textClass!=sport"),
                      leftContextSize = 1, rightContextSize = 1, topCollocatesLimit = 20) %>%
  dplyr::filter(logDice >= 5)
## End(Not run)
## Not run:
# Identify the most prominent light verb construction with "in ... setzen".
# Note that, currently, the use of the focus function disallows exactFrequencies.
new("KorAPConnection", verbose = TRUE) %>%
  collocationAnalysis("focus(in [tt/p=NN] {[tt/l=setzen]})",
                      leftContextSize = 1, rightContextSize = 0,
                      exactFrequencies = FALSE, topCollocatesLimit = 20)
## End(Not run)