query {koRpus} | R Documentation |
A method to get information out of koRpus objects
Description
The method query returns query information from objects of classes kRp.corp.freq and kRp.text.
Usage
query(obj, ...)
## S4 method for signature 'kRp.corp.freq'
query(
obj,
var = NULL,
query,
rel = "eq",
as.df = TRUE,
ignore.case = TRUE,
perl = FALSE,
regexp_var = "word"
)
## S4 method for signature 'kRp.text'
query(
obj,
var,
query,
rel = "eq",
as.df = TRUE,
ignore.case = TRUE,
perl = FALSE,
regexp_var = "token"
)
## S4 method for signature 'data.frame'
query(
obj,
var,
query,
rel = "eq",
as.df = TRUE,
ignore.case = TRUE,
perl = FALSE,
regexp_var = "token"
)
Arguments
obj |
An object of class |
... |
Optional arguments, see above. |
var |
A character string naming a variable in the object (i.e., colname). If set to
|
query |
A character vector (for words), regular expression, or single number naming values to be matched in the variable. Can also be a vector of two numbers to query a range of frequency data, or a list of named lists for multiple queries (see "Query lists" section in details). |
rel |
A character string defining the relation of the queried value and desired results.
Must either be |
as.df |
Logical, if |
ignore.case |
Logical, passed through to |
perl |
Logical, passed through to |
regexp_var |
A character string naming the column to query if |
Details
kRp.corp.freq: Depending on the setting of the var parameter, query will return entries with a matching character string (var="word"), or all entries of the desired frequency (see the examples). A special case is the need for a range of frequencies, which can be achieved by providing a numerical vector of two values as the query value, for start and end of the range, respectively. In these cases, if rel is set to "gt" or "lt", the given range borders are excluded, otherwise they will be included as true matches.
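The border handling for range queries can be illustrated with a minimal base R sketch. It does not call koRpus: the data frame freq_df and the helper range_filter() are hypothetical and only mimic the inclusive/exclusive behaviour described above, they are not the package's implementation.

```r
# Toy frequency table; the column names mirror those used in the examples,
# but the values are invented for illustration.
freq_df <- data.frame(
  word  = c("a", "b", "c", "d"),
  log10 = c(4.2, 4.5, 4.7, 5.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of koRpus): keep rows whose value lies
# within the given range; the borders are excluded only if rel is
# "gt" or "lt", and included otherwise.
range_filter <- function(df, var, range, rel = "eq") {
  start <- range[1]
  end <- range[2]
  if (rel %in% c("gt", "lt")) {
    keep <- df[[var]] > start & df[[var]] < end
  } else {
    keep <- df[[var]] >= start & df[[var]] <= end
  }
  df[keep, , drop = FALSE]
}

# borders included: rows "a", "b" and "c"
inclusive <- range_filter(freq_df, "log10", c(4.2, 4.7))
# borders excluded: row "b" only
exclusive <- range_filter(freq_df, "log10", c(4.2, 4.7), rel = "gt")
```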
kRp.text: var can be any of the variables in slot tokens. If rel="num", a vector with the row numbers in which the query was found is returned.
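As a rough illustration of what a vector of row numbers looks like, the following base R sketch uses which() on a toy token table. The data frame tokens_df is hypothetical and stands in for the tokens slot; this is an analogy, not the code koRpus runs internally.

```r
# Toy stand-in for the tokens of a tokenized text (invented values).
tokens_df <- data.frame(
  token = c("the", "cat", "saw", "the", "dog"),
  stringsAsFactors = FALSE
)

# Row numbers of all rows whose token equals "the".
rows <- which(tokens_df$token == "the")
```

Here rows holds the positions 1 and 4, which could then be used to index the original table.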
Value
Depending on the arguments, the result might be a whole object, a list, a single value, etc.
Query lists
You can combine an arbitrary number of queries in a simple way by providing a list of named lists to the query parameter, where each list contains one query request. In each list, the first element name represents the var value of the request, and its value is taken as the query argument. You can also assign rel, ignore.case and perl for each request individually, and if you don't, the settings of the main query call are taken as default (as.df only applies to the final query). The filters will be applied in the order given, i.e., the second query will be made to the results of the first.
This method calls subset, which might actually be even more flexible if you need more control.
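The chained filtering described above can be approximated with plain subset() calls. The sketch below is an assumption about the effect of a two-step query list (a frequency range, then a minimum number of letters), using an invented data frame freq_df; it does not call koRpus and is not the package's implementation.

```r
# Toy frequency table with invented values; the column names follow
# the examples (pmio = frequency per million, lttr = number of letters).
freq_df <- data.frame(
  word = c("the", "of", "from", "which", "an"),
  pmio = c(60000, 24000, 12000, 11000, 9000),
  lttr = c(3, 2, 4, 5, 2),
  stringsAsFactors = FALSE
)

# First filter: frequency per million within 10000 and 25000 (borders included).
step1 <- subset(freq_df, pmio >= 10000 & pmio <= 25000)
# Second filter, applied to the result of the first: at least four letters.
step2 <- subset(step1, lttr >= 4)
```

Because each step only sees the rows that survived the previous one, the order of the requests matters for intermediate results, even though simple conditions like these lead to the same final rows in either order.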
See Also
Examples
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  en_corp <- read.corp.custom(
    tokenized.obj,
    caseSens=FALSE
  )

  # look up frequencies for the word "winner"
  query(en_corp, var="word", query="winner")

  # show all entries with a frequency of exactly 3 in the corpus
  query(en_corp, "freq", 3)

  # now, which tokens appear more than 40000 times in a million?
  query(en_corp, "pmio", 40000, "gt")

  # example for a range request: tokens with a log10 between 4.2 and 4.7
  # (including these two values)
  query(en_corp, "log10", c(4.2, 4.7))
  # (and without them)
  query(en_corp, "log10", c(4.2, 4.7), "gt")

  # example for a list of queries: get words with a frequency between
  # 10000 and 25000 per million and at least four letters
  query(en_corp, query=list(
    list(pmio=c(10000, 25000)),
    list(lttr=4, rel="ge"))
  )

  # get all instances of "the" in a tokenized text object
  query(tokenized.obj, "token", "the")
} else {}