R: A method to get information out of koRpus objects

query {koRpus}

R Documentation

A method to get information out of koRpus objects

Description

The method query retrieves information from objects of class kRp.corp.freq or kRp.text. A method for data frames is provided as well.

Usage

query(obj, ...)

## S4 method for signature 'kRp.corp.freq'
query(
  obj,
  var = NULL,
  query,
  rel = "eq",
  as.df = TRUE,
  ignore.case = TRUE,
  perl = FALSE,
  regexp_var = "word"
)

## S4 method for signature 'kRp.text'
query(
  obj,
  var,
  query,
  rel = "eq",
  as.df = TRUE,
  ignore.case = TRUE,
  perl = FALSE,
  regexp_var = "token"
)

## S4 method for signature 'data.frame'
query(
  obj,
  var,
  query,
  rel = "eq",
  as.df = TRUE,
  ignore.case = TRUE,
  perl = FALSE,
  regexp_var = "token"
)

Arguments

`obj`	An object of class `kRp.corp.freq`, `kRp.text`, or `data.frame`.
`...`	Optional arguments, see above.
`var`	A character string naming a variable in the object (i.e., colname). If set to `"regexp"`, `grepl` is called on the column specified by `regexp_var`.
`query`	A character vector (for words), regular expression, or single number naming values to be matched in the variable. Can also be a vector of two numbers to query a range of frequency data, or a list of named lists for multiple queries (see the "Query lists" section).
`rel`	A character string defining the relation of the queried value and desired results. Must be one of `"eq"` (equal, the default), `"gt"` (greater than), `"ge"` (greater or equal), `"lt"` (less than) or `"le"` (less or equal). If `var="word"`, `rel` is always interpreted as `"eq"`.
`as.df`	Logical, if `TRUE`, returns a data.frame, otherwise an object of the input class. Ignored if `obj` is a data frame already.
`ignore.case`	Logical, passed through to `grepl` if `var="regexp"`.
`perl`	Logical, passed through to `grepl` if `var="regexp"`.
`regexp_var`	A character string naming the column to query if `var="regexp"`.
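
The following base R sketch illustrates what `var="regexp"` does according to the argument descriptions above: `grepl` is run on the column named by `regexp_var`, with `ignore.case` and `perl` passed through. The toy data frame and its column names are assumptions made for illustration, and the code is not the package's actual implementation.

```r
# Toy stand-in for a token table (hypothetical data, not produced by koRpus)
tokens <- data.frame(
  token = c("The", "theory", "of", "the", "Other"),
  tag = c("DT", "NN", "IN", "DT", "JJ"),
  stringsAsFactors = FALSE
)

# Roughly what var="regexp", query="^the", regexp_var="token" selects
# with the default ignore.case=TRUE and perl=FALSE
hits <- grepl("^the", tokens[["token"]], ignore.case = TRUE, perl = FALSE)
regexp_matches <- tokens[hits, ]

# The same pattern, but case sensitive (ignore.case=FALSE)
hits_cs <- grepl("^the", tokens[["token"]], ignore.case = FALSE, perl = FALSE)
regexp_matches_cs <- tokens[hits_cs, ]
```

Here `regexp_matches` keeps the rows for "The", "theory" and "the", while `regexp_matches_cs` drops the capitalized "The".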

Details

kRp.corp.freq: Depending on the setting of the var parameter, the method will return entries with a matching character string (var="word"), or all entries of the desired frequency (see the examples). A special case is a query for a range of frequencies, which can be achieved by providing a numerical vector of two values as the query value, for start and end of the range, respectively. In these cases, if rel is set to "gt" or "lt", the given range borders are excluded, otherwise they are included as true matches.
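
The border handling of range queries can be sketched in base R. The example below only mirrors the semantics described above on a toy data frame (its contents are assumptions); it does not call koRpus and is not how the method is implemented internally.

```r
# Toy frequency table (hypothetical values)
freq_table <- data.frame(
  word = c("a", "b", "c", "d", "e"),
  freq = c(1, 3, 5, 7, 9),
  stringsAsFactors = FALSE
)
range_query <- c(3, 7)  # start and end of the range

# Default rel="eq" (also "ge" or "le"): borders count as matches
inclusive <- subset(
  freq_table,
  freq >= range_query[1] & freq <= range_query[2]
)

# rel="gt" or rel="lt": borders are excluded
exclusive <- subset(
  freq_table,
  freq > range_query[1] & freq < range_query[2]
)
```

With these values, `inclusive` contains the words "b", "c" and "d", whereas `exclusive` contains only "c".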

kRp.text: var can be any of the variables in slot tokens. If rel="num", a vector with the row numbers in which the query was found is returned.

Value

Depending on the arguments, the result might be a whole object, a list, a single value, etc.

Query lists

You can combine an arbitrary number of queries in a simple way by providing a list of named lists to the query parameter, where each list contains one query request. In each list, the first element name represents the var value of the request, and its value is taken as the query argument. You can also assign rel, ignore.case and perl for each request individually, and if you don't, the settings of the main query call are taken as default (as.df only applies to the final query). The filters will be applied in the order given, i.e., the second query will be made to the results of the first.
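
To illustrate the order of evaluation, the following base R sketch reproduces by hand what a query list such as `list(list(pmio=c(10000, 25000)), list(lttr=4, rel="ge"))` is described to do: the second filter only sees the rows that survived the first. The data frame is made up for this example, and the code is a simplified stand-in, not the package's implementation.

```r
# Toy frequency table (hypothetical values)
freq_table <- data.frame(
  word = c("of", "and", "that", "with", "winner"),
  pmio = c(30000, 24000, 12000, 9000, 15000),
  lttr = c(2, 3, 4, 4, 6),
  stringsAsFactors = FALSE
)

# First request, like list(pmio=c(10000, 25000)):
# a range with the default rel="eq", so borders are included
step_one <- subset(freq_table, pmio >= 10000 & pmio <= 25000)

# Second request, like list(lttr=4, rel="ge"):
# applied to the result of the first request, not to the full table
step_two <- subset(step_one, lttr >= 4)
```

`step_one` retains "and", "that" and "winner"; `step_two` then narrows this down to "that" and "winner". The word "with" has four letters but was already removed by the first filter.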

This method calls subset, which might actually be even more flexible if you need more control.
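
As one example of that extra flexibility, `subset` accepts arbitrary logical expressions, including alternatives joined with `|`, while a chain of filters narrows the result at each step. The sketch below uses a made-up data frame and plain base R; whether a comparable condition can be expressed with a query list is not stated here, so treat it only as an illustration of what `subset` itself offers.

```r
# Toy frequency table (hypothetical values)
freq_table <- data.frame(
  word = c("of", "and", "that", "with", "winner"),
  pmio = c(30000, 24000, 12000, 9000, 15000),
  lttr = c(2, 3, 4, 4, 6),
  stringsAsFactors = FALSE
)

# Rows that are either very frequent OR at least six letters long,
# keeping only the columns of interest
either <- subset(
  freq_table,
  pmio > 25000 | lttr >= 6,
  select = c(word, pmio, lttr)
)
```

In this toy table, `either` holds the rows for "of" (very frequent) and "winner" (six letters).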

Examples

# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  en_corp <- read.corp.custom(
    tokenized.obj,
    caseSens=FALSE
  )

  # look up frequencies for the word "winner"
  query(en_corp, var="word", query="winner")

  # show all entries with a frequency of exactly 3 in the corpus
  query(en_corp, "freq", 3)

  # now, which tokens appear more than 40000 times in a million?
  query(en_corp, "pmio", 40000, "gt")

  # example for a range request: tokens with a log10 between 4.2 and 4.7
  # (including these two values)
  query(en_corp, "log10", c(4.2, 4.7))
  # (and without them)
  query(en_corp, "log10", c(4.2, 4.7), "gt")

  # example for a list of queries: get words with a frequency between
  # 10000 and 25000 per million and at least four letters
  query(en_corp, query=list(
    list(pmio=c(10000, 25000)),
    list(lttr=4, rel="ge"))
  )

  # get all instances of "the" in a tokenized text object
  query(tokenized.obj, "token", "the")
} else {}

[Package koRpus version 0.13-8 Index]