search_contexts {corpustools}    R Documentation

Search for documents or sentences using Boolean queries

Description

Search a tCorpus and return the documents or sentences that match one or more Boolean queries.

Usage

search_contexts(
  tc,
  query,
  code = NULL,
  feature = "token",
  context_level = c("document", "sentence"),
  not = F,
  verbose = F,
  as_ascii = F
)

Arguments

tc

A tCorpus object.

query

A character string containing a query. See Details for the available query operators and modifiers. Multiple queries can be given as a character vector, in which case it is recommended to also specify the code argument to label the results.

code

If given, used as a label for the results of the query. Especially useful if multiple queries are used.

feature

The name of the feature column to search. Defaults to "token".

context_level

Select whether the queries should occur within whole "documents" or within specific "sentences". Results are returned at the specified level.

not

If TRUE, perform a NOT search: return the documents/sentences in which the query is not found.

verbose

If TRUE, progress messages are printed.

as_ascii

If TRUE, perform the search in ASCII.
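
As a rough illustration of the code and not arguments described above, the following sketch reuses the small corpus from the Examples. The labels 'first' and 'second' are arbitrary names chosen here for illustration, and the exact columns of the returned data.frame are not assumed.

```r
library(corpustools)

## same toy corpus as in the Examples section
text = c('A B C', 'D E F. G H I', 'A D', 'GGG')
tc = create_tcorpus(text, doc_id = c('a','b','c','d'), split_sentences = TRUE)

## two queries, labelled through the code argument (labels are illustrative)
hits = search_contexts(tc, c('A AND B', 'D OR I'), code = c('first', 'second'))
hits$hits

## not = TRUE: return the documents in which the query is NOT found
not_hits = search_contexts(tc, 'A', not = TRUE)
not_hits$hits
```

Setting context_level = 'sentence' in either call would apply the same logic to sentences instead of whole documents.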

Details

Brief summary of the query language

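The following operators and modifiers are supported, as illustrated in the Examples:

Boolean operators: AND, OR and NOT, which can be grouped with parentheses, e.g. A AND (B OR D).

Wildcards: * and ? within a term, e.g. G* or G?G.

Sequence search: terms between quotes must be adjacent, e.g. "A B". Angle brackets can be used instead of quotes, e.g. <A B>. A sequence can contain a nested OR, but not a nested AND or NOT.

Proximity search: the ~ flag with a number, e.g. "A C"~5, finds the terms within a window of that many words. A proximity query can contain a nested OR or a nested sequence (written with <>), but not a nested AND or NOT.

Case sensitive search: search is case insensitive by default. The ~s flag makes a term, or everything between parentheses or quotes, case sensitive, e.g. g~s or "A B"~s.

Query labels: a query can be prefixed with a label followed by #, e.g. 'query label# A AND B'.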

Value

A contextHits object, which is a list with $hits (a data.frame with the locations of the hits) and $queries (a copy of the queries, kept for provenance).
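
A minimal sketch of how the returned object might be inspected, assuming a two-document toy corpus made up for this illustration. It relies only on the two elements named above and does not assume a particular column layout.

```r
library(corpustools)

## small illustrative corpus (contents are made up for this sketch)
tc = create_tcorpus(c('A B C', 'A D'), doc_id = c('a', 'c'))

hits = search_contexts(tc, 'A AND B')

class(hits)    ## the class of the returned object
hits$hits      ## data.frame with the contexts in which the query was found
hits$queries   ## the queries that produced these hits
```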

Examples

text = c('A B C', 'D E F. G H I', 'A D', 'GGG')
tc = create_tcorpus(text, doc_id = c('a','b','c','d'), split_sentences = TRUE)
tc$tokens

hits = search_contexts(tc, c('query label# A AND B', 'second query# (A AND Q) OR ("D E") OR I'))
hits          ## print shows number of hits
hits$hits     ## hits is a list, with hits$hits being a data.frame with specific contexts
summary(hits) ## summary gives hits per query

## sentence level
hits = search_contexts(tc, c('query label# A AND B', 'second query# (A AND Q) OR ("D E") OR I'),
                       context_level = 'sentence')
hits$hits     ## the contexts are now sentences rather than whole documents



## query language examples

## single term
search_contexts(tc, 'A')$hits

search_contexts(tc, 'G*')$hits    ## wildcard *: terms starting with G
search_contexts(tc, '*G')$hits    ## wildcard *: terms ending with G
search_contexts(tc, 'G*G')$hits   ## wildcard *: terms starting and ending with G

search_contexts(tc, 'G?G')$hits   ## wildcard ?
search_contexts(tc, 'G?')$hits    ## wildcard ? (no hits)

## boolean
search_contexts(tc, 'A AND B')$hits
search_contexts(tc, 'A AND D')$hits
search_contexts(tc, 'A AND (B OR D)')$hits

search_contexts(tc, 'A NOT B')$hits
search_contexts(tc, 'A NOT (B OR D)')$hits


## sequence search (adjacent words)
search_contexts(tc, '"A B"')$hits
search_contexts(tc, '"A C"')$hits ## no hit, because not adjacent

search_contexts(tc, '"A (B OR D)"')$hits ## can contain nested OR
## cannot contain nested AND or NOT!!

search_contexts(tc, '<A B>')$hits ## can also use <> instead of "".

## proximity search (using ~ flag)
search_contexts(tc, '"A C"~5')$hits ## A AND C within a 5 word window
search_contexts(tc, '"A C"~1')$hits ## no hit, because A and C more than 1 word apart

search_contexts(tc, '"A (B OR D)"~5')$hits ## can contain nested OR
search_contexts(tc, '"A <B C>"~5')$hits    ## can contain nested sequence (must use <>)
search_contexts(tc, '<A <B C>>~5')$hits    ## (<> is always OK, but cannot nest quotes in quotes)
## cannot contain nested AND or NOT!!


## case sensitive search
search_contexts(tc, 'g')$hits     ## normally case insensitive
search_contexts(tc, 'g~s')$hits   ## use ~s flag to make term case sensitive

search_contexts(tc, '(a OR g)~s')$hits   ## use ~s flag on everything between parentheses
search_contexts(tc, '(a OR G)~s')$hits   ## use ~s flag on everything between parentheses

search_contexts(tc, '"a b"~s')$hits   ## use ~s flag on everything between quotes
search_contexts(tc, '"A B"~s')$hits   ## use ~s flag on everything between quotes



[Package corpustools version 0.5.1 Index]