R: Annotate a tokenlist based on rsyntax queries

annotate {rsyntax}

R Documentation

Annotate a tokenlist based on rsyntax queries

Description

This function has been renamed to annotate_tqueries.

Usage

annotate(
  tokens,
  column,
  ...,
  block = NULL,
  fill = TRUE,
  overwrite = FALSE,
  block_fill = FALSE,
  copy = TRUE,
  verbose = FALSE
)

Arguments

`tokens`	A tokenIndex data.table, or any data.frame coercible with as_tokenindex.
`column`	The name of the column in which the annotations are added. The unique ids are added in a second column with the suffix _id (e.g., clause_id for column = "clause").
`...`	One or multiple tqueries, or a list of queries, as created with tquery. Queries can be given a name by passing them as named arguments; the name is used in the annotation_id to keep track of which query was used.
`block`	Optionally, specify ids (doc_id - sentence - token_id triples) that are blocked from querying and filling (ignoring the id and recursive searches through the id).
`fill`	Logical. If TRUE (default), also assign the fill nodes (as specified in the tquery). Otherwise these are ignored.
`overwrite`	If TRUE, the existing column will be overwritten. Otherwise (default), the existing annotations in the column will be blocked, and new annotations will be added. This is identical to using multiple queries.
`block_fill`	If TRUE (and overwrite is FALSE), the existing fill nodes will also be blocked. In other words, the new annotations will only be added if the
`copy`	If TRUE (default), the data.table is copied. Otherwise, it is changed by reference. Changing by reference is faster and more memory efficient, but it departs from R's usual copy-on-modify behavior, so it is optional.
`verbose`	If TRUE, report progress (only useful if multiple queries are given).
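
The difference between copying and changing by reference, as described for the copy argument, can be sketched with plain data.table code. This is only an illustration of the general data.table behavior, not the rsyntax implementation; the helper add_label and the column name label are hypothetical.

```r
library(data.table)

# Hypothetical helper (not part of rsyntax): adds a column either to a
# deep copy of the input or to the input itself, depending on `copy`.
add_label <- function(dt, copy = TRUE) {
  if (copy) dt <- data.table::copy(dt)  # work on a deep copy
  dt[, label := "x"]                    # := modifies dt by reference
  dt
}

toks <- data.table(token_id = 1:3)

# copy = TRUE: the result has the new column, the input is left alone
out <- add_label(toks, copy = TRUE)
changed_after_copy <- "label" %in% names(toks)

# copy = FALSE: the input itself now carries the new column
out2 <- add_label(toks, copy = FALSE)
changed_after_ref <- "label" %in% names(toks)

c(copy = changed_after_copy, by_reference = changed_after_ref)
```

With copy = FALSE the caller's object changes even if the return value is not assigned, which is why the copying behavior is the default.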

Details

Apply queries to extract syntax patterns, and add the results as two columns to a tokenlist. One column contains the ids for each hit. The other column contains the annotations. Only nodes that are given a name in the tquery (using the 'label' parameter) will be added as annotation.

Note that while queries only find 1 node for each labeled component of a pattern (e.g., quote queries have 1 node for "source" and 1 node for "quote"), all children of these nodes can be annotated by setting fill to TRUE. If a child has multiple ancestors, only the most direct ancestors are used (see documentation for the fill argument).
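
The effect of fill can be checked by running the same query with and without it. The sketch below reuses the example data and the active query from the Examples section, and assumes that annotate_tqueries accepts the same arguments as documented on this page and that tokens without an annotation are NA in the annotation column.

```r
library(rsyntax)

# Same example data and query as in the Examples section of this page
tokens <- tokens_spacy[tokens_spacy$doc_id == "text3", ]
active <- tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("nsubj", "nsubjpass"), label = "subject"))

# fill = TRUE (default): children of the labeled nodes are annotated too
with_fill <- annotate_tqueries(tokens, "clause", act = active, fill = TRUE)

# fill = FALSE: only the nodes matched by the query itself are annotated
no_fill <- annotate_tqueries(tokens, "clause", act = active, fill = FALSE)

# Count annotated tokens under each setting
# (assumes unannotated tokens are NA in the "clause" column)
n_with_fill <- sum(!is.na(with_fill$clause))
n_no_fill   <- sum(!is.na(no_fill$clause))
c(with_fill = n_with_fill, no_fill = n_no_fill)
```

Because copy defaults to TRUE, both calls start from the same unannotated tokens object.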

Value

The tokenIndex with the annotation columns added.

Examples

## spacy tokens for: Mary loves John, and Mary was loved by John
tokens = tokens_spacy[tokens_spacy$doc_id == 'text3',]

## two simple example tqueries
passive = tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("agent"), label = "subject"))
active =  tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("nsubj", "nsubjpass"), label = "subject"))

 
tokens = annotate_tqueries(tokens, "clause", pas=passive, act=active)
tokens
if (interactive()) plot_tree(tokens, annotation='clause')

[Package rsyntax version 0.1.4 Index]