cnlp_annotate {cleanNLP}R Documentation

Run the annotation pipeline on a set of documents

Description

Runs the clean_nlp annotators over a given corpus of text using the desired backend. The details for which annotators to run and how to run them are specified by using one of: cnlp_init_stringi, cnlp_init_spacy, cnlp_init_udpipe, or cnlp_init_corenlp.

Usage

cnlp_annotate(
  input,
  backend = NULL,
  verbose = 10,
  text_name = "text",
  doc_name = "doc_id"
)

Arguments

input

an object containing the data to parse. Either a character vector with the texts (optional names can be given to provide document ids) or a data frame. The data frame should have a column named 'text' containing the raw text to parse; if there is a column named 'doc_id', it is treated as a a document identifier. The name of the text and document id columns can be changed by setting text_name and doc_name This conforms with corpus objects respecting the Text Interchange Format (TIF), while allowing for some variation.

backend

name of the backend to use. Will default to the last model to be initalized.

verbose

set to a positive integer n to display a progress message to display every n'th record. The default is 10. Set to a non-positive integer to turn off messages. Logical input is converted to an integer, so it also possible to set to TRUE (1) to display a message for every document and FALSE (0) to turn off messages.

text_name

column name containing the text input. The default is 'text'. This parameter is ignored when input is a character vector.

doc_name

column name containing the document ids. The default is 'doc_id'. This parameter is ignored when input is a character vector.

Details

The returned object is a named list where each element containing a data frame. The document table contains one row for each document, along with with all of the metadata that was passed as an input. The tokens table has one row for each token detected in the input. The first three columns are always "doc_id" (to index the input document), "sid" (an integer index for the sentence number), and "tid" (an integer index to the specific token). Together, these are a primary key for each row.

Other columns provide extracted data about each token, which differ slightly based on which backend, language, and options are supplied.

Value

a named list with components "token", "document" and (when running spacy with NER) "entity".

Author(s)

Taylor B. Arnold, taylor.arnold@acm.org

Examples

cnlp_init_stringi()
cnlp_annotate(un)


[Package cleanNLP version 3.0.7 Index]