cnlp_annotate {cleanNLP} | R Documentation |
Run the annotation pipeline on a set of documents
Description
Runs the cleanNLP annotators over a given corpus of text
using the desired backend. The details for which annotators to run and
how to run them are specified by using one of:
cnlp_init_stringi, cnlp_init_spacy, or cnlp_init_udpipe.
Usage
cnlp_annotate(
input,
backend = NULL,
verbose = 10,
text_name = "text",
doc_name = "doc_id"
)
Arguments
input |
an object containing the data to parse. Either a
character vector with the texts (optional names can
be given to provide document ids) or a data frame. The
data frame should have a column named 'text' containing
the raw text to parse; if there is a column named
'doc_id', it is treated as a document identifier.
The names of the text and document id columns can be
changed by setting text_name and doc_name. |
backend |
name of the backend to use. Defaults to the most recently initialized model. |
verbose |
set to a positive integer n to display a progress message every n'th record. The default is 10. Set to a non-positive integer to turn off messages. Logical input is converted to an integer, so it is also possible to set this to TRUE (1) to display a message for every document and FALSE (0) to turn off messages. |
text_name |
column name containing the text input. The default
is 'text'. This parameter is ignored when input is a character vector. |
doc_name |
column name containing the document ids. The default
is 'doc_id'. This parameter is ignored when input is a character vector.
|
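The two accepted input forms can be sketched as follows. The texts, ids, and the column names 'id' and 'body' are made up for illustration; only the argument names come from the Usage section above. The annotation calls are guarded so that the snippet still runs when cleanNLP is not installed.

```r
# Form 1: a named character vector; the names act as document ids.
# These texts and ids are invented for the example.
docs <- c(doc1 = "The cat sat. It purred.", doc2 = "Dogs bark loudly.")

# Form 2: a data frame whose columns do NOT use the default names
# 'text' and 'doc_id' (hypothetical column names 'body' and 'id').
df <- data.frame(
  id = names(docs),
  body = unname(docs),
  stringsAsFactors = FALSE
)

# Only run the annotation when cleanNLP is available.
if (requireNamespace("cleanNLP", quietly = TRUE)) {
  cleanNLP::cnlp_init_stringi()

  # Character vector input: text_name and doc_name are not needed.
  anno_vec <- cleanNLP::cnlp_annotate(docs, verbose = FALSE)

  # Data frame input: point text_name and doc_name at the custom columns.
  anno_df <- cleanNLP::cnlp_annotate(
    df,
    text_name = "body",
    doc_name = "id",
    verbose = FALSE
  )

  # Inspect which tables were returned.
  print(names(anno_vec))
}
```

Setting verbose = FALSE here simply silences the progress messages described above.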
Details
The returned object is a named list in which each element contains a data frame. The document table contains one row for each document, along with all of the metadata that was passed as an input. The tokens table has one row for each token detected in the input. The first three columns are always "doc_id" (to index the input document), "sid" (an integer index for the sentence number), and "tid" (an integer index to the specific token). Together, these are a primary key for each row.
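As a rough illustration of the primary-key property, the sketch below checks uniqueness on a small hand-built table. The table is toy data, not real annotator output, and it assumes that "tid" restarts within each sentence, which is why the three columns are needed together.

```r
# Toy stand-in for a tokens table (hand-written, not annotator output).
# Assumption: tid restarts at 1 within each sentence.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2"),
  sid    = c(1L, 1L, 2L, 2L, 1L, 1L),
  tid    = c(1L, 2L, 1L, 2L, 1L, 2L),
  token  = c("Cats", "purr", "Dogs", "bark", "Birds", "sing"),
  stringsAsFactors = FALSE
)

key_cols <- c("doc_id", "sid", "tid")

# anyDuplicated() returns 0 when no row of the key columns repeats.
key_is_unique <- anyDuplicated(tokens[, key_cols]) == 0

# In this toy table, tid on its own repeats across sentences and documents.
tid_alone_is_unique <- anyDuplicated(tokens$tid) == 0

print(key_is_unique)
print(tid_alone_is_unique)
```

The same check can be applied to a real tokens table to confirm that joins on these three columns will not duplicate rows.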
Other columns provide extracted data about each token, which differ slightly based on which backend, language, and options are supplied.
- token: detected token, as given in the original input
- token_with_ws: detected token along with white space; in theory, collapsing this field through an entire document will yield the original text
- lemma: lemmatised version of the token; the exact form depends on the chosen language and backend
- upos: the universal part of speech code; see https://universaldependencies.org/u/pos/all.html for more information
- xpos: language dependent part of speech code; the specific categories and their meaning depend on the chosen backend, model and language
- feats: other extracted linguistic features, typically given as Universal Dependencies features (https://universaldependencies.org/u/feat/index.html), but can be model dependent; currently only provided by the udpipe backend
- tid_source: the token id (tid) of the head word for the dependency relationship starting from this token; for the token attached to the root, this will be given as zero
- relation: the dependency relation, usually provided using Universal Dependencies (see https://universaldependencies.org/), but could be different for a specific model
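Two of the columns above lend themselves to a short sketch: "token_with_ws" can be pasted back together to approximate the original text, and "tid_source" can be joined against "tid" to find each token's head word. The table below is hand-built toy data with plausible values, not output from any backend, and it assumes that "tid_source" refers to a "tid" within the same sentence.

```r
# Toy tokens table for one sentence (hand-written, not annotator output).
tokens <- data.frame(
  doc_id        = rep("doc1", 4),
  sid           = rep(1L, 4),
  tid           = 1:4,
  token         = c("The", "cat", "sat", "."),
  token_with_ws = c("The ", "cat ", "sat", "."),
  tid_source    = c(2L, 3L, 0L, 3L),
  relation      = c("det", "nsubj", "root", "punct"),
  stringsAsFactors = FALSE
)

# 1. Collapse token_with_ws within each document to rebuild the text.
text <- vapply(
  split(tokens$token_with_ws, tokens$doc_id),
  paste,
  character(1),
  collapse = ""
)

# 2. Look up the head word: rename the key of a copy so that its tid
#    lines up with tid_source, then left-join on the primary key.
heads <- tokens[, c("doc_id", "sid", "tid", "token")]
names(heads) <- c("doc_id", "sid", "tid_source", "head_token")

joined <- merge(
  tokens, heads,
  by = c("doc_id", "sid", "tid_source"),
  all.x = TRUE  # keep the root token, whose tid_source of 0 has no match
)
joined <- joined[order(joined$doc_id, joined$sid, joined$tid), ]

print(text)
print(joined[, c("token", "relation", "head_token")])
```

The root token ends up with a missing head_token because no token has a tid of zero; with a real tokens table the same join applies, provided the backend in use supplies the dependency columns.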
Value
a named list with components "token", "document" and (when running spacy with NER) "entity".
Author(s)
Taylor B. Arnold, taylor.arnold@acm.org
Examples
cnlp_init_stringi()
cnlp_annotate(un)