R: Parse a text using spaCy

spacy_parse {spacyr}

R Documentation

Parse a text using spaCy

Description

The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.frame of the results. The function provides options for the part-of-speech tagset returned, either the universal tagset (pos) or the detailed tagset (tag), as well as lemmatization (lemma). Named entity recognition (entity), dependency parsing (dependency), and noun phrase tagging (nounphrase) are available as options. Setting all of these options to TRUE returns the most extensive set of parsing results from spaCy.

Usage

spacy_parse(
  x,
  pos = TRUE,
  tag = FALSE,
  lemma = TRUE,
  entity = TRUE,
  dependency = FALSE,
  nounphrase = FALSE,
  multithread = TRUE,
  additional_attributes = NULL,
  ...
)

Arguments

`x`	a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
`pos`	logical; whether to return the Universal Dependencies POS tagset (https://universaldependencies.org/u/pos/)
`tag`	logical; whether to return detailed part-of-speech tags. For the language model `en`, it uses the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation specifications for other languages are available on the spaCy website (https://spacy.io/api/annotation).
`lemma`	logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)
`entity`	logical; if `TRUE`, report named entities
`dependency`	logical; if `TRUE`, analyse and tag dependencies
`nounphrase`	logical; if `TRUE`, analyse and tag noun phrases
`multithread`	logical; if `TRUE`, the processing is parallelized using spaCy's architecture (https://spacy.io/api)
`additional_attributes`	a character vector of additional token attributes to extract from spaCy. When attribute names are supplied, the output data.frame will contain additional variables named after those attributes. For instance, when `additional_attributes = c("is_punct")`, the output will include an additional variable named `is_punct`, which is a Boolean (in R, logical) variable indicating whether the token is punctuation. A full list of available attributes is at https://spacy.io/api/token#attributes.
`...`	not used directly

Value

a data.frame of tokenized, parsed, and annotated tokens

Examples

## Not run: 
spacy_initialize()
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_parse(txt2, entity = TRUE, dependency = TRUE)

txt3 <- "We analyzed the Supreme Court with three natural language processing tools." 
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
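
# A minimal sketch, not one of the package's own examples: since each name in
# additional_attributes becomes a column of the returned data.frame, the
# logical is_punct column can be used for ordinary subsetting.
# Assumes spacy_initialize() has already been called, as above; the variable
# name `parsed` is illustrative only.
parsed <- spacy_parse(txt3, additional_attributes = "is_punct")
parsed[!parsed$is_punct, ]  # keep only the rows for non-punctuation tokens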

## End(Not run)

[Package spacyr version 1.3.0 Index]