R: Apache OpenNLP based word token annotators

Maxent_Word_Token_Annotator {openNLP}

R Documentation

Description

Generate an annotator which computes word token annotations using the Apache OpenNLP Maxent tokenizer.

Usage

Maxent_Word_Token_Annotator(language = "en", probs = FALSE, model = NULL)

Arguments

`language`	a character string giving the ISO-639 code of the language being processed by the annotator.
`probs`	a logical indicating whether the computed annotations should provide the token probabilities obtained from the Maxent model as their ‘prob’ feature.
`model`	a character string giving the path to the Maxent model file to be used, or `NULL` indicating to use a default model file for the given language (if available, see Details).
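
As a minimal sketch of the `model` argument, the following passes an explicit model file instead of relying on the language default. The path `models/en-token.bin` is a hypothetical location for a tokenizer model downloaded by hand, and the `file.exists()` guard is only there so that the snippet does nothing harmful when the file is absent.

```r
## Hypothetical location of a manually downloaded tokenizer model file;
## adjust to wherever the file actually lives.
model_file <- file.path("models", "en-token.bin")

if (file.exists(model_file) && requireNamespace("openNLP", quietly = TRUE)) {
    ## Use the given model file rather than the default for the language,
    ## and record the token probabilities in the 'prob' feature.
    word_token_annotator <-
        openNLP::Maxent_Word_Token_Annotator(language = "en",
                                             probs = TRUE,
                                             model = model_file)
} else {
    message("Model file not found (or openNLP not installed); skipping.")
}
```

With `model = NULL` (the default), no path is needed and the annotator looks for a default model for `language`.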

Details

See http://opennlp.sourceforge.net/models-1.5/ for available model files. For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package, where language is the ISO-639 code (for example, openNLPmodels.de for German), from the repository at https://datacube.wu.ac.at. For English, no additional installation is required.
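
The installation step described above might look roughly as follows. `install_opennlp_models()` is a hypothetical helper, not part of openNLP, and `type = "source"` is an assumption about how the repository serves these packages.

```r
## Hypothetical helper (not part of openNLP): install the model package
## for a given ISO-639 language code if it is not yet available.
install_opennlp_models <- function(language,
                                   repos = "https://datacube.wu.ac.at/") {
    pkg <- paste0("openNLPmodels.", language)
    if (!requireNamespace(pkg, quietly = TRUE)) {
        ## type = "source" is an assumption; drop it if the repository
        ## provides binaries for your platform.
        install.packages(pkg, repos = repos, type = "source")
    }
    invisible(pkg)
}

## Usage (not run here, since it needs network access):
## install_opennlp_models("de")
## word_token_annotator_de <- Maxent_Word_Token_Annotator(language = "de")
```

Once the model package is installed, calling `Maxent_Word_Token_Annotator()` with the matching `language` and `model = NULL` should pick up the default model for that language.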

Value

An Annotator object giving the generated word token annotator.

Examples

require("NLP")
require("openNLP")
## Some text.
s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
             "nonexecutive director Nov. 29.\n",
             "Mr. Vinken is chairman of Elsevier N.V., ",
             "the Dutch publishing group."),
           collapse = "")
s <- as.String(s)

## Need sentence token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)

word_token_annotator <- Maxent_Word_Token_Annotator()
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Variant with word token probabilities as features.
head(annotate(s, Maxent_Word_Token_Annotator(probs = TRUE), a1))

## Can also perform sentence and word token annotations in a pipeline:
a <- annotate(s, list(sent_token_annotator, word_token_annotator))
head(a)
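
The annotations above only record character spans. As a sketch of how the token strings and their probabilities might be pulled out, the following uses a shortened version of the example text; the names `a_words`, `tokens` and `probs` are illustrative only.

```r
library("NLP")
library("openNLP")

s <- as.String("Pierre Vinken, 61 years old, will join the board.")

## Sentence and word token annotations in one pipeline, with probabilities.
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator(probs = TRUE)))

## Keep only the word annotations.
a_words <- subset(a, type == "word")

## Subscripting the String object with annotations gives the substrings
## they span, here the tokens themselves.
tokens <- s[a_words]

## Each annotation carries its features as a list; pull out 'prob'.
probs <- sapply(a_words$features, `[[`, "prob")

data.frame(token = tokens, prob = probs)
```

Filtering on `type == "word"` matters because the pipeline result also contains the sentence annotations.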

[Package openNLP version 0.2-7 Index]