R: Recode words with compound multi-word expressions

txt_recode_ngram {udpipe}

R Documentation

Recode words with compound multi-word expressions

Description

Replace tokens in a character vector with compound multi-word expressions, so that c("New", "York") becomes c("New York", NA).

Usage

txt_recode_ngram(x, compound, ngram, sep = " ")

Arguments

`x`	a character vector of words in which you want to replace tokens with compound multi-word expressions. This is generally a character vector as returned by the token column of `as.data.frame(udpipe_annotate(txt))`.
`compound`	a character vector of compound multi-word expressions indicating terms which can be considered as one word. For example `c('New York', 'Brussels Hoofdstedelijk Gewest')`.
`ngram`	an integer vector of the same length as `compound` indicating how many terms there are in each compound multi-word expression given by `compound`, where `compound[i]` contains `ngram[i]` words. So if `compound` is `c('New York', 'Brussels Hoofdstedelijk Gewest')`, the ngram would be `c(2, 3)`.
`sep`	separator used when the compounds were constructed by combining the words together into a compound multi-word expression. Defaults to a space: ' '.
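
When only the compounds are at hand, the matching `ngram` vector can be derived by counting the words in each compound. The following is a minimal base R sketch, assuming that no individual word contains the separator itself; the variable names are illustrative.

```r
# Compounds as they would be passed to the `compound` argument
compound <- c("New York", "Brussels Hoofdstedelijk Gewest")
sep <- " "

# Split each compound on the literal separator and count the pieces
# (assumption: no single word contains `sep`)
ngram <- lengths(strsplit(compound, split = sep, fixed = TRUE))
ngram
```

Using `fixed = TRUE` treats the separator as a literal string rather than a regular expression, which matters for separators such as "." or "|".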

Value

The same character vector x, where matched elements are replaced by the compound multi-word expression and the remaining tokens of that compound are set to NA. It gives preference to replacing with compounds with higher ngrams if these occur. See the examples.
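
To illustrate the preference for higher ngrams, here is a rough base R sketch of a longest-first replacement. The helper `recode_ngram_sketch` is hypothetical and written only for illustration; it is not the implementation used by udpipe and may differ from it in edge cases.

```r
# Hypothetical helper illustrating the documented behaviour in base R.
# This is NOT udpipe's implementation.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  out <- x
  n <- length(x)
  # Try the longest compounds first so that they take precedence
  for (k in sort(unique(ngram), decreasing = TRUE)) {
    if (k < 2 || k > n) next
    targets <- compound[ngram == k]
    for (i in seq_len(n - k + 1)) {
      idx <- i:(i + k - 1)
      # Skip windows that overlap an earlier replacement
      if (anyNA(out[idx]) || !identical(out[idx], x[idx])) next
      candidate <- paste(x[idx], collapse = sep)
      if (candidate %in% targets) {
        out[i] <- candidate              # first token becomes the compound
        out[idx[-1]] <- NA_character_    # remaining tokens become NA
      }
    }
  }
  out
}

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- recode_ngram_sketch(x, compound = c("New-York", "New-York-City"),
                         ngram = c(2, 3), sep = "-")
data.frame(x, y)
```

In this sketch the 3-gram "New-York-City" is matched before the 2-gram "New-York", so position 4 holds the longer compound and positions 5 and 6 are NA.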

Examples

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")
data.frame(x, y)

keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3))
y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-")
data.frame(x, y)

## Example replacing adjectives followed by a noun with the full compound word
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN", 
                         is_regex = TRUE, detailed = FALSE)
head(keyw)
x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram)
head(x[, c("token", "term", "xpos")], 12)