txt_recode_ngram {udpipe} | R Documentation |
Recode words with compound multi-word expressions
Description
Replace tokens in a character vector with compound multi-word expressions, so that c("New", "York") becomes c("New York", NA).
Usage
txt_recode_ngram(x, compound, ngram, sep = " ")
Arguments
x |
a character vector of words where you want to replace tokens with compound multi-word expressions.
This is generally a character vector as returned by the token column of |
compound |
a character vector of compound multi-word expressions, i.e. terms which can be considered as one word.
For example |
ngram |
an integer vector of the same length as |
sep |
separator used when the compounds were constructed by combining the words together into a compound multi-word expression. Defaults to a space: ' '. |
Value
The same character vector x, where elements of x are replaced by compound multi-word expressions. The first token of a match holds the compound and the following tokens of that match are set to NA.
Preference is given to compounds with a higher ngram if these occur. See the examples.
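The behaviour described above can be illustrated with a minimal base-R sketch. This is not the udpipe implementation; recode_ngram_sketch is a hypothetical helper written only to show the documented semantics (longest compounds first, absorbed tokens set to NA), and it assumes compounds are matched by pasting consecutive tokens together with sep.

```r
# Hypothetical helper, NOT the udpipe implementation: a base-R sketch of the
# documented behaviour, for illustration only.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  out <- x
  n <- length(x)
  # Try longer compounds first so that they take preference.
  for (k in sort(unique(ngram), decreasing = TRUE)) {
    if (k < 2 || k > n) next
    targets <- compound[ngram == k]
    for (i in seq_len(n - k + 1)) {
      idx <- i:(i + k - 1)
      # Skip windows that overlap an earlier replacement.
      if (any(is.na(out[idx])) || !identical(out[idx], x[idx])) next
      candidate <- paste(x[idx], collapse = sep)
      if (candidate %in% targets) {
        out[i] <- candidate   # first token holds the compound
        out[idx[-1]] <- NA    # remaining tokens of the match become NA
      }
    }
  }
  out
}

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")

y2 <- recode_ngram_sketch(x, compound = "New York", ngram = 2)
# "I" "went" "to" "New York" NA "City" "on" "holiday" "."

y3 <- recode_ngram_sketch(x, compound = c("New-York", "New-York-City"),
                          ngram = c(2, 3), sep = "-")
# "I" "went" "to" "New-York-City" NA NA "on" "holiday" "."
```

In the second call the 3-gram wins over the 2-gram because the sketch scans compounds in decreasing order of ngram and never revisits tokens that were already absorbed.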
See Also
Examples
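## Recode one compound: "New", "York" becomes "New York" followed by NA
x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")
data.frame(x, y)
## Compounds built with another separator; the compound with the higher ngram takes preference
keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3))
y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-")
data.frame(x, y)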
## Example replacing adjectives followed by a noun with the full compound word
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN",
                         is_regex = TRUE, detailed = FALSE)
head(keyw)
x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram)
head(x[, c("token", "term", "xpos")], 12)
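Because tokens absorbed into a compound are returned as NA, a recoded vector usually needs one more step before it is counted or tabulated. The lines below are a small sketch of that step, assuming the udpipe package is installed; the variable name terms is only illustrative.

```r
library(udpipe)

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")

# Drop the NA placeholders so that each compound counts as a single term.
terms <- y[!is.na(y)]
table(terms)
```

When the recoded values live in a data.frame column, as with x$term above, the same idea applies by subsetting the rows where that column is not NA.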