R: Text cleaning specific for input to word2vec

txt_clean_word2vec {word2vec}

R Documentation

Text cleaning specific for input to word2vec

Description

Standardise text by:

Converting the text from UTF-8 to ASCII
Keeping only alphanumeric characters (letters and numbers)
Collapsing runs of multiple spaces
Removing leading/trailing spaces
Lowercasing
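
The steps listed above can be approximated in base R. The following is a minimal sketch, not the package's actual implementation: the helper name `clean_sketch`, the use of `ASCII//TRANSLIT`, the replacement of non-alphanumeric characters by a space, and the order of the steps are all assumptions made for illustration.

```r
# Hypothetical helper: a base R approximation of the cleaning steps,
# NOT the code used inside txt_clean_word2vec.
clean_sketch <- function(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE) {
  if (ascii) {
    # Assumption: transliterate to ASCII; results for non-ASCII input
    # depend on the platform's iconv implementation.
    x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  }
  if (alpha) {
    # Assumption: every non-alphanumeric character becomes a space.
    x <- gsub("[^[:alnum:]]", " ", x)
  }
  # Collapse runs of white space (spaces, tabs) into a single space.
  x <- gsub("[[:space:]]+", " ", x)
  if (trim) {
    x <- trimws(x)
  }
  if (tolower) {
    x <- base::tolower(x)
  }
  x
}

x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")
clean_sketch(x)
```

With the defaults, this sketch turns the two inputs into "just some texts ok" and "123 456 and some more". The real function may treat punctuation or non-ASCII characters differently.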

Usage

txt_clean_word2vec(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)

Arguments

`x`	a character vector in UTF-8 encoding
`ascii`	logical indicating whether to use `iconv` to convert the input from UTF-8 to ASCII. Defaults to TRUE.
`alpha`	logical indicating whether to keep only alphanumeric characters. Defaults to TRUE.
`tolower`	logical indicating whether to lowercase `x`. Defaults to TRUE.
`trim`	logical indicating whether to trim leading/trailing white space. Defaults to TRUE.

Value

a character vector of the same length as `x`, standardised by converting the encoding to ASCII, lowercasing, and keeping only alphanumeric characters

Examples

x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")
txt_clean_word2vec(x)
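
As a further illustration, the logical arguments documented above can be switched off one at a time to see what each step contributes. This is a sketch that assumes the word2vec package is installed; the variable names `y_case` and `y_punct` are chosen here for illustration, and no output is shown because it depends on the package's behaviour.

```r
library(word2vec)

x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")

# Keep the original casing, apply every other step
y_case <- txt_clean_word2vec(x, tolower = FALSE)

# Keep punctuation and other non-alphanumeric characters
y_punct <- txt_clean_word2vec(x, alpha = FALSE)

y_case
y_punct
```

In both cases the result is a character vector with one element per element of `x`, as stated in the Value section.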

[Package word2vec version 0.4.0 Index]