R: Pretreatment of textual documents for NLP.

Pretreatment {LilRhino}

R Documentation

Pretreatment of textual documents for NLP.

Description

This function applies a series of pretreatment steps to prepare documents for vectorization. The steps standardize the text so that there are fewer outliers when training NLP models. The following effects are applied: 1. Non-alphanumeric characters are removed. 2. Numbers are separated from letters. 3. Numbers are replaced with their word equivalents. 4. Words are stemmed (optional). 5. Words are lowercased (optional).
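As a rough illustration of steps 1, 2, 3 and 5, the following base R sketch applies comparable transformations. It is not the LilRhino implementation: the name `sketch_pretreat`, the choice to replace punctuation with a space, and the digit-by-digit spelling of numbers are all assumptions made for the example. Stemming (step 4) is left out because it would need a separate stemmer package.

```r
# Illustrative sketch only; not the code used by Pretreatment().
digit_words <- c("zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine")

# Hypothetical helper that mimics steps 1, 2, 3 and 5 of the description.
sketch_pretreat <- function(docs, lower = TRUE) {
  lapply(docs, function(doc) {
    # 1. Replace anything that is not a letter, digit, or space with a space.
    doc <- gsub("[^[:alnum:] ]", " ", doc)
    # 2. Insert a space between adjacent letters and digits (both orders).
    doc <- gsub("([[:alpha:]])([[:digit:]])", "\\1 \\2", doc)
    doc <- gsub("([[:digit:]])([[:alpha:]])", "\\1 \\2", doc)
    # 3. Spell out each digit on its own (a simplification: "42" becomes "four two").
    for (d in 0:9) {
      doc <- gsub(as.character(d), paste0(" ", digit_words[d + 1], " "),
                  doc, fixed = TRUE)
    }
    # 5. Optionally lowercase.
    if (lower) doc <- tolower(doc)
    # Collapse runs of whitespace and trim the ends.
    trimws(gsub("\\s+", " ", doc))
  })
}

print(sketch_pretreat(c("This is a test", "Ahoy!", "my battle-ship is on... b6!")))
```

Like the function documented here, the sketch returns a list with one string per input document.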

Usage

Pretreatment(title_vec, stem = TRUE, lower = TRUE, parallel = FALSE)

Arguments

`title_vec`	Vector of documents to be pre-treated.
`stem`	Boolean value that decides whether words are stemmed.
`lower`	Boolean value that decides whether words are lowercased.
`parallel`	Boolean value that decides whether the function runs in parallel.

Details

This function returns a list. It should accept any input format that the function lapply would accept. Parallelization is done with the function mclapply from the package 'parallel' and only works on systems that allow forking (so not on Windows). Future updates are planned to add socket-based parallelism.
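The fork-based behavior described above can be sketched as follows. This is a minimal example of the general pattern, not the package's code: `apply_docs` and its arguments are hypothetical names, and `mc.cores = 2` is an arbitrary choice.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each document, forking only where supported.
apply_docs <- function(docs, fun, use_parallel = FALSE) {
  # mclapply relies on forking, which is unavailable on Windows.
  can_fork <- .Platform$OS.type != "windows"
  if (use_parallel && can_fork) {
    mclapply(docs, fun, mc.cores = 2)
  } else {
    # Sequential fallback with the same return shape (a list).
    lapply(docs, fun)
  }
}

res <- apply_docs(c("This is a test", "Ahoy!"), toupper, use_parallel = TRUE)
print(res)
```

Both branches return a list, so calling code does not need to know which one ran.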

Value

output

The list of character strings after pretreatment, one per input document.

Author(s)

Travis Barton

Examples

## Not run:  # for some reason it takes longer than 5 seconds on CRAN's computers
test_vec = c('This is a test', 'Ahoy!', 'my battle-ship is on... b6!')
res = Pretreatment(test_vec)
print(res)

## End(Not run)

[Package LilRhino version 1.2.2 Index]