Pretreatment {LilRhino} | R Documentation
Pretreatment of textual documents for NLP.
Description
This function applies a series of pretreatment steps to prepare documents for vectorization. The steps standardize the text so that there are fewer outlier tokens when training an NLP model. The following effects are applied:
1. Non-alphanumeric characters are removed.
2. Numbers are separated from letters.
3. Numbers are replaced with their word equivalents.
4. Words are stemmed (optional).
5. Words are lowercased (optional).
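The steps above can be approximated in base R. This is a minimal sketch, not the package's implementation: `sketch_pretreat` and `digit_words` are hypothetical names, removed characters are assumed to be replaced by spaces, numbers are spelled out digit by digit as a simplification, and stemming (step 4) is left out because it needs an external stemmer.

```r
# Hypothetical helper, not part of LilRhino: approximates steps 1, 2, 3 and 5.
digit_words <- c("zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine")

sketch_pretreat <- function(docs, lower = TRUE) {
  lapply(docs, function(doc) {
    # Step 1: replace anything that is not a letter, digit, or space
    # with a space (assumption: the package may delete it instead).
    doc <- gsub("[^A-Za-z0-9 ]", " ", doc)
    # Step 2: separate digits from adjacent letters.
    doc <- gsub("([A-Za-z])([0-9])", "\\1 \\2", doc)
    doc <- gsub("([0-9])([A-Za-z])", "\\1 \\2", doc)
    # Step 3: spell out each digit (simplification: "42" becomes "four two").
    for (d in 0:9) {
      doc <- gsub(as.character(d), paste0(" ", digit_words[d + 1], " "),
                  doc, fixed = TRUE)
    }
    # Step 5: optionally lowercase.
    if (lower) doc <- tolower(doc)
    # Collapse runs of whitespace left behind by the earlier steps.
    trimws(gsub("[[:space:]]+", " ", doc))
  })
}

res_sketch <- sketch_pretreat(c("Ahoy!", "my battle-ship is on... b6!"))
print(res_sketch)
```

Like the function documented here, the sketch returns a list with one element per input document.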
Usage
Pretreatment(title_vec, stem = TRUE, lower = TRUE, parallel = FALSE)
Arguments
title_vec |
Vector of documents to be pretreated. |
stem |
Logical. Whether to stem words. |
lower |
Logical. Whether to lowercase words. |
parallel |
Logical. Whether to run the function in parallel. |
Details
This function returns a list. It should accept any input that lapply would accept. Parallelization is done with the function mclapply from the package 'parallel' and only works on systems that allow forking (sorry, Windows users). Future updates will allow for socket-based parallelization.
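The fork-based behaviour described above can be illustrated with a small wrapper. This is only a sketch under stated assumptions: `apply_maybe_parallel` is a hypothetical helper, not the package's code, and the core count of 2 is arbitrary. It calls `parallel::mclapply` when forking is available and falls back to `lapply` otherwise.

```r
library(parallel)

# Hypothetical wrapper, not part of LilRhino.
apply_maybe_parallel <- function(x, fun, parallel = FALSE) {
  # Forking is unavailable on Windows, so fall back to lapply there.
  can_fork <- .Platform$OS.type != "windows"
  if (parallel && can_fork) {
    mclapply(x, fun, mc.cores = 2L)  # arbitrary core count for illustration
  } else {
    lapply(x, fun)
  }
}

res_par <- apply_maybe_parallel(c("This is a test", "Ahoy!"), toupper,
                                parallel = TRUE)
print(res_par)
```

Both branches return a list of the same shape, so callers do not need to know which one ran.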
Value
output |
The list of character strings post-pretreatment |
Author(s)
Travis Barton
Examples
## Not run: # for some reason it takes longer than 5 seconds on CRAN's computers
library(LilRhino)
# Three short documents with punctuation and a mixed letter-number token
test_vec <- c('This is a test', 'Ahoy!', 'my battle-ship is on... b6!')
res <- Pretreatment(test_vec)
print(res)
## End(Not run)