textcleaner {SemNetCleaner} | R Documentation |
Text Cleaner
Description
An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data.
Usage
textcleaner(
  data = NULL,
  miss = 99,
  partBY = c("row", "col"),
  dictionary = NULL,
  spelling = c("UK", "US"),
  add.path = NULL,
  keepStrings = FALSE,
  allowPunctuations = c("-", "all"),
  allowNumbers = FALSE,
  lowercase = TRUE,
  continue = NULL
)
Arguments
data |
Matrix or data frame. A dataset of text data. Participant IDs will be automatically identified if they are included. If no IDs are provided, participants are identified by their order in the corresponding row (or column). A message will notify the user how IDs were assigned |
miss |
Numeric or character.
Value for missing data.
Defaults to 99 |
partBY |
Character.
Are participants by row or column?
Set to "row" if participants are in rows or "col" if they are in columns |
dictionary |
Character vector.
Can be a vector of a corpus or any text for comparison.
Dictionary to be used for more efficient text cleaning.
Defaults to NULL |
spelling |
Character vector. English spelling to be used.
|
add.path |
Character.
Path to a folder containing additional dictionaries.
The search is NOT recursive (subfolders of the path are not searched),
which avoids a time-intensive search.
Set to |
keepStrings |
Boolean.
Should strings be retained or separated?
Defaults to FALSE |
allowPunctuations |
Character vector.
Allows punctuation characters to be included in responses.
Defaults to |
allowNumbers |
Boolean.
Should numbers be allowed in responses?
Defaults to FALSE |
lowercase |
Boolean.
Should words be converted to lowercase?
Defaults to TRUE |
continue |
List.
A previously unfinished result that still needs to be completed.
Allows you to continue manually spell-checking your data
after the function was closed or exited with an error.
Defaults to NULL |
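To make the miss and partBY arguments concrete, the base R sketch below shows one way such preprocessing could work on a toy matrix. It is an illustration only, not SemNetCleaner's implementation: the helper name prep_responses and the toy data are hypothetical.

```r
# Hypothetical helper (NOT part of SemNetCleaner): orient the data so that
# participants are in rows and recode the missing-value marker as NA.
prep_responses <- function(data, miss = 99, partBY = c("row", "col")) {
  partBY <- match.arg(partBY)
  data <- as.matrix(data)

  # Transpose when participants are stored by column
  if (partBY == "col") {
    data <- t(data)
  }

  # Compare as character so numeric and character markers both match
  is_miss <- !is.na(data) & trimws(as.character(data)) == as.character(miss)
  data[is_miss] <- NA
  data
}

# Toy data: two participants stored by column, 99 marks a missing response
toy <- cbind(p1 = c("cat", "dog", "99"), p2 = c("horse", "99", "99"))
prepped <- prep_responses(toy, miss = 99, partBY = "col")
print(prepped)
```

After this step each row holds one participant's responses, which matches the layout textcleaner expects when partBY = "row".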
Value
This function returns a list containing the following objects:
binary |
A matrix of responses where each row represents a participant
and each column represents a unique response. A response that a participant has provided is a ' |
responses |
A list containing two objects:
|
spellcheck |
A list containing three objects:
|
removed |
A list containing two objects:
|
partChanges |
A list where each participant is a list index with each
response that has been changed. Participants are identified by their ID (see argument |
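The layout of the binary object (participants in rows, unique responses in columns) can be illustrated with a small base R sketch. This is not the package's code. The toy responses and participant IDs are made up, and the 1/0 coding is an assumption, because the description of the coding above is cut off.

```r
# Toy cleaned responses, keyed by (hypothetical) participant ID
responses <- list(
  "1" = c("cat", "dog"),
  "2" = c("dog", "horse", "dog"),  # the duplicate "dog" counts once
  "3" = c("cat")
)

# Columns are the unique responses across all participants
vocab <- sort(unique(unlist(responses)))

# One row per participant: 1 if the response was given, 0 otherwise
# (assumed coding)
binary <- t(vapply(
  responses,
  function(r) as.integer(vocab %in% r),
  integer(length(vocab))
))
colnames(binary) <- vocab
print(binary)
```

Row names carry the participant IDs and column names carry the unique responses, so a single cell can be looked up with, for example, binary["2", "horse"].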
Author(s)
Alexander Christensen <alexpaulchristensen@gmail.com>
References
Hornik, K., & Murdoch, D. (2011). Watch your spelling! The R Journal, 3(2), 22-28.
Examples
# Toy example: first 10 rows, dropping the first three columns
raw <- open.animals[c(1:10), -c(1:3)]

if (interactive()) {
  # Full test (runs only in an interactive session)
  clean <- textcleaner(open.animals[, -c(1, 2)], partBY = "row", dictionary = "animals")
}