R: Text Cleaner

textcleaner {SemNetCleaner}

R Documentation

Text Cleaner

Description

An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data

Usage

textcleaner(
  data = NULL,
  miss = 99,
  partBY = c("row", "col"),
  dictionary = NULL,
  spelling = c("UK", "US"),
  add.path = NULL,
  keepStrings = FALSE,
  allowPunctuations = c("-", "all"),
  allowNumbers = FALSE,
  lowercase = TRUE,
  continue = NULL
)

Arguments

`data`	Matrix or data frame. A dataset of text data. Participant IDs will be automatically identified if they are included. If no IDs are provided, then their order in the corresponding row (or column) is used. A message will notify the user how IDs were assigned
`miss`	Numeric or character. Value for missing data. Defaults to `99`
`partBY`	Character. Are participants by row or column? Set to `"row"` for by row. Set to `"col"` for by column
`dictionary`	Character vector. Can be a vector of a corpus or any text for comparison. Dictionary to be used for more efficient text cleaning. Defaults to `NULL`, which will use `general.dictionary`. Use `dictionaries()` or `find.dictionaries()` for more options (see `SemNetDictionaries` for more details)
`spelling`	Character vector. English spelling to be used. Set to `"UK"` for British spelling (e.g., colour, grey, programme, theatre). Set to `"US"` for American spelling (e.g., color, gray, program, theater)
`add.path`	Character. Path to additional dictionaries to be found. DOES NOT search recursively (through all folders in path) to avoid time intensive search. Set to `"choose"` to open an interactive directory explorer
`keepStrings`	Boolean. Should strings be retained or separated? Defaults to `FALSE`. Set to `TRUE` to retain strings as strings
`allowPunctuations`	Character vector. Allows punctuation characters to be included in responses. Defaults to `"-"`. Set to `"all"` to keep all punctuation characters
`allowNumbers`	Boolean. Defaults to `FALSE`. Set to `TRUE` to keep numbers in text
`lowercase`	Boolean. Should words be converted to lowercase? Defaults to `TRUE`. Set to `FALSE` to keep words as they are
`continue`	List. A previously unfinished result that still needs to be completed. Allows you to continue manually spell-checking your data after you've closed the function or it has errored out. Defaults to `NULL`
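
As a rough illustration of how `data`, `miss`, and `partBY` relate, the sketch below builds a small by-row dataset in base R and counts usable responses per participant. The object `toy` and its column names are hypothetical and not part of the package; the final call is guarded so that it only runs interactively with SemNetCleaner installed, and it assumes the `ID` column is picked up by the automatic ID detection described under `data`.

```r
# Hypothetical toy data: three participants by row, up to three responses each.
# "99" marks a missing response, mirroring the default `miss = 99`.
toy <- data.frame(
  ID = c("p1", "p2", "p3"),
  r1 = c("cat", "dogs", "99"),
  r2 = c("dog", "99", "99"),
  r3 = c("99", "99", "99"),
  stringsAsFactors = FALSE
)

# Recode the missing-data value to NA and count usable responses per participant
responses <- as.matrix(toy[, -1])
responses[responses == "99"] <- NA
n_responses <- rowSums(!is.na(responses))
names(n_responses) <- toy$ID

# If participants were stored by column instead, the same data would look
# like this, and `partBY = "col"` would be the matching setting
by_col <- t(responses)

# Guarded call: skipped in non-interactive sessions or without the package
if (interactive() && requireNamespace("SemNetCleaner", quietly = TRUE)) {
  clean <- SemNetCleaner::textcleaner(toy, miss = 99, partBY = "row",
                                      dictionary = "animals")
}
```

Here participant `p3` has no usable responses, which is the kind of case reported under `removed` in the returned list.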

Value

This function returns a list containing the following objects:

`binary`	A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is a '`1`' and a response that a participant has not provided is a '`0`'
`responses`	A list containing two objects: `clean` A response matrix that has been spell-checked and de-pluralized with duplicates removed. This can be used as a final dataset for analyses (e.g., fluency of responses) `original` The original response matrix with white space before and after each response removed and all upper-case letters converted to lower case
`spellcheck`	A list containing three objects: `full` All responses regardless of spell-checking changes `auto` Only the incorrect responses that were changed during spell-check
`removed`	A list containing two objects: `rows` Identifies removed participants by their row (or column) location in the original data file `ids` Identifies removed participants by their ID (see argument `data`)
`partChanges`	A list where each participant is a list index containing each response that has been changed. Participants are identified by their ID (see argument `data`). This can be used to replicate the cleaning process and to keep track of changes more generally. Participants with `NA` did not have any changes from their original data, and participants with missing data are removed (see `removed$ids`)
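
To make the shape of `binary` concrete, the following base R sketch derives a participant-by-response 0/1 matrix from a small cleaned response matrix. It is only an illustration of the structure described above, not the package's implementation; the objects `clean_toy`, `vocab`, and `binary_toy` are hypothetical.

```r
# Hypothetical cleaned responses: one row per participant, NA for no response
clean_toy <- rbind(
  p1 = c("cat", "dog", NA),
  p2 = c("dog", "horse", "cat"),
  p3 = c("horse", NA, NA)
)

# Unique responses across all participants become the columns
vocab <- sort(unique(na.omit(as.vector(clean_toy))))

# 1 if the participant gave the response, 0 otherwise
binary_toy <- t(apply(clean_toy, 1, function(x) as.integer(vocab %in% x)))
colnames(binary_toy) <- vocab

# Fluency as the number of unique responses per participant
fluency <- rowSums(binary_toy)
```

Row sums of such a matrix give each participant's number of unique responses, and column sums give how many participants provided each response.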

Author(s)

Alexander Christensen <alexpaulchristensen@gmail.com>

References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling! The R Journal, 3, 22-28.

Examples

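# Load the package (provides textcleaner)
library(SemNetCleaner)

# Toy example: first 10 participants, dropping columns 1 to 3
raw <- open.animals[c(1:10),-c(1:3)]

# Manual spell-checking needs user input, so run only in interactive sessions
if(interactive())
{
    # Full test
    clean <- textcleaner(open.animals[,-c(1,2)], partBY = "row", dictionary = "animals")
}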

[Package SemNetCleaner version 1.3.4 Index]