DataCleaner {wordpredictor} | R Documentation |
Provides data cleaning functionality
Description
It provides a memory-efficient method for removing unneeded characters from text files, which makes it suitable for cleaning large text files.
Details
It provides a method for cleaning text files. It can remove bad words, stop words, non-dictionary words, extra spaces, punctuation and non-alphabet characters. It can also convert the text to lower case. It supports large text files.
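The cleaning steps listed above can be approximated in base R. The sketch below is not the package implementation: clean_sketch is a hypothetical helper, and the order of the steps and the handling of apostrophes are assumptions made for illustration.

```r
# Hypothetical helper, not part of wordpredictor. It approximates some of
# the cleaning steps described above using base R only.
clean_sketch <- function(lines) {
    # Convert the text to lower case
    lines <- tolower(lines)
    # Drop apostrophes so that contractions collapse ("it's" -> "its")
    lines <- gsub("'", "", lines, fixed = TRUE)
    # Replace every character that is not a lower case letter or a space
    lines <- gsub("[^a-z ]", " ", lines)
    # Collapse repeated spaces, then trim leading and trailing spaces
    lines <- gsub(" +", " ", lines)
    trimws(lines)
}

cleaned <- clean_sketch(c("  Hello,   World!  ", "It's 90percent DONE."))
print(cleaned)
```

Dictionary, stop word and bad word removal need word lists and are not covered by this sketch.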
Super class
wordpredictor::Base
-> DataCleaner
Methods
Method new()
It initializes the current object. It is used to set the file name and verbose options.
Usage
DataCleaner$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn
The path to the file to clean.
opts
The options for data cleaning.
- min_words. The minimum number of words per sentence.
- line_count. The number of lines to read and clean at a time.
- save_data. If the combined processed lines should be saved.
- output_file. Name of the output file used to store the data.
- sw_file. The stop words file path.
- dict_file. The dictionary file path.
- bad_file. The bad words file path.
- to_lower. If the words should be converted to lower case.
- remove_stop. If stop words should be removed.
- remove_punct. If punctuation symbols should be removed.
- remove_non_dict. If non dictionary words should be removed.
- remove_non_alpha. If non alphabet symbols should be removed.
- remove_extra_space. If leading, trailing and double spaces should be removed.
- remove_bad. If bad words should be removed.
ve
The level of detail in the information messages.
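As a sketch of how the opts argument might be put together, the list below uses the option names documented above. The values are illustrative assumptions and are not the package defaults, which this page does not state. The example on this page passes only output_file, which suggests that omitted options fall back to defaults.

```r
# Illustrative values only; these are assumptions, not the package defaults.
dc_opts <- list(
    "min_words" = 2,
    "line_count" = 1000,
    "save_data" = TRUE,
    "output_file" = file.path(tempdir(), "test-clean.txt"),
    "to_lower" = TRUE,
    "remove_stop" = FALSE,
    "remove_punct" = TRUE,
    "remove_non_dict" = FALSE,
    "remove_non_alpha" = TRUE,
    "remove_extra_space" = TRUE,
    "remove_bad" = FALSE
)
# With wordpredictor loaded, the list would be passed as the second
# argument ("path/to/input.txt" is a placeholder path):
# dc <- DataCleaner$new(fn = "path/to/input.txt", opts = dc_opts, ve = 0)
str(dc_opts)
```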
Method clean_file()
It provides an efficient method for cleaning text files and is suitable for large files.
Depending on the options passed to the constructor, it removes punctuation, bad words, stop words, non-alphabetical symbols and non-dictionary words. It reads a fixed number of lines at a time from the given text file (the line_count option), cleans those lines and then saves them to the output text file.
Cleaning progress is displayed if the verbose option (ve) was set in the class constructor.
Usage
DataCleaner$clean_file()
Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The cleaned test file name
cfn <- paste0(ed, "/test-clean.txt")
# The test file name
fn <- paste0(ed, "/test.txt")
# The data cleaning options
dc_opts <- list("output_file" = cfn)
# The data cleaner object is created
dc <- DataCleaner$new(fn, dc_opts, ve = ve)
# The sample file is cleaned
dc$clean_file()
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()
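The read-clean-save cycle described for clean_file() can be sketched with base R file connections. This is a minimal illustration of processing a file a few lines at a time, not the package's actual code: clean_in_chunks and its clean_fn argument are hypothetical names, and tolower stands in for the real cleaning step.

```r
# Hypothetical helper, not part of wordpredictor. It reads the input file
# line_count lines at a time, cleans each chunk and appends it to the output.
clean_in_chunks <- function(input, output, line_count = 2, clean_fn = tolower) {
    con_in <- file(input, open = "r")
    con_out <- file(output, open = "w")
    # Both connections are closed when the function exits
    on.exit({
        close(con_in)
        close(con_out)
    })
    repeat {
        # Read at most line_count lines; an empty result means end of file
        lines <- readLines(con_in, n = line_count, warn = FALSE)
        if (length(lines) == 0) break
        writeLines(clean_fn(lines), con_out)
    }
    invisible(output)
}

# Usage with temporary files
in_file <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")
writeLines(c("First LINE", "Second LINE", "Third LINE"), in_file)
clean_in_chunks(in_file, out_file, line_count = 2)
result <- readLines(out_file)
print(result)
```

Only one chunk of lines is held in memory at a time, which is the property that keeps this pattern usable on large files.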
Method clean_lines()
It cleans the given lines of text using the options passed to the current object.
Usage
DataCleaner$clean_lines(lines)
Arguments
lines
The input sentences.
Returns
The cleaned lines of text.
Examples
# The level of detail in the information messages
ve <- 0
# Test data is read
l <- c(
"If you think I'm wrong, send me a link to where it's happened",
"We're about 90percent done with this room",
"This isn't how I wanted it between us.",
"Almost any cute breed can become ornamental",
"Once upon a time there was a kingdom with a castle",
"That's not a thing any of us are granted'",
"Why are you being so difficult? she asks."
)
# The expected results
res <- c(
"if you think wrong send me a link to where its happened",
"were about percent done with this room",
"this how i wanted it between us",
"almost any cute breed can become ornamental",
"once upon a time there was a kingdom with a castle",
"thats not a thing any of us are granted",
"why are you being so difficult she asks"
)
# The DataCleaner object is created
dc <- DataCleaner$new(ve = ve)
# The lines are cleaned
cl <- dc$clean_lines(l)
# The cleaned lines are printed
print(cl)
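Word-level options such as remove_stop, remove_bad and min_words can be pictured as filtering tokens against a word list. The sketch below is an assumption about how such filtering could work, not the package's method: remove_words is a hypothetical helper, the word list is made up, and dropping sentences shorter than min_words is an assumed reading of that option.

```r
# Hypothetical helper, not part of wordpredictor. It removes the given
# words from each line, then drops lines with fewer than min_words words.
remove_words <- function(lines, words, min_words = 1) {
    # Split each line into tokens on runs of spaces
    tokens <- strsplit(lines, " +")
    kept <- vapply(tokens, function(tk) {
        # Keep non-empty tokens that are not in the word list
        paste(tk[!(tk %in% words) & nzchar(tk)], collapse = " ")
    }, character(1))
    # Count the words left on each line
    n_words <- lengths(strsplit(kept, " ", fixed = TRUE))
    kept[n_words >= min_words]
}

filtered <- remove_words(
    c("this is a castle", "the a"),
    words = c("a", "the", "is"),
    min_words = 2
)
print(filtered)
```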
Method clone()
The objects of this class are cloneable with this method.
Usage
DataCleaner$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `DataCleaner$clean_file`
## ------------------------------------------------
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The cleaned test file name
cfn <- paste0(ed, "/test-clean.txt")
# The test file name
fn <- paste0(ed, "/test.txt")
# The data cleaning options
dc_opts <- list("output_file" = cfn)
# The data cleaner object is created
dc <- DataCleaner$new(fn, dc_opts, ve = ve)
# The sample file is cleaned
dc$clean_file()
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()
## ------------------------------------------------
## Method `DataCleaner$clean_lines`
## ------------------------------------------------
# The level of detail in the information messages
ve <- 0
# Test data is read
l <- c(
"If you think I'm wrong, send me a link to where it's happened",
"We're about 90percent done with this room",
"This isn't how I wanted it between us.",
"Almost any cute breed can become ornamental",
"Once upon a time there was a kingdom with a castle",
"That's not a thing any of us are granted'",
"Why are you being so difficult? she asks."
)
# The expected results
res <- c(
"if you think wrong send me a link to where its happened",
"were about percent done with this room",
"this how i wanted it between us",
"almost any cute breed can become ornamental",
"once upon a time there was a kingdom with a castle",
"thats not a thing any of us are granted",
"why are you being so difficult she asks"
)
# The DataCleaner object is created
dc <- DataCleaner$new(ve = ve)
# The lines are cleaned
cl <- dc$clean_lines(l)
# The cleaned lines are printed
print(cl)