TokenGenerator {wordpredictor} | R Documentation
Generates n-grams from text files
Description
Generates n-gram tokens along with their frequencies. The data may be saved to a file in plain text format or as an R object.
Super class
wordpredictor::Base
-> TokenGenerator
Methods
Public methods
Inherited methods
Method new()
It initializes the current object. It is used to set the file name, the tokenization options and the verbose option.
Usage
TokenGenerator$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn
The path to the input file.
opts
The options for generating the n-gram tokens.
- n. The n-gram size.
- save_ngrams. If the n-gram data should be saved.
- min_freq. All n-grams with frequency less than min_freq are ignored.
- line_count. The number of lines to process at a time.
- stem_words. If words should be transformed to their stems.
- dir. The directory where the output file should be saved.
- format. The format for the output. There are two options.
  - plain. The data is stored in plain text.
  - obj. The data is stored as an R object.
ve
The level of detail in the information messages.
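As a rough illustration of the arguments above, the sketch below builds an opts list that sets every documented option. The option names are taken from this page; the values are arbitrary choices made for illustration, and the input path in the commented call is hypothetical. The example later on this page passes only n, save_ngrams and dir, which suggests the remaining options have defaults, though that is an inference rather than something stated here.

```r
# Illustrative values only; the option names come from the opts argument.
tg_opts <- list(
  n = 3,                # trigram tokens
  save_ngrams = TRUE,   # save the n-gram data
  min_freq = 2,         # ignore n-grams seen fewer than 2 times
  line_count = 5000,    # number of lines processed at a time
  stem_words = FALSE,   # keep words as they are, no stemming
  dir = tempdir(),      # output directory (a temporary directory here)
  format = "plain"      # "plain" for plain text, "obj" for an R object
)

# The list would then be passed as the opts argument, for example:
# tg <- TokenGenerator$new(fn = "path/to/input.txt", opts = tg_opts, ve = 0)
# ("path/to/input.txt" is a hypothetical path.)
str(tg_opts)
```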
Method generate_tokens()
It generates n-gram tokens and their frequencies from the given input file. The tokens may be saved to a file as plain text or as an R object.
Usage
TokenGenerator$generate_tokens()
Returns
The data frame containing n-gram tokens along with their frequencies.
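To make the idea of n-gram tokens with frequencies concrete, here is a minimal base R sketch that counts n-grams in a few lines of text and drops those below a minimum frequency. It is not the wordpredictor implementation: the helper count_ngrams() and the column names token and freq are hypothetical, and the whitespace-based tokenization is a simplifying assumption.

```r
# Illustrative sketch only; not how TokenGenerator is implemented.
# count_ngrams() and the column names "token" and "freq" are hypothetical.
count_ngrams <- function(lines, n = 2, min_freq = 1) {
  # Lower-case each line and split it into words on whitespace
  words_per_line <- strsplit(tolower(trimws(lines)), "\\s+")
  ngrams <- unlist(lapply(words_per_line, function(w) {
    w <- w[nzchar(w)]
    # Lines with fewer than n words yield no n-grams
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(
      starts,
      function(i) paste(w[i:(i + n - 1)], collapse = " "),
      character(1)
    )
  }))
  # Count how often each distinct n-gram occurs
  freq <- table(ngrams)
  df <- data.frame(
    token = names(freq),
    freq = as.integer(freq),
    stringsAsFactors = FALSE
  )
  # Ignore n-grams with frequency less than min_freq
  df <- df[df$freq >= min_freq, , drop = FALSE]
  # Most frequent n-grams first
  df[order(-df$freq, df$token), , drop = FALSE]
}

res <- count_ngrams(c("the cat sat", "the cat ran"), n = 2, min_freq = 2)
print(res)
```

With min_freq = 2, only the bigram "the cat" survives, since "cat sat" and "cat ran" each occur once.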
Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The n-gram size
n <- 4
# The test file name
tfn <- paste0(ed, "/test-clean.txt")
# The n-gram number is set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()
# The test environment is removed. Comment out the line below to
# view the files generated by the function
em$td_env()
Method clone()
The objects of this class are cloneable with this method.
Usage
TokenGenerator$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `TokenGenerator$generate_tokens`
## ------------------------------------------------
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The n-gram size
n <- 4
# The test file name
tfn <- paste0(ed, "/test-clean.txt")
# The n-gram number is set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()
# The test environment is removed. Comment out the line below to
# view the files generated by the function
em$td_env()