mallet.import {mallet} | R Documentation |
Import text documents into Mallet format
Description
This function takes an array of document IDs and a character vector of document texts (one string per document) and converts them into a Mallet instance list.
Usage
mallet.import(
id.array = NULL,
text.array,
stoplist = "",
preserve.case = FALSE,
token.regexp = "[\\p{L}]+"
)
Arguments
id.array |
An array of document IDs. Default is NULL. |
text.array |
A character vector with each element containing a document. |
stoplist |
The name of a file containing stopwords (words to ignore), one per line, or a character vector containing stopwords. If the file is not in the current working directory, you may need to include a full path. Default is no stoplist. |
preserve.case |
If TRUE, the case of the input text is preserved. By default (FALSE), the input text is converted to all lowercase. |
token.regexp |
A quoted string representing a regular expression that defines a token. The default is one or more Unicode letters: "[\\p{L}]+". Note that backslashes in the pattern must be doubled inside an R string. |
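The effect of a token.regexp value can be previewed outside Mallet. The sketch below uses base R's PCRE engine (perl = TRUE) rather than the Java regex engine that Mallet runs on, so treat it as an approximation. The helper name preview_tokens and the sample sentence are made up for illustration and are not part of the mallet package.

```r
# Hypothetical helper (not part of the mallet package): list the substrings
# that a token pattern matches, using base R's PCRE engine.
preview_tokens <- function(text, token.regexp) {
  regmatches(text, gregexpr(token.regexp, text, perl = TRUE))[[1]]
}

sample_text <- "It's a well-known fact of life."

# Default pattern: runs of letters only, so punctuation splits words.
default_tokens <- preview_tokens(sample_text, "[\\p{L}]+")
print(default_tokens)

# Pattern used in the Examples section: a letter, then letters or
# punctuation, then a letter. It keeps word-internal punctuation but needs
# at least three characters, so "a" and "of" yield no token.
example_tokens <- preview_tokens(sample_text, "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
print(example_tokens)
```

Checking a pattern this way before importing can reveal side effects such as the silent loss of one- and two-letter words.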
Value
A cc/mallet/types/InstanceList object.
See Also
mallet.word.freqs returns term and document frequencies, which may be useful in selecting stopwords.
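To illustrate why term and document frequencies help with stopword selection, here is a base R sketch on a toy corpus. It is not the implementation of mallet.word.freqs, and the object and column names (docs, freqs, term_freq, doc_freq) are assumptions chosen for this example.

```r
# Toy corpus; the documents and names are illustrative only.
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat",
          d3 = "the cat and the dog")

# Tokenize each document on runs of letters (mirrors the default token.regexp).
tokens <- regmatches(docs, gregexpr("[\\p{L}]+", docs, perl = TRUE))

# Term frequency: total occurrences of each word across the corpus.
term_freq <- table(unlist(tokens))

# Document frequency: number of documents containing the word at least once.
doc_freq <- table(unlist(lapply(tokens, unique)))

freqs <- data.frame(word = names(term_freq),
                    term_freq = as.integer(term_freq),
                    doc_freq = as.integer(doc_freq[names(term_freq)]),
                    stringsAsFactors = FALSE)

# Words that occur in every document are candidate stopwords.
candidate_stopwords <- freqs$word[freqs$doc_freq == length(docs)]
print(candidate_stopwords)
```

A word that appears in every document carries little information for distinguishing topics, which is the usual argument for adding it to the stoplist.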
Examples
## Not run:
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## End(Not run)
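The stoplist argument is described above as accepting either a file name or a character vector. The following sketch shows both forms on a small inline corpus. The stopwords, document IDs, and texts are invented for illustration, and the import calls assume an installed mallet version with the signature shown under Usage plus a working Java setup, so they are guarded and skipped when the package cannot be loaded.

```r
# Illustrative stopwords; choose a list that suits your corpus.
my_stopwords <- c("the", "and", "of")

# Form 1: a stoplist file with one word per line.
stoplist_file <- file.path(tempdir(), "my_stoplist.txt")
writeLines(my_stopwords, stoplist_file)

# Small inline corpus: one ID per document text.
doc_ids <- c("doc1", "doc2")
doc_texts <- c("The history of the union", "Peace and prosperity")

# The import needs the mallet package (and Java), so run it only if available.
if (requireNamespace("mallet", quietly = TRUE)) {
  instances_from_file <- mallet::mallet.import(id.array = doc_ids,
                                               text.array = doc_texts,
                                               stoplist = stoplist_file)

  # Form 2: pass the character vector directly.
  instances_from_vector <- mallet::mallet.import(id.array = doc_ids,
                                                 text.array = doc_texts,
                                                 stoplist = my_stopwords)
}
```

Writing the file to tempdir() and passing its full path avoids any dependence on the current working directory.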