sentenceTokenParse {lexRankr} | R Documentation |
Parse text into sentences and tokens
Description
Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes IDs for document and sentence for use in other lexRank functions.
Usage
sentenceTokenParse(text, docId = "create", removePunc = TRUE,
removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
rmStopWords = TRUE)
Arguments
text |
A character vector of documents to be parsed into sentences and tokenized. |
docId |
A character vector of document Ids the same length as |
removePunc |
|
removeNum |
|
toLower |
|
stemWords |
|
rmStopWords |
|
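As a rough illustration of how the logical arguments can be overridden, the sketch below parses the same documents twice: once with the defaults from the Usage section, and once with stemming and stop word removal switched off. The effect attributed to each flag is inferred from its name and is an assumption, and the input strings are invented for illustration.

```r
library(lexRankr)

# Hypothetical input documents, made up for this sketch
docs <- c("Dogs are running in the park.", "The cats were sleeping.")

# Default call: every cleaning flag is left at TRUE
parsed_default <- sentenceTokenParse(docs, docId = c("d1", "d2"))

# Same documents with stemming and stop word removal switched off
parsed_raw <- sentenceTokenParse(docs, docId = c("d1", "d2"),
                                 stemWords = FALSE, rmStopWords = FALSE)

# Compare the token counts of the two runs; the tokens data frame is
# taken to be the second list element, as stated under Value
nrow(parsed_default[[2]])
nrow(parsed_raw[[2]])
```

Comparing the two token data frames side by side is a quick way to check what each flag does to a given corpus before passing the output to other functions.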
Value
A list of dataframes. The first element of the list returned is the sentences dataframe; this dataframe has columns docId, sentenceId, & sentence (the actual text of the sentence). The second element of the list returned is the tokens dataframe; this dataframe has columns docId, sentenceId, & token (the actual text of the token).
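A minimal sketch of how the returned list might be unpacked, assuming the element order described above (sentences first, tokens second). The elements are accessed by position because this page does not state whether they are named; the variable names `parsed`, `sentences`, and `tokens` are chosen here for illustration.

```r
library(lexRankr)

parsed <- sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                             docId = c("d1", "d2"))

# Positional access, following the order given in the Value description
sentences <- parsed[[1]]
tokens    <- parsed[[2]]

# Column names of each data frame
names(sentences)
names(tokens)

# Tokens belonging to the first document only
tokens[tokens$docId == "d1", ]
```

Because both data frames share the docId and sentenceId columns, tokens can be matched back to the sentence they came from with an ordinary `merge()` on those two columns.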
Examples
library(lexRankr)
sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId = c("d1", "d2"))