R: Parse text into sentences and tokens

sentenceTokenParse {lexRankr}

R Documentation

Parse text into sentences and tokens

Description

Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes document and sentence IDs for use in other lexRank functions.

Usage

sentenceTokenParse(text, docId = "create", removePunc = TRUE,
  removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
  rmStopWords = TRUE)

Arguments

`text`	A character vector of documents to be parsed into sentences and tokenized.
`docId`	A character vector of document Ids the same length as `text`. If `docId == "create"` (the default), document Ids will be created automatically.
`removePunc`	`TRUE` or `FALSE` indicating whether or not to remove punctuation from `text` while tokenizing. If `TRUE`, punctuation will be removed. Defaults to `TRUE`.
`removeNum`	`TRUE` or `FALSE` indicating whether or not to remove numbers from `text` while tokenizing. If `TRUE`, numbers will be removed. Defaults to `TRUE`.
`toLower`	`TRUE` or `FALSE` indicating whether or not to coerce all of `text` to lowercase while tokenizing. If `TRUE`, `text` will be coerced to lowercase. Defaults to `TRUE`.
`stemWords`	`TRUE` or `FALSE` indicating whether or not to stem resulting tokens. If `TRUE`, the output tokens will be stemmed using `SnowballC::wordStem()`. Defaults to `TRUE`.
`rmStopWords`	`TRUE`, `FALSE`, or character vector of stopwords to remove from tokens. If `TRUE`, words in `lexRankr::smart_stopwords` will be removed prior to stemming. If `FALSE`, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to `TRUE`.
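The cleaning arguments above can be combined freely. The following is a minimal sketch, assuming the lexRankr package is installed; the stopword vector is an arbitrary illustration, not a recommended list. It keeps numbers, skips stemming, and passes a custom character vector to `rmStopWords` instead of relying on `lexRankr::smart_stopwords`.

```r
library(lexRankr)

docs <- c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA.")

# Illustrative stopword list (an assumption for this sketch only)
myStopWords <- c("is", "to", "a", "have")

# Keep numbers, skip stemming, and remove only the custom stopwords
parsed <- sentenceTokenParse(
  docs,
  docId       = c("d1", "d2"),
  removeNum   = FALSE,
  stemWords   = FALSE,
  rmStopWords = myStopWords
)

# The second list element is the tokens dataframe
parsed[[2]]
```

Because `toLower` is left at its default of `TRUE`, the stopwords in this sketch are written in lowercase so that they match the lowercased tokens.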

Value

A list of two dataframes. The first element of the list is the sentences dataframe, with columns docId, sentenceId, and sentence (the actual text of the sentence). The second element of the list is the tokens dataframe, with columns docId, sentenceId, and token (the actual text of the token).
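Since both dataframes share the docId and sentenceId columns, they can be linked back together with base R. The sketch below assumes the lexRankr package is installed; the names `tokenCounts`, `nTokens`, and `merged` are hypothetical and chosen only for this illustration. It accesses the two elements by position, as described above, and counts the tokens that survive cleaning in each sentence.

```r
library(lexRankr)

parsed <- sentenceTokenParse(
  c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
  docId = c("d1", "d2")
)

# Access the two dataframes by position
sentences <- parsed[[1]]  # columns: docId, sentenceId, sentence
tokens    <- parsed[[2]]  # columns: docId, sentenceId, token

# Count the tokens remaining in each sentence after cleaning
tokenCounts <- aggregate(token ~ docId + sentenceId, data = tokens, FUN = length)
names(tokenCounts)[names(tokenCounts) == "token"] <- "nTokens"

# Attach the counts to the sentence text via the shared ID columns
merged <- merge(sentences, tokenCounts, by = c("docId", "sentenceId"))
merged
```

Sentences whose tokens were all removed during cleaning would not appear in `merged`; passing `all.x = TRUE` to `merge()` keeps them with an `NA` count.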

Examples

sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId=c("d1","d2"))

[Package lexRankr version 0.5.2 Index]