stem {arabicStemR} | R Documentation |
Arabic Stemmer for Text Analysis
Description
Allows users to stem Arabic texts for text analysis. Now deprecated. Please use stemArabic.
Usage
stem(dat, cleanChars = TRUE, cleanLatinChars = TRUE,
transliteration = TRUE, returnStemList = FALSE,
defaultStopwordList=TRUE, customStopwordList=NULL,
dontStemTheseWords = c("allh", "llh"))
Arguments
dat |
The original data, as a vector of length one containing the text. |
cleanChars |
Removes all unicode characters except Latin characters and Arabic alphabet |
cleanLatinChars |
Removes Latin characters |
transliteration |
Transliterates the text |
returnStemList |
Performs stemming by removing prefixes and suffixes |
defaultStopwordList |
If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE. |
customStopwordList |
Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL. |
dontStemTheseWords |
Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in unicode Arabic. If a term matches an element of this argument at any intermediate point in stemming, that term will not be stemmed further. The default is c("allh","llh") because in most applications, stemming these common words for "God" creates some confusion by resulting in the string "lh". |
Details
stem
prepares texts in Arabic for text analysis by stemming.
Value
stem
returns a named list with the following elements:
text |
The stemmed text |
stemlist |
A list of the stemmed words. |
Author(s)
Rich Nielsen
Examples
## generate some text in Arabic
x <- "\u628\u633\u645 \u0627\u0644\u0644\u0647
\u0627\u0644\u0631\u062D\u0645\u0646
\u0627\u0644\u0631\u062D\u064A\u0645"
## stem and transliterate
## NOTE: the "stem()" function only accepts a vector of length 1.
## The function is deprecated in favor of stemArabic() which accepts vectors with multiple elements.
stem(x)
## stem while not stemming certain words
stem(x, dontStemTheseWords = c("alr7mn"))
## stem and return the stemlist
out <- stem(x,returnStemList=TRUE)
out$text
out$stemlist