removePrefixes {arabicStemR} | R Documentation |
Remove Arabic prefixes
Description
Removes some Arabic prefixes from a Unicode string. The prefixes are: "waw", "alif-lam", "waw-alif-lam", "ba-alif-lam", "kaf-alif-lam", "fa-alif-lam", and "lam-lam". A prefix is removed from a word (words are delimited by spaces) only if the remaining stem would not be too short; the arguments x1 through x7 set the length threshold for each prefix.
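The length rule described above can be illustrated with a minimal base R sketch. The helper strip_prefix_sketch is hypothetical and is not the package's implementation; it handles a single prefix and assumes the threshold is an inclusive minimum on the length of the whole word, which this page does not confirm.

```r
# Hypothetical helper, NOT part of arabicStemR: strips one prefix from each
# space-delimited word, but only from words with at least `min_len` letters.
# ASSUMPTION: the threshold is an inclusive minimum on the full word length.
strip_prefix_sketch <- function(text, prefix, min_len) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  has_prefix <- startsWith(words, prefix)
  long_enough <- nchar(words, type = "chars") >= min_len
  strip <- has_prefix & long_enough
  # Drop the leading prefix characters from the qualifying words only
  words[strip] <- substring(words[strip], nchar(prefix, type = "chars") + 1)
  paste(words, collapse = " ")
}

alif_lam <- "\u0627\u0644"

# Five-letter word starting with alif-lam: long enough, so the prefix is removed
print(strip_prefix_sketch("\u0627\u0644\u0644\u063a\u0629", alif_lam, min_len = 4))

# Three-letter word starting with alif-lam: too short, so it is left unchanged
print(strip_prefix_sketch("\u0627\u0644\u0649", alif_lam, min_len = 4))
```

Checking the length before stripping is what keeps short words that merely begin with the same letters as a prefix from being reduced to a one-letter stem.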
Usage
removePrefixes(texts, x1 = 4, x2 = 4, x3 = 5, x4 = 5, x5 = 5, x6 = 5, x7 = 4,
dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))
Arguments
texts |
An Arabic-language string in Unicode. |
x1 |
The number of letters that must be in a word for the function to remove the prefix "waw". |
x2 |
The number of letters that must be in a word for the function to remove the prefix "alif-lam". |
x3 |
The number of letters that must be in a word for the function to remove the prefix "waw-alif-lam". |
x4 |
The number of letters that must be in a word for the function to remove the prefix "ba-alif-lam". |
x5 |
The number of letters that must be in a word for the function to remove the prefix "kaf-alif-lam". |
x6 |
The number of letters that must be in a word for the function to remove the prefix "fa-alif-lam". |
x7 |
The number of letters that must be in a word for the function to remove the prefix "lam-lam". |
dontstem |
Words that should not be stemmed (entered in Unicode). |
Value
Returns a string with Arabic prefixes removed.
Author(s)
Rich Nielsen
Examples
## Create string with Arabic characters
x <- paste('\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629',
'\u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627')
# Remove Prefixes
removePrefixes(x)
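A further sketch of calling the function with non-default arguments, using only the parameters listed under Usage. Going by the argument descriptions, raising x2 should leave shorter "alif-lam" words intact, and adding a word to dontstem should protect it from stemming; the exact output is not shown here because it has not been verified. The guard assumes arabicStemR may not be installed.

```r
# Same example string as above, built on two lines for readability
x <- paste("\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629",
           "\u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627")

if (requireNamespace("arabicStemR", quietly = TRUE)) {
  # Require more letters before "alif-lam" is removed
  stricter <- arabicStemR::removePrefixes(x, x2 = 6)
  print(stricter)

  # Keep the default protected words and add the first word of x to them
  protected <- arabicStemR::removePrefixes(
    x,
    dontstem = c("\u0627\u0644\u0644\u0647", "\u0644\u0644\u0647",
                 "\u0627\u0644\u0644\u063a\u0629")
  )
  print(protected)
}
```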