slim_text {chinese.misc} | R Documentation |
Remove Words through Speech Tagging
Description
The function calls jiebaR::tagging
to perform part-of-speech tagging on a Chinese text, and then
removes words that have certain tags.
Usage
slim_text(
x,
mycutter = DEFAULT_cutter,
rm_place = TRUE,
rm_time = TRUE,
rm_eng = FALSE,
rm_alpha = FALSE,
paste = TRUE
)
Arguments
x |
a character vector of length 1 containing the Chinese text to be tagged |
mycutter |
a jiebaR cutter (worker) supplied by the user to tag the text. It has a default value; see Details. |
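rm_place |
TRUE or FALSE; whether to remove words related to a specified place (tag "ns"). Default is TRUE. |
rm_time |
TRUE or FALSE; whether to remove time related words (tag "t"). Default is TRUE. |
rm_eng |
TRUE or FALSE; whether to remove words tagged as English (tag "eng"). Default is FALSE. |
rm_alpha |
should be "any", TRUE or FALSE (default). When TRUE, any word that consists only of the letters a-zA-Z is removed, even when rm_eng = FALSE. |
paste |
TRUE or FALSE; whether to paste the retained words together into a length 1 character. Default is TRUE. |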
Details
Stop words are often removed from texts, but a stop word list rarely covers every word that needs to be removed. So, before removing stop words, we can remove many insignificant words by tagging and thereby make the texts "slim". The webpage http://www.docin.com/p-341417726.html?_t_t_t=0.3930890985844252 provides details about Chinese word tags.
Only words with the following tags are to be preserved:
(1) "n": nouns;
(2) "t": time related words;
(3) "s": space related words;
(4) "v": verbs;
(5) "a": adjectives;
(6) "b": words only used as attributes in Chinese;
(7) "x": strings;
(8) "j", "l", "i", "z": some specific Chinese letters and phrases;
(9) "unknown": words of unknown type;
(10) "eng": English words.
Optionally, words related to a specified place ("ns"), time related words ("t") and English words ("eng") can be removed through the rm_place, rm_time and rm_eng arguments.
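The filtering rule above can be sketched in base R. This is only an illustration, not the package's actual implementation: the tagged vector is hand-written (it is not real jiebaR output), and it assumes that a tagging result is a character vector whose names are the tags and that a tag is kept when it starts with one of the listed prefixes.

```r
# Hypothetical tagging result: values are words, names are their tags.
# (Hand-written for illustration; not produced by jiebaR.)
tagged <- c(n = "会议", ns = "北京", t = "昨天", v = "召开",
            uj = "的", eng = "hello", p = "在")

# Tags to preserve, following the list in Details.
# Assumption: a tag is kept if it starts with one of these prefixes.
keep_prefix <- c("n", "t", "s", "v", "a", "b", "x",
                 "j", "l", "i", "z", "unknown", "eng")
pattern <- paste0("^(", paste(keep_prefix, collapse = "|"), ")")
keep <- grepl(pattern, names(tagged))

# Optional removals, using the same defaults as slim_text().
rm_place <- TRUE
rm_time <- TRUE
rm_eng <- FALSE
if (rm_place) keep <- keep & names(tagged) != "ns"
if (rm_time) keep <- keep & names(tagged) != "t"
if (rm_eng) keep <- keep & names(tagged) != "eng"

slim <- unname(tagged[keep])         # character vector of retained words
pasted <- paste(slim, collapse = " ") # single string, as with paste = TRUE
```

With these made-up inputs, the particle ("uj") and the preposition ("p") are dropped by the tag list, while the place name and the time word are dropped by the optional switches.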
By default, the mycutter argument uses DEFAULT_cutter, which is assigned as worker(write = FALSE) when the package is loaded. As long as you have not manually created another variable called "DEFAULT_cutter", you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words. Note that whether or not you manually create an object called "DEFAULT_cutter", the original DEFAULT_cutter loaded with the package, which the functions in this package use by default, will not be removed. So, whenever you want to use this default value, simply leave mycutter unset.
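As a minimal sketch of the workflow described above, the following adds a user word to the default cutter and then calls slim_text without setting mycutter. The word, its tag and the sample sentence are made up for illustration, and the code is guarded so that it only runs when both packages are installed.

```r
# Track whether the optional part below actually ran.
added <- FALSE

if (requireNamespace("jiebaR", quietly = TRUE) &&
    requireNamespace("chinese.misc", quietly = TRUE)) {
  library(jiebaR)
  library(chinese.misc)

  # Add a made-up word, tagged as a noun, to the default cutter.
  # This assumes no other object named DEFAULT_cutter exists in the workspace.
  new_user_word(DEFAULT_cutter, "某新词", "n")

  # mycutter is left unset, so the default cutter is used.
  result <- slim_text("这是一个关于某新词的例子")
  added <- TRUE
}
```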
Value
a length 1 character of segmented text (when paste = TRUE), or a character vector, each element of which is a word (when paste = FALSE).
Examples
require(jiebaR)
cutter <- jiebaR::worker()
# Give some English words a new tag.
new_user_word(cutter, c("aaa", "bbb", "ccc"), rep("x", 3))
x <- "we have new words: aaa, bbb, ccc."
# The default is to keep English words.
slim_text(x, mycutter = cutter)
# Remove words tagged as "eng" but others are kept.
slim_text(x, mycutter = cutter, rm_eng = TRUE)
# Remove any word that only has a-zA-Z,
# even when rm_eng = FALSE.
slim_text(x, mycutter = cutter, rm_eng = TRUE, rm_alpha = TRUE)
slim_text(x, mycutter = cutter, rm_eng = FALSE, rm_alpha = TRUE)