slim_text {chinese.misc}    R Documentation

Remove Words through Speech Tagging

Description

The function calls jiebaR::tagging to perform part-of-speech tagging on a Chinese text, and then removes words that carry certain tags.

Usage

slim_text(
  x,
  mycutter = DEFAULT_cutter,
  rm_place = TRUE,
  rm_time = TRUE,
  rm_eng = FALSE,
  rm_alpha = FALSE,
  paste = TRUE
)

Arguments

x

a character vector of length 1 containing the Chinese text to be tagged.

mycutter

a jiebaR cutter (worker) supplied by the user to tag the text. It has a default value; see Details.

rm_place

TRUE or FALSE. If TRUE (default), words that refer to a specific place (tag "ns") are removed.

rm_time

TRUE or FALSE. If TRUE (default), time-related words (tag "t") are removed.

rm_eng

TRUE or FALSE. If TRUE, English words are removed. The default is FALSE.

rm_alpha

"any", TRUE or FALSE (default). Some English words are tagged as "x" rather than "eng", so they cannot be removed by setting rm_eng. When rm_alpha is TRUE, any word that contains only a-zA-Z is removed. When it is "any", words that mix a-zA-Z with Chinese characters or digits are removed as well.

paste

TRUE or FALSE: whether to paste the segmented words together into a character vector of length 1. The default is TRUE.

Details

Stop words are often removed from texts, but a stop word list can hardly include every word that needs to be removed. So, before removing stop words, we can use tagging to remove many insignificant words and make the texts "slim". The webpage http://www.docin.com/p-341417726.html?_t_t_t=0.3930890985844252 provides details about Chinese word tags.

Only words with the following tags are to be preserved:

Optionally, words related to a specified place ("ns"), time-related words ("t") and English words ("eng") can be removed.
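As a rough illustration of the optional removal described above, the following base R sketch filters a tagged vector by tag. It assumes the input has the shape returned by jiebaR::tagging, that is, a character vector of words whose names are the tags. The helper drop_tags, the sample words and their tags, and the space used when pasting are all assumptions made for this example, not the package's own code.

```r
# Made-up stand-in for the result of jiebaR::tagging(): words as values,
# tags as names.
tagged <- c(ns = "北京", t = "昨天", n = "天气", eng = "hello", v = "变化")

# Hypothetical helper, NOT the implementation used by slim_text.
drop_tags <- function(tagged, rm_place = TRUE, rm_time = TRUE, rm_eng = FALSE) {
  # Collect the tags the caller wants to drop.
  unwanted <- c(if (rm_place) "ns", if (rm_time) "t", if (rm_eng) "eng")
  kept <- tagged[!names(tagged) %in% unwanted]
  unname(kept)
}

default_kept <- drop_tags(tagged)                 # drops "ns" and "t"
no_english   <- drop_tags(tagged, rm_eng = TRUE)  # also drops "eng"

# Assumption: pasted output joins the words with single spaces.
pasted <- paste(no_english, collapse = " ")
```

The sketch covers only the optional tags; it does not model the list of tags that slim_text always preserves.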

By default, the mycutter argument uses DEFAULT_cutter, which is assigned as worker(write = FALSE) when the package is loaded. As long as you have not manually created another variable called "DEFAULT_cutter", you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words. Whether or not you manually create an object called "DEFAULT_cutter", the DEFAULT_cutter loaded with the package, which functions in this package use by default, is not removed. So, whenever you want to use this default value, simply leave mycutter unset.

Value

a character vector of length 1 holding the segmented text (when paste = TRUE), or a character vector in which each element is a word (when paste = FALSE).

Examples


require(jiebaR)
cutter <- jiebaR::worker()
# Give some English words the tag "x" instead of "eng".
new_user_word(cutter, c("aaa", "bbb", "ccc"), rep("x", 3))
x <- "we have new words: aaa, bbb, ccc."
# The default is to keep English words.
slim_text(x, mycutter = cutter)
# Remove words tagged as "eng"; words with other tags are kept.
slim_text(x, mycutter = cutter, rm_eng = TRUE)
# Remove any word that consists only of a-zA-Z,
# whether rm_eng is TRUE or FALSE.
slim_text(x, mycutter = cutter, rm_eng = TRUE, rm_alpha = TRUE)
slim_text(x, mycutter = cutter, rm_eng = FALSE, rm_alpha = TRUE)


[Package chinese.misc version 0.2.3 Index]