R: Remove Words through Speech Tagging

slim_text {chinese.misc}

R Documentation

Remove Words through Speech Tagging

Description

The function calls jiebaR::tagging to perform part-of-speech tagging on a Chinese text, and then removes words that carry certain tags.

Usage

slim_text(
  x,
  mycutter = DEFAULT_cutter,
  rm_place = TRUE,
  rm_time = TRUE,
  rm_eng = FALSE,
  rm_alpha = FALSE,
  paste = TRUE
)

Arguments

`x`	a character vector of length 1 containing the Chinese text to be tagged.
`mycutter`	a jiebaR cutter supplied by the user to tag the text. It has a default value; see Details.
`rm_place`	`TRUE` or `FALSE`. If `TRUE` (default), words related to a specified place ("ns") are removed.
`rm_time`	`TRUE` or `FALSE`. If `TRUE` (default), time related words ("t") are removed.
`rm_eng`	`TRUE` or `FALSE`. If `TRUE`, English words (tagged as "eng") are removed. The default is `FALSE`.
`rm_alpha`	should be "any", `TRUE` or `FALSE` (default). Some English words are tagged as "x", so they cannot be removed by setting `rm_eng`. When `rm_alpha` is `TRUE`, any word that contains only a-zA-Z is removed. If it is "any", words that are mixtures of a-zA-Z and Chinese characters or digits are removed as well.
`paste`	`TRUE` or `FALSE`, whether to paste the segmented words together into a character vector of length 1. The default is `TRUE`.
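
As a rough illustration of the two `rm_alpha` modes, the base R sketch below applies two regular expressions to a hypothetical word vector. The helper name `drop_alpha` and both patterns are assumptions made for illustration (including the assumption that "any" also covers words made up purely of a-zA-Z); the package's actual implementation may differ.

```r
# Hypothetical segmented words (written by hand, not produced by jiebaR).
words <- c("aaa", "abc123", "会议", "T恤", "2024")

# Illustrative helper, not part of chinese.misc.
drop_alpha <- function(words, rm_alpha = FALSE) {
  if (isTRUE(rm_alpha)) {
    # Drop words made up only of a-zA-Z.
    words[!grepl("^[a-zA-Z]+$", words)]
  } else if (identical(rm_alpha, "any")) {
    # Drop every word that contains at least one of a-zA-Z.
    words[!grepl("[a-zA-Z]", words)]
  } else {
    # rm_alpha = FALSE: leave the words untouched.
    words
  }
}

only_alpha_removed <- drop_alpha(words, rm_alpha = TRUE)
any_alpha_removed <- drop_alpha(words, rm_alpha = "any")
```

In this sketch, `rm_alpha = TRUE` drops only "aaa", while "any" also drops "abc123" and "T恤".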

Details

Stop words are often removed from texts, but a stop word list rarely covers every word that needs to be removed. So, before removing stop words, we can use tagging to remove many insignificant words and make the text "slim". The webpage http://www.docin.com/p-341417726.html?_t_t_t=0.3930890985844252 provides details about Chinese word tags.

Only words with the following tags are to be preserved:

(1) "n": nouns;
(2) "t": time related words;
(3) "s": space related words;
(4) "v": verbs;
(5) "a": adjectives;
(6) "b": words used only as attributes in Chinese;
(7) "x": strings;
(8) "j", "l", "i", "z": some specific Chinese letters and phrases;
(9) "unknown": words of unknown type;
(10) "eng": English words.

Optionally, words related to a specified place ("ns"), time related words ("t") and English words ("eng") can be removed, using rm_place, rm_time and rm_eng respectively.
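
The tag-based filtering described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper filter_by_tag is hypothetical, the tagged words are written by hand in the shape returned by jiebaR::tagging (a character vector whose names are the tags), and matching tags by their leading letters is an assumption that the text above does not confirm.

```r
# Hypothetical tagged words: the names are tags, the values are words.
tagged <- c(ns = "北京", t = "昨天", v = "举办", uj = "了", n = "会议",
            eng = "hello", x = "aaa")

# Illustrative helper, not part of chinese.misc.
filter_by_tag <- function(tagged, rm_place = TRUE, rm_time = TRUE,
                          rm_eng = FALSE) {
  tags <- names(tagged)
  # Assumption: a word is kept when its tag starts with one of these prefixes.
  keep <- grepl("^(n|t|s|v|a|b|x|j|l|i|z|unknown|eng)", tags)
  # Optional removals, mirroring rm_place, rm_time and rm_eng.
  if (rm_place) keep <- keep & tags != "ns"
  if (rm_time) keep <- keep & tags != "t"
  if (rm_eng) keep <- keep & tags != "eng"
  unname(tagged[keep])
}

kept_default <- filter_by_tag(tagged)
kept_all <- filter_by_tag(tagged, rm_place = FALSE, rm_time = FALSE)
```

With the default settings the sketch drops the place name, the time word and the particle tagged "uj", and keeps the verb, the noun, the English word and the string tagged "x".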

By default, the mycutter argument uses DEFAULT_cutter, which is assigned as worker(write = FALSE) when the package is loaded. As long as you have not manually created another variable called "DEFAULT_cutter", you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words. Whether or not you manually create an object called "DEFAULT_cutter", the original DEFAULT_cutter loaded with the package, which its functions use by default, is not removed. So, whenever you want to use this default value, simply leave mycutter unset.
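
A minimal sketch of working with the default cutter follows. It assumes that jiebaR and chinese.misc are installed and that no other object named DEFAULT_cutter has been created in the session; the sample sentence, the added word and its tag "n" are illustrative choices.

```r
library(jiebaR)
library(chinese.misc)

# Add a user word to the default cutter loaded with the package.
# The word and its tag "n" are illustrative choices.
new_user_word(DEFAULT_cutter, "数据科学", "n")

# mycutter is left unset, so DEFAULT_cutter is used.
res <- slim_text("我们昨天在北京学习数据科学")

# With paste = FALSE, each kept word is a separate element.
res_words <- slim_text("我们昨天在北京学习数据科学", paste = FALSE)
```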

Value

A character vector of length 1 containing the segmented text (when paste = TRUE), or a character vector in which each element is a word (when paste = FALSE).

Examples


library(jiebaR)
library(chinese.misc)
cutter <- jiebaR::worker()
# Give some English words the tag "x", so they are not tagged as "eng".
new_user_word(cutter, c("aaa", "bbb", "ccc"), rep("x", 3))
x <- "we have new words: aaa, bbb, ccc."
# The default is to keep English words.
slim_text(x, mycutter = cutter)
# Remove words tagged as "eng"; words with other tags are kept.
slim_text(x, mycutter = cutter, rm_eng = TRUE)
# Remove any word that consists only of a-zA-Z,
# whether rm_eng is TRUE or FALSE.
slim_text(x, mycutter = cutter, rm_eng = TRUE, rm_alpha = TRUE)
slim_text(x, mycutter = cutter, rm_eng = FALSE, rm_alpha = TRUE)

[Package chinese.misc version 0.2.3 Index]