get_tag_word {chinese.misc} | R Documentation |
Extract Words with Specified Tags through POS Tagging
Description
Given a group of Chinese texts, this function extracts words of specified types (part-of-speech tags). For example, you may
want to collect all verbs used in your texts. Note: this function uses jiebaR::tagging to segment texts and do POS tagging.
The tags assigned are not always correct, so you can alternatively POS-tag your texts with another method first and then
pass the result to this function.
Usage
get_tag_word(
x,
tag = NULL,
tag_pattern = NULL,
mycutter = DEFAULT_cutter,
type = "word",
each = TRUE,
only_unique = FALSE,
keep_name = FALSE,
checks = TRUE
)
Arguments
x |
it must be a list of character vectors, even when the list contains only one element.
Each element of the list is either a length 1 character vector holding a text, or
a character vector of length >= 1 that is the result of earlier tagging work. It should not contain |
tag |
one or more tags should be specified. Words with these tags will be chosen. Possible tags are "v", "n", "vn", etc. |
tag_pattern |
should be a length 1 regular expression. You can specify tags with this pattern rather than providing
tag names directly. For example, you can specify tag names starting with "n" by |
mycutter |
a cutter created with package jiebaR and
given by users to tag texts. If your texts have already been pos-tagged, you
can set this to |
type |
if it is "word" (default), the words that match your tags are extracted. If it is "position", only the positions
of the words are returned. Note: if it is "position", argument |
each |
if this is |
only_unique |
if it is |
keep_name |
whether to keep the tag names of the extracted words. The default is |
checks |
whether to check the correctness of arguments. The default is |
Details
The arguments each and only_unique determine the form of the return value:
If each = TRUE and only_unique = FALSE, the result is a list, each element of which contains the words extracted. This is the default.
If each = TRUE and only_unique = TRUE, the result is a list, each element of which contains only the unique words extracted.
If each = FALSE and only_unique = FALSE, all words extracted are put into a single vector.
If each = FALSE and only_unique = TRUE, the words extracted are put into a single vector, but only unique words are returned.
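The four combinations of each and only_unique can be illustrated with a rough base-R sketch. This is not the package's implementation: the helper pick_by_tag is a hypothetical name introduced only for illustration, and it assumes the input is already tagged, that is, a list of named character vectors whose names are the tags.

```r
# Hypothetical helper, NOT part of chinese.misc: it only mimics how
# `each` and `only_unique` shape the result for already-tagged input.
pick_by_tag <- function(x, tag, each = TRUE, only_unique = FALSE) {
  # Keep the words whose name (tag) is one of `tag`, per list element
  picked <- lapply(x, function(v) unname(v[names(v) %in% tag]))
  if (each) {
    # One result per input element; optionally drop duplicates within each
    if (only_unique) picked <- lapply(picked, unique)
    return(picked)
  }
  # each = FALSE: pool everything into a single vector
  pooled <- unlist(picked)
  if (only_unique) pooled <- unique(pooled)
  pooled
}

x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")
x2 <- c(v = "drink", xdrink = "tea", v = "buy", x = "computer")
x <- list(x1, x2)

pick_by_tag(x, "v")                                    # list, duplicates kept
pick_by_tag(x, "v", only_unique = TRUE)                # list, unique within each element
pick_by_tag(x, "v", each = FALSE)                      # single vector, duplicates kept
pick_by_tag(x, "v", each = FALSE, only_unique = TRUE)  # single vector, unique words
```

In this sketch, with each = FALSE and only_unique = TRUE, duplicates are removed across the pooled vector, so "drink" appears once even though it occurs in both inputs.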
Examples
# No Chinese text is used here; English words stand in instead.
x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")
x2 <- c(v = "drink", xdrink = "tea", v = "buy", x = "computer")
x <- list(x1, x2)
get_tag_word(x, tag = "v", mycutter = NULL)
get_tag_word(x, tag = "v", mycutter = NULL, only_unique = TRUE)
get_tag_word(x, tag_pattern = "^x", mycutter = NULL)
get_tag_word(x, tag_pattern = "^x", mycutter = NULL, keep_name = TRUE)
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE)
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE, only_unique = TRUE)
get_tag_word(x, tag = "v", mycutter = NULL, type = "position")
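To make the difference between tag, tag_pattern and type = "position" concrete, the following base-R sketch shows the kind of selection each one describes on a single pre-tagged vector. It does not call get_tag_word, and it assumes that positions are 1-based indices into the tagged vector; the variable names are illustrative only.

```r
# A pre-tagged vector: names are tags, values are words
x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")

# Exact tag names (the idea behind `tag`): match names literally
by_tag <- unname(x1[names(x1) %in% c("xdrink", "xfood")])

# Regular expression on tag names (the idea behind `tag_pattern`)
hits <- grepl("^x", names(x1))
by_pattern <- unname(x1[hits])

# Positions instead of words (assumed meaning of type = "position")
positions <- which(hits)

by_tag
by_pattern
positions
```

Here the pattern "^x" selects the same words as listing "xdrink" and "xfood" explicitly, which is why a pattern is convenient when many tags share a prefix.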