R: Extract Words of Some Certain Tags through Pos-Tagging

get_tag_word {chinese.misc}

R Documentation

Extract Words of Some Certain Tags through Pos-Tagging

Description

Given a group of Chinese texts, this function extracts the words that carry the specified part-of-speech tags. For example, you may want to collect all verbs used in your texts. Note: this function uses jiebaR::tagging to segment texts and do pos-tagging, and the tags it assigns are not always correct. Alternatively, you can first pos-tag your texts with another method and then pass the tagged result to this function.

Usage

get_tag_word(
  x,
  tag = NULL,
  tag_pattern = NULL,
  mycutter = DEFAULT_cutter,
  type = "word",
  each = TRUE,
  only_unique = FALSE,
  keep_name = FALSE,
  checks = TRUE
)

Arguments

`x`	a list of character vectors; it must be a list even when it contains only one element. Each element of the list is either a length 1 character vector holding one text, or a length >= 1 character vector that is the result of earlier tagging work. It should not contain `NA`.
`tag`	one or more tags should be specified. Words with these tags will be chosen. Possible tags are "v", "n", "vn", etc.
`tag_pattern`	a length 1 regular expression. You can select tags with this pattern instead of providing tag names directly. For example, `tag_pattern = "^n"` selects tag names starting with "n". Exactly one of `tag` and `tag_pattern` must be `NULL`.
`mycutter`	a cutter created with package jiebaR and given by users to tag texts. If your texts have already been pos-tagged, you can set this to `NULL`. By default, a `DEFAULT_cutter` is used, which is assigned as `worker(write = FALSE)` when loading the package.
`type`	if it is "word" (default), the words that match your tags are extracted. If it is "position", only the positions of those words are returned. Note: if it is "position", argument `each` (see below) will always be set to `TRUE`.
`each`	if this is `TRUE` (default), the return value is a list, each element of which is the extraction result for one text. If it is `FALSE`, the return value is a single character vector of extracted words. See Details.
`only_unique`	if it is `TRUE`, only unique words are returned. The default is `FALSE`. See Details.
`keep_name`	whether to keep the tag names of the extracted words. The default is `FALSE`. Note: if `only_unique = TRUE`, all tag names will be removed.
`checks`	whether to check the correctness of arguments. The default is `TRUE`.
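
The selection described by `tag`, `tag_pattern` and `type` can be sketched in base R on a text that has already been tagged. This is an illustration only, not the package's implementation: the helper name `pick_by_tag` is hypothetical, and it assumes the tags are stored as the names of the character vector, as in the Examples.

```r
# Hypothetical helper (not part of chinese.misc): select words from one
# pre-tagged text, assuming names(words) hold the tags.
pick_by_tag <- function(words, tag = NULL, tag_pattern = NULL, type = "word") {
  # Exactly one of tag and tag_pattern may be given.
  stopifnot(xor(is.null(tag), is.null(tag_pattern)))
  hit <- if (!is.null(tag)) {
    names(words) %in% tag             # exact match on tag names
  } else {
    grepl(tag_pattern, names(words))  # regular-expression match on tag names
  }
  # Return either the positions of the matches or the matched words
  # (names dropped, mirroring keep_name = FALSE).
  if (type == "position") which(hit) else unname(words[hit])
}

x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")
verbs    <- pick_by_tag(x1, tag = "v")
x_words  <- pick_by_tag(x1, tag_pattern = "^x")
verb_pos <- pick_by_tag(x1, tag = "v", type = "position")
```

Here `tag` matches tag names exactly, while `tag_pattern` matches them as a regular expression, so `"^x"` picks up both `xdrink` and `xfood`.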

Details

The arguments each and only_unique together determine the form of the return value.

If each = TRUE and only_unique = FALSE, the result is a list, each element of which contains the words extracted from one text. This is the default.
If each = TRUE and only_unique = TRUE, the result is a list, each element of which contains only the unique words extracted from one text.
If each = FALSE and only_unique = FALSE, all extracted words are put into a single vector.
If each = FALSE and only_unique = TRUE, the extracted words are put into a single vector, and only the unique words are returned.
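
The four combinations above can be mimicked in base R. This is a minimal sketch that assumes the per-text extraction has already been done; the object `per_text` and the other variable names are made up for illustration and do not show how the package computes its result.

```r
# Illustration only: pretend these are the verbs extracted from two texts.
per_text <- list(c("drink", "drink", "eat"), c("drink", "buy"))

each_all    <- per_text                  # each = TRUE,  only_unique = FALSE
each_unique <- lapply(per_text, unique)  # each = TRUE,  only_unique = TRUE
flat_all    <- unlist(per_text)          # each = FALSE, only_unique = FALSE
flat_unique <- unique(unlist(per_text))  # each = FALSE, only_unique = TRUE
```

With each = TRUE, duplicates are removed within each text separately; with each = FALSE, they are removed across the combined vector.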

Examples

# Chinese characters are avoided here, so already-tagged English words are used instead.
x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")
x2 <- c(v = "drink", xdrink = "tea", v = "buy", x = "computer")
x <- list(x1, x2)
get_tag_word(x, tag = "v", mycutter = NULL)
get_tag_word(x, tag = "v", mycutter = NULL, only_unique = TRUE)
get_tag_word(x, tag_pattern = "^x", mycutter = NULL)
get_tag_word(x, tag_pattern = "^x", mycutter = NULL, keep_name = TRUE)
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE)
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE, only_unique = TRUE)
get_tag_word(x, tag = "v", mycutter = NULL, type = "position")

[Package chinese.misc version 0.2.3 Index]