get_tag_word {chinese.misc}    R Documentation

Extract Words with Certain Tags through POS Tagging

Description

Given a group of Chinese texts, this function extracts words of the specified types (part-of-speech tags). For example, you may want to collect all verbs used in your texts. Note: this function uses jiebaR::tagging to segment texts and perform POS tagging, and the tags it assigns are not always correct. Alternatively, you can first POS-tag your texts with another method and then pass the tagged result to this function.

Usage

get_tag_word(
  x,
  tag = NULL,
  tag_pattern = NULL,
  mycutter = DEFAULT_cutter,
  type = "word",
  each = TRUE,
  only_unique = FALSE,
  keep_name = FALSE,
  checks = TRUE
)

Arguments

x

it must be a list of character vectors, even when the list contains only one element. Each element of the list is either a length 1 character vector holding a text, or a character vector of length >= 1 that is the result of earlier tagging work. It should not contain NA.
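
The two accepted shapes of x can be sketched in base R as follows. This is only an illustration: the object names (raw_texts, tagged_texts, and so on) are made up for the sketch, and the convention that names hold the tags and values hold the words is taken from the Examples section of this page.

```r
# Form 1: raw texts, one length-1 character vector per list element.
raw_texts <- list("I drink coffee", "I buy a computer")

# Form 2: already-tagged texts; names hold the tags, values hold the words
# (same convention as the vectors in the Examples section).
tagged_1 <- c(v = "drink", n = "coffee")
tagged_2 <- c(v = "buy", n = "computer")
tagged_texts <- list(tagged_1, tagged_2)

# A single text still has to be wrapped in a list.
single_text <- list(tagged_1)

# x should not contain NA; this is one way to check before calling the function.
has_na <- any(vapply(tagged_texts, anyNA, logical(1)))
```

With already-tagged input like tagged_texts, mycutter would be set to NULL, as described under that argument.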

tag

one or more tags to select. Words with these tags will be extracted. Possible tags are "v", "n", "vn", etc.

tag_pattern

a length 1 regular expression. You can specify tags with this pattern instead of providing tag names directly. For example, tag_pattern = "^n" selects tag names starting with "n". Exactly one of tag and tag_pattern must be NULL; that is, supply one of them and leave the other as NULL.
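
To see how a pattern differs from exact tag names, the following base R sketch applies both to a small vector of tag names. It only illustrates the matching idea and is not the package's code. The vector example_tags is hypothetical, and "nr" and "x" are included purely as illustrative tag names.

```r
# Hypothetical tag names for illustration.
example_tags <- c("v", "n", "vn", "nr", "x")

# Exact matching, in the spirit of tag = "n": only the tag "n" itself.
exact_match <- example_tags[example_tags %in% "n"]

# Pattern matching, in the spirit of tag_pattern = "^n": every tag that
# starts with "n". Note that "vn" is not selected, because "^" anchors
# the match at the start of the tag name.
pattern_match <- example_tags[grepl("^n", example_tags)]

print(exact_match)    # "n"
print(pattern_match)  # "n" "nr"
```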

mycutter

a cutter created with package jiebaR and supplied by the user to tag texts. If your texts have already been POS-tagged, you can set this to NULL. By default, DEFAULT_cutter is used, which is created as worker(write = FALSE) when the package is loaded.

type

if it is "word" (default), the words that match your tags are extracted. If it is "position", only the positions of those words are returned. Note: if it is "position", argument each (see below) will always be set to TRUE.
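
The difference between words and positions can be sketched in base R on one already-tagged vector. This is a sketch of the idea only, and it assumes that a position means the index of a matching word within its text. The exact structure of the value returned by get_tag_word is not reproduced here.

```r
# One already-tagged text (names are tags, values are words),
# copied from the Examples section.
tagged <- c(v = "drink", xdrink = "coffee", v = "drink",
            xdrink = "cola", v = "eat", xfood = "banana")

# Logical mask of the elements whose tag is exactly "v".
is_verb <- names(tagged) == "v"

# The idea behind type = "word": the matching words themselves.
verb_words <- unname(tagged[is_verb])

# The idea behind type = "position": where those words sit in the text.
verb_positions <- which(is_verb)

print(verb_words)      # "drink" "drink" "eat"
print(verb_positions)  # 1 3 5
```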

each

if this is TRUE (default), the return value is a list, each element of which is the extraction result for one text. If it is FALSE, the return value is a single character vector of extracted words. See Details.

only_unique

if it is TRUE, only unique words are returned. The default is FALSE. See Details.

keep_name

whether to keep the tag names of the extracted words. The default is FALSE. Note: if only_unique = TRUE, all tag names will be removed.

checks

whether to check the correctness of arguments. The default is TRUE.

Details

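The arguments each and only_unique together determine the form of the return value: each controls whether results are returned per text (a list) or pooled into one character vector, and only_unique controls whether duplicated words are dropped.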
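
The four combinations of each and only_unique can be sketched in base R. This is an interpretation of the argument descriptions above and not the package's implementation, so the actual output of get_tag_word may differ in details such as names. The helper pick_verbs is hypothetical, and the data are the tagged vectors from the Examples section.

```r
# Already-tagged texts (names are tags, values are words).
x1 <- c(v = "drink", xdrink = "coffee", v = "drink",
        xdrink = "cola", v = "eat", xfood = "banana")
x2 <- c(v = "drink", xdrink = "tea", v = "buy", x = "computer")
x <- list(x1, x2)

# Hypothetical helper: keep the words of one text whose tag is exactly "v".
pick_verbs <- function(text) unname(text[names(text) == "v"])

# each = TRUE, only_unique = FALSE: one result per text, duplicates kept.
per_text <- lapply(x, pick_verbs)

# each = TRUE, only_unique = TRUE: one result per text, duplicates dropped
# within each text.
per_text_unique <- lapply(per_text, unique)

# each = FALSE, only_unique = FALSE: all texts pooled into one vector.
pooled <- unlist(per_text)

# each = FALSE, only_unique = TRUE: pooled vector with duplicates dropped.
pooled_unique <- unique(pooled)

print(pooled)         # "drink" "drink" "eat" "drink" "buy"
print(pooled_unique)  # "drink" "eat" "buy"
```

Note that dropping duplicates per text and dropping them after pooling give different results: "drink" survives once in each element of per_text_unique, but only once overall in pooled_unique.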

Examples

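# No Chinese text here, so English words are used instead.
# Each vector mimics an already-tagged text: names are tags, values are words.
x1 <- c(v = "drink", xdrink = "coffee", v = "drink", xdrink = "cola", v = "eat", xfood = "banana")
x2 <- c(v = "drink", xdrink = "tea", v = "buy", x = "computer")
x <- list(x1, x2)
# The texts are already tagged, so mycutter = NULL in every call.
# Select words whose tag is exactly "v".
get_tag_word(x, tag = "v", mycutter = NULL)
get_tag_word(x, tag = "v", mycutter = NULL, only_unique = TRUE)
# Select words whose tag name starts with "x".
get_tag_word(x, tag_pattern = "^x", mycutter = NULL)
get_tag_word(x, tag_pattern = "^x", mycutter = NULL, keep_name = TRUE)
# Return one character vector instead of one result per text.
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE)
get_tag_word(x, tag = "v", mycutter = NULL, each = FALSE, only_unique = TRUE)
# Return positions instead of words.
get_tag_word(x, tag = "v", mycutter = NULL, type = "position")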

[Package chinese.misc version 0.2.3 Index]