seg_file {chinese.misc}    R Documentation

Convenient Tool to Segment Chinese Texts

Description

The function first collects filenames or text vectors, then calls jiebaR::segment to segment the texts. Users can modify the texts both before and after segmentation (see myfun1 and myfun2). File encoding is detected automatically by default. After segmentation, the words that belong to one text are pasted together into a single character string, with words separated by " ". The segmented result is either returned or written to disk.

Usage

seg_file(
  ...,
  from = "dir",
  folder = NULL,
  mycutter = DEFAULT_cutter,
  enc = "auto",
  myfun1 = NULL,
  myfun2 = NULL,
  special = "",
  ext = "txt"
)

Arguments

...

names of folders, files, or a mixture of both. It can also be a character vector of texts to be processed when from is set to "v"; see below.

from

should be either "dir" or "v". If the inputs are filenames, it should be "dir" (the default). If the input is a character vector of texts, it should be "v". When it is set to "v", make sure that no element of the vector is identical to the name of a file in your working directory; if one is, an error will be raised. This check is needed because the function segment would otherwise treat such an element as a file to read.

folder

a length 1 character vector indicating the folder in which to put the segmented texts. Set it to NULL if you want the result returned as a character vector rather than written to disk. Otherwise, it should be a valid directory path, and each segmented text will be written into a .txt/.rtf file. If the specified folder does not exist, the function will try to create it.

mycutter

the jiebar cutter to segment text. A default cutter is used. See Details.

enc

the file encoding used to read files. If files have different encodings or you do not know their encodings, set it to "auto" (default) to let encodings be detected automatically.

myfun1

a function used to modify each text after being read by scancn and before being segmented.

myfun2

a function used to modify each text after it is segmented.

special

a length 1 character or regular expression to be passed to dir_or_file to specify what pattern should be met by filenames. The default is to read all files.

ext

the extension of the written files. It should be "txt", "rtf" or "". If it is not one of the three, it is set to "". This is only used when the input is a text vector rather than filenames and you want to write the result to disk.

Details

Users can provide their own jiebar cutter through mycutter. Otherwise, the function uses DEFAULT_cutter, which is created when the package is loaded. DEFAULT_cutter is simply worker(write = FALSE). See jiebaR::worker.

As long as you have not manually created another variable called "DEFAULT_cutter", you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words. Whether or not you manually create an object called "DEFAULT_cutter", the DEFAULT_cutter originally loaded with the package, which functions in this package use by default, is not removed. So, whenever you want to use this default, either leave mycutter unset or set mycutter = chinese.misc::DEFAULT_cutter.
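
The following is a minimal sketch of supplying a custom cutter, assuming jiebaR and chinese.misc are installed. The name my_cutter and the user word "coldbrew" are hypothetical and chosen only for illustration; the sample texts are in English for the same reason as in the Examples.

```r
library(jiebaR)
library(chinese.misc)

# Hypothetical custom cutter, built the same way as DEFAULT_cutter
my_cutter <- worker(write = FALSE)

# Add an illustrative user word to the custom cutter only
new_user_word(my_cutter, "coldbrew")

x <- c("drink a bottle of milk", "drink a cup of coldbrew")

# Segment with the custom cutter
res_custom <- seg_file(x, from = "v", mycutter = my_cutter)

# Segment with the package default, named explicitly
res_default <- seg_file(x, from = "v", mycutter = chinese.misc::DEFAULT_cutter)
```

Keeping user words in a separate cutter leaves DEFAULT_cutter untouched, which may be preferable when several analyses share one session.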

The encoding for writing files (if folder is not NULL) is always "UTF-8".

Value

a character vector in which each element is a segmented text, with words separated by " ". If folder is a folder name, the result is written to disk and nothing is returned.
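
Because each returned element is a single string with words separated by " ", the individual words can be recovered with base R. The sketch below uses a hand-written stand-in vector (segmented, a hypothetical name) rather than actual seg_file output.

```r
# Stand-in for a seg_file() result: one string per text, words split by " "
segmented <- c("drink a bottle of milk", "drink some water")

# Recover the individual words of each text
words <- strsplit(segmented, " ", fixed = TRUE)

# Number of words in each text
n_words <- lengths(words)
```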

Examples

require(jiebaR)
# Chinese characters are not allowed in this example, so we use English here.
x <- c("drink a bottle of milk",
  "drink a cup of coffee",
  "DRINK SOME WATER")
# myfun1 = tolower converts each text to lowercase before it is segmented
seg_file(x, from = "v", myfun1 = tolower)
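
A further sketch, assuming jiebaR and chinese.misc are installed, shows the results being written to disk instead of returned. The folder out_dir is a hypothetical location under the session's temporary directory; the names of the written files are not assumed here and are simply listed afterwards.

```r
library(jiebaR)
library(chinese.misc)

x <- c("drink a bottle of milk", "drink a cup of coffee")

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "segmented")

# With folder set, results are written to disk instead of being returned
seg_file(x, from = "v", folder = out_dir, ext = "txt")

# Inspect what was written; written files are encoded in UTF-8
written <- list.files(out_dir, full.names = TRUE)
lapply(written, readLines, encoding = "UTF-8", warn = FALSE)
```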

[Package chinese.misc version 0.2.3 Index]