seg_file {chinese.misc} | R Documentation |
Convenient Tool to Segment Chinese Texts
Description
The function first collects filenames or text vectors, then calls
jiebaR::segment to segment the texts. During this process, users can
apply additional modifications to each text.
File encoding is detected automatically.
After segmentation, the words that belong to one text are pasted
together into a single character string, with words separated by " ".
The segmented result is either returned or written to disk.
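The order of steps described above can be sketched in base R. This is only an illustration, not the package's implementation: toy_segment and toy_seg_pipeline are hypothetical names, a whitespace split stands in for jiebaR::segment, and it is an assumption that myfun2 receives the vector of segmented words before they are pasted together.

```r
# Stand-in segmenter: splits on whitespace.
# seg_file itself calls jiebaR::segment; this is only a placeholder.
toy_segment <- function(text) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper that mirrors the described steps for a text vector.
toy_seg_pipeline <- function(texts, myfun1 = NULL, myfun2 = NULL) {
  vapply(texts, function(txt) {
    if (!is.null(myfun1)) txt <- myfun1(txt)      # modify the raw text
    words <- toy_segment(txt)                     # segment into words
    if (!is.null(myfun2)) words <- myfun2(words)  # modify the words (assumed input)
    paste(words, collapse = " ")                  # one string per text
  }, character(1), USE.NAMES = FALSE)
}

x <- c("drink a bottle  of milk", "DRINK SOME WATER")
result <- toy_seg_pipeline(x, myfun1 = tolower)
print(result)
```

Each input text yields exactly one output string, which is why the returned vector has the same length as the input.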
Usage
seg_file(
...,
from = "dir",
folder = NULL,
mycutter = DEFAULT_cutter,
enc = "auto",
myfun1 = NULL,
myfun2 = NULL,
special = "",
ext = "txt"
)
Arguments
... |
names of folders, files, or a mixture of the two. It can also be a character
vector of texts to be processed when setting |
from |
should only be "dir" or "v".
If your inputs are filenames, it should be "dir" (default).
If the input is a character vector of texts, it should be "v". However, if it is set to "v",
make sure that no element of the vector is identical to a filename in your working
directory; if one is, an error will be raised.
This check is made because, if they are identical, the function
|
folder |
a length 1 character vector naming the folder in which to put the segmented texts.
Set it to |
mycutter |
the jiebaR cutter used to segment text. If it is not supplied, a default cutter is used. See Details. |
enc |
the file encoding used to read files. If files have different encodings or you do not know their encodings, set it to "auto" (default) to let encodings be detected automatically. |
myfun1 |
a function used to modify each text after being read by |
myfun2 |
a function used to modify each text after it is segmented. |
special |
a length 1 character or regular expression to be passed to |
ext |
the extension of the written files. Should be "txt", "rtf" or "". If it is not one of the three, it is set to "". This is only used when your input is a text vector rather than filenames and you want to write the result to disk. |
Details
Users can supply their own jiebaR cutter through mycutter. Otherwise, the function
uses DEFAULT_cutter, which is created when the package is loaded.
The DEFAULT_cutter is simply worker(write = FALSE). See jiebaR::worker.
As long as you have not manually created another variable called "DEFAULT_cutter",
you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words.
Whether or not you manually create an object called "DEFAULT_cutter", the original
DEFAULT_cutter loaded with the package, which the functions in this package use by
default, will not be removed.
So, whenever you want to use this default value, either leave mycutter unset or
set mycutter = chinese.misc::DEFAULT_cutter.
The encoding for writing files (if folder is not NULL) is always "UTF-8".
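The two ways of choosing a cutter described above might look like the following. This is a minimal sketch that assumes both jiebaR and chinese.misc are installed; my_cutter and the user word "coffeecup" are made up for illustration.

```r
library(jiebaR)
library(chinese.misc)

# A custom cutter, built the same way the text says DEFAULT_cutter is built
my_cutter <- worker(write = FALSE)

# Add a user word to the custom cutter ("coffeecup" is a hypothetical word,
# tagged "n" for noun)
new_user_word(my_cutter, "coffeecup", "n")

x <- "drink a cup of coffee"

# Segment with the custom cutter
res_custom <- seg_file(x, from = "v", mycutter = my_cutter)

# Segment with the package default, referenced explicitly so that a
# user-created object named DEFAULT_cutter cannot shadow it
res_default <- seg_file(x, from = "v", mycutter = chinese.misc::DEFAULT_cutter)

print(res_custom)
print(res_default)
```

Keeping a separate cutter object leaves the package default untouched, which can be useful when different corpora need different user dictionaries.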
Value
a character vector in which each element is a segmented text, with words separated by " ".
If folder is a folder name, the result is written to disk and
nothing is returned.
Examples
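require(jiebaR)
require(chinese.misc)  # provides seg_file and DEFAULT_cutter
# No Chinese word is allowed, so we use English here.
x <- c("drink a bottle of milk",
  "drink a cup of coffee",
  "DRINK SOME WATER")
# Lower-case each text before it is segmented
seg_file(x, from = "v", myfun1 = tolower)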