R: Check character encoding in corpus folder

check.encoding {stylo}

R Documentation

Check character encoding in corpus folder

Description

Using non-ASCII characters is never trivial, but sometimes unavoidable: most of the world's languages use non-Latin alphabets or add diacritics to the standard Latin script. The default character encoding in stylo is UTF-8, and deviating from it can cause problems. This function allows users to check the character encoding of the files in a corpus. A summary is returned to the terminal, and a detailed list reporting the most probable encoding of each text file in the folder can be written to a csv file. The function is essentially a wrapper around the function guess_encoding() from the 'readr' package by Wickham et al. (2017). To change the encoding to UTF-8, try the change.encoding() function.

Usage

check.encoding(corpus.dir = "corpus/", output.file = NULL)

Arguments

`corpus.dir`	path to the folder containing the corpus.
`output.file`	path to a csv file that reports the most probable encoding for each text file in the corpus.

Details

If no argument is passed, the function checks the text files in the default subdirectory corpus/ of the current working directory. A csv report is written only if output.file is specified.
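
To illustrate the idea of wrapping guess_encoding() around a whole folder, the following is a minimal sketch, not the actual implementation of check.encoding(). The helper name guess_corpus_encodings() and the column names of its result are hypothetical; the only external call it relies on is readr::guess_encoding(), which returns candidate encodings with a confidence score for a single file.

```r
# Hypothetical helper (not part of stylo): report the most probable
# encoding of every file in a folder, using readr::guess_encoding().
guess_corpus_encodings <- function(corpus.dir = "corpus/") {
  files <- list.files(corpus.dir, full.names = TRUE)
  # list.files() also returns subdirectories; keep regular files only
  files <- files[!dir.exists(files)]

  guesses <- lapply(files, function(f) {
    # candidates are returned with the most confident guess first
    g <- readr::guess_encoding(f)
    has.guess <- nrow(g) > 0
    data.frame(
      file       = basename(f),
      encoding   = if (has.guess) g$encoding[1] else NA_character_,
      confidence = if (has.guess) g$confidence[1] else NA_real_,
      stringsAsFactors = FALSE
    )
  })

  # one row per file; NULL if the folder contains no files
  do.call(rbind, guesses)
}
```

Under these assumptions, guess_corpus_encodings("corpus/") yields one row per file, and table() applied to its encoding column gives a quick overview of how many files deviate from UTF-8.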

Value

The function returns a summary message and, if output.file is specified, writes detailed results into a csv file.

Author(s)

Steffen Pielström

References

Wickham, H., Hester, J., Francois, R., Jylanki, J., and Jørgensen, M. (2017). Package: 'readr'. <https://cran.r-project.org/web/packages/readr/readr.pdf>.

Examples

## Not run: 
# standard usage from stylo working directory with a 'corpus' subfolder:
check.encoding()

# specifying another folder:
check.encoding("~/corpora/example1/")

# specifying an output file:
check.encoding(output.file = "~/experiments/charencoding/example1.csv")


## End(Not run)

[Package stylo version 0.7.5 Index]