stri_enc_detect2 {stringi} | R Documentation |
[DEPRECATED] Detect Locale-Sensitive Character Encoding
Description
This function tries to detect the character encoding of a text whose language is known.
Usage
stri_enc_detect2(str, locale = NULL)
Arguments
str | character vector, a raw vector, or a list of raw vectors |
locale |
|
Details
Vectorized over str.
First, the text is checked for being valid UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 (as in stri_enc_detect, this is roughly inspired by ICU's i18n/csrucode.cpp) or ASCII.
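As a rough illustration of this first step, the sketch below runs stringi's exported validity predicates over a byte sequence. The helper name passes_unicode_checks is hypothetical, and the function itself may apply these checks in a different order or with further heuristics.

```r
library(stringi)

# Validity predicates exported by stringi; each accepts a raw vector
checks <- list(
  ascii   = stri_enc_isascii,
  utf8    = stri_enc_isutf8,
  utf16be = stri_enc_isutf16be,
  utf16le = stri_enc_isutf16le,
  utf32be = stri_enc_isutf32be,
  utf32le = stri_enc_isutf32le
)

# Hypothetical helper: which checks does a byte sequence pass?
passes_unicode_checks <- function(bytes) {
  vapply(checks, function(f) f(bytes), logical(1))
}

ascii_bytes <- charToRaw("plain ASCII text")
utf8_bytes  <- charToRaw(enc2utf8("za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"))

passes_unicode_checks(ascii_bytes)
passes_unicode_checks(utf8_bytes)
```

Text that passes none of these checks is the case in which the locale-based step becomes relevant.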
If locale is not NA and the above fails, the function counts the occurrences of language-specific code points (data provided by the ICU library), converted to every 8-bit encoding that fully covers the indicated language. The encoding with the greatest total number of byte hits is selected.
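The byte-hit idea can be sketched as follows. This is only an approximation under stated assumptions: the set of Polish-specific letters and the two candidate encodings are hand-picked here, whereas the function takes its language data from ICU, and count_hits is a hypothetical helper.

```r
library(stringi)

# Assumption: hand-picked Polish-specific letters (the real data comes from ICU)
specific <- c("\u0105", "\u0107", "\u0119", "\u0142", "\u0144",
              "\u00f3", "\u015b", "\u017a", "\u017c")

# Assumption: two candidate 8-bit encodings that cover Polish
candidates <- c("ISO-8859-2", "windows-1250")

# Hypothetical helper: count the bytes in `bytes` that equal the encoded
# form of one of the language-specific letters in encoding `enc`
count_hits <- function(bytes, enc) {
  encoded <- stri_encode(specific, "UTF-8", enc, to_raw = TRUE)
  marker  <- as.integer(unlist(encoded))
  sum(as.integer(bytes) %in% marker)
}

# Sample input: Polish text stored as ISO-8859-2 bytes
text  <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- stri_encode(text, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]]

hits <- vapply(candidates, function(enc) count_hits(bytes, enc), integer(1))
hits
names(hits)[which.max(hits)]  # candidate with the most byte hits
```

Encodings that share most of their byte assignments produce close counts, which is one reason short inputs yield unreliable guesses.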
The guess is of course imprecise, as it is obtained using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that is in a single language.
If you have no initial guess on the language and encoding, try stri_enc_detect (uses ICU facilities).
Value
Just like stri_enc_detect, this function returns a list of length equal to the length of str.
Each list element is a data frame with the following three named components:
- Encoding – string; guessed encodings; NA on failure (if and only if encodings is empty),
- Language – always NA,
- Confidence – numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.
The guesses are ordered by decreasing confidence.
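A minimal sketch of working with this return structure follows. It calls stri_enc_detect rather than stri_enc_detect2, on the assumption that the deprecated function may be missing from the installed stringi version; as stated above, both return the same kind of list. The sample inputs are illustrative only.

```r
library(stringi)

# Two inputs, so the result is a list of length two
inputs <- list(
  charToRaw("plain ASCII text"),
  charToRaw(enc2utf8("za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"))
)

res <- stri_enc_detect(inputs)

# Guesses are ordered by decreasing confidence, so the first row is the best one;
# as.character() guards against an NA placeholder when no guess is available
top <- vapply(res, function(df) as.character(df$Encoding[1]), character(1))
top

# Full table of guesses for the second input
res[[2]]
```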
Author(s)
Marek Gagolewski and other contributors
See Also
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other locale_sensitive: %s<%(), about_locale, about_search_boundaries, about_search_coll, stri_compare(), stri_count_boundaries(), stri_duplicated(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_collator(), stri_order(), stri_rank(), stri_sort_key(), stri_sort(), stri_split_boundaries(), stri_trans_tolower(), stri_unique(), stri_wrap()
Other encoding_detection: about_encoding, stri_enc_detect(), stri_enc_isascii(), stri_enc_isutf16be(), stri_enc_isutf8()