uniqtag {uniqtag} | R Documentation |
Abbreviate strings to short, unique identifiers.
Description
Abbreviate strings to unique substrings of k characters.
Usage
uniqtag(xs, k = 9, uniq = make_unique_all_or_none, sep = "-")
Arguments
xs |
a character vector |
k |
the size of the identifier, an integer |
uniq |
a function to make the abbreviations unique, such as make_unique, make_unique_duplicates, make_unique_all_or_none, make_unique_all, or make.unique; to disable this step, use identity or NULL |
sep |
a character string used to separate a duplicate string from its sequence number |
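As a small illustration of the role of sep, the sketch below uses only base R's make.unique, which is one of the functions listed for uniq. The vector tags is made-up example data, and the package's own make_unique* functions may number duplicates differently.

```r
# Made-up abbreviations containing one duplicate.
tags <- c("ala", "ala", "tex")

# make.unique leaves the first occurrence alone and appends
# sep followed by a sequence number to later duplicates.
unique_tags <- make.unique(tags, sep = "-")
unique_tags
# "ala" "ala-1" "tex"
```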
Details
For each string in a set of strings, determine a unique tag that is a substring of fixed size k unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used. This lexicographically smallest substring of size k is called the UniqTag of that string.
The lexicographically smallest substring depends on the locale's sort order.
You may wish to first call Sys.setlocale("LC_COLLATE", "C").
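The selection rule described above can be sketched in base R. This is an illustrative sketch, not the package's source: kmers_of and uniqtag_sketch are hypothetical names, the input is assumed to consist of non-empty, non-NA strings, and "frequency" is assumed to mean the number of input strings that contain a substring. The step that makes duplicate tags unique (uniq) is left out.

```r
# Hypothetical helper: the distinct substrings of length k of one string
# (the whole string if it has k or fewer characters).
kmers_of <- function(x, k) {
  n <- nchar(x)
  if (n <= k) return(x)
  unique(substring(x, 1:(n - k + 1), k:n))
}

# Hypothetical sketch of the tag selection, without the uniq step.
uniqtag_sketch <- function(xs, k = 9) {
  kmers <- lapply(xs, kmers_of, k = k)
  # Assumption: count each substring once per string that contains it.
  counts <- table(unlist(kmers))
  vapply(kmers, function(km) {
    n <- as.vector(counts[km])
    # Keep the least frequent substrings (count 1 means unique).
    candidates <- km[n == min(n)]
    # Radix sort orders character vectors as in the C locale.
    sort(candidates, method = "radix")[1]
  }, character(1))
}

example_tags <- uniqtag_sketch(c("abcd", "abce", "xbcd"), k = 3)
example_tags
# "abc" "bce" "xbc"
```

In this example "abcd" has no substring of size 3 that is unique to it ("abc" and "bcd" each occur in two strings), so the sketch falls back to the least frequent substrings and picks the lexicographically smallest, "abc".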
Value
a character vector of the UniqTags of the strings xs
See Also
abbreviate, locales, make.unique
Examples
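# Use the C locale so that the lexicographic order is reproducible.
Sys.setlocale("LC_COLLATE", "C")
# Remove the space from state names such as "New York".
states <- sub(" ", "", state.name)
# UniqTags with the default k = 9, then with shorter tags.
uniqtags <- uniqtag(states)
uniqtags4 <- uniqtag(states, k = 4)
uniqtags3 <- uniqtag(states, k = 3)
# Same as above, but with make_unique as the uniq function.
uniqtags3x <- uniqtag(states, k = 3, uniq = make_unique)
# Compare the distributions of string lengths.
table(nchar(states))
table(nchar(uniqtags))
table(nchar(uniqtags4))
table(nchar(uniqtags3))
table(nchar(uniqtags3x))
# Show the 3-character tags whose make_unique version contains the separator.
uniqtags3[grep("-", uniqtags3x)]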