R: Abbreviate strings to short, unique identifiers.

uniqtag {uniqtag}

R Documentation

Abbreviate strings to short, unique identifiers.

Description

Abbreviate strings to unique substrings of k characters.

Usage

uniqtag(xs, k = 9, uniq = make_unique_all_or_none, sep = "-")

Arguments

`xs`	a character vector
`k`	the size of the identifier, an integer
`uniq`	a function to make the abbreviations unique, such as make_unique, make_unique_duplicates, make_unique_all_or_none, make_unique_all, or make.unique; to disable this step, use identity or NULL
`sep`	a character string used to separate a duplicate string from its sequence number
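
As a rough illustration of what the `uniq` and `sep` arguments control, the sketch below uses only base R. The vector `tags` is made-up data, and base `make.unique` stands in for the package's own helpers, whose numbering rules may differ.

```r
# Made-up abbreviations containing one duplicate (illustrative data only).
tags <- c("ala", "ala", "kan")

# Base R make.unique keeps the first occurrence and appends `sep` plus a
# sequence number to each later duplicate.
made_unique <- make.unique(tags, sep = "-")
print(made_unique)
# [1] "ala"   "ala-1" "kan"

# identity returns its input unchanged, so duplicates are left in place.
unchanged <- identity(tags)
print(unchanged)
```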

Details

For each string in a set of strings, determine a unique tag that is a substring of fixed size k unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used. This lexicographically smallest substring of size k is called the UniqTag of that string.
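
The selection rule described above can be sketched in base R as follows. This is a minimal illustration rather than the package's implementation: `kmers_of_sketch` and `uniqtag_sketch` are hypothetical names, strings shorter than k are assumed to be used whole, and the final `uniq` step is omitted.

```r
# Hypothetical helper: the distinct substrings of length k of one string.
# Assumption: a string of k or fewer characters is used whole.
kmers_of_sketch <- function(x, k) {
  n <- nchar(x)
  if (n <= k) return(x)
  unique(substring(x, 1:(n - k + 1), k:n))
}

# Hypothetical sketch of the rule: for each string, keep its least frequent
# substrings of length k, then take the lexicographically smallest of them.
uniqtag_sketch <- function(xs, k = 9) {
  kmers <- lapply(xs, kmers_of_sketch, k = k)
  # Number of strings in which each substring occurs.
  counts <- table(unlist(kmers))
  vapply(kmers, function(km) {
    freq <- as.integer(counts)[match(km, names(counts))]
    candidates <- km[freq == min(freq)]
    min(candidates)  # smallest under the current collation order
  }, character(1))
}

# "abc" and "bcd" each occur in two strings, so the first string has no
# unique substring and falls back to the smallest least frequent one.
sketch_tags <- uniqtag_sketch(c("abcd", "abce", "xbcd"), k = 3)
print(sketch_tags)
# [1] "abc" "bce" "xbc"
```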

The lexicographically smallest substring depends on the locale's sort order. You may wish to first call Sys.setlocale("LC_COLLATE", "C").
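
A small sketch of why the collation order matters: in the C locale uppercase letters sort before lowercase ones, whereas many other locales interleave them. The strings are made up, and saving and restoring the previous setting is a suggested pattern, not something the package requires.

```r
# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

Sys.setlocale("LC_COLLATE", "C")
# In the C locale strings are compared by character code, so "B" precedes "a".
sorted_c <- sort(c("a", "B"))
smallest_c <- min(c("a", "B"))
print(sorted_c)
# [1] "B" "a"

# Restore the previous collation order.
Sys.setlocale("LC_COLLATE", old_collate)
```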

Value

a character vector of the UniqTags of the strings in xs

Examples

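# Use the C collation order so that results do not depend on the locale
Sys.setlocale("LC_COLLATE", "C")
# State names with the space removed
states <- sub(" ", "", state.name)
# UniqTags with the default k = 9, then with shorter tags
uniqtags <- uniqtag(states)
uniqtags4 <- uniqtag(states, k = 4)
uniqtags3 <- uniqtag(states, k = 3)
# Same k, but make_unique is used to make the tags unique
uniqtags3x <- uniqtag(states, k = 3, uniq = make_unique)
# Compare the lengths of the names with the lengths of their tags
table(nchar(states))
table(nchar(uniqtags))
table(nchar(uniqtags4))
table(nchar(uniqtags3))
table(nchar(uniqtags3x))
# Elements of uniqtags3 whose counterpart in uniqtags3x contains a "-"
uniqtags3[grep("-", uniqtags3x)]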

[Package uniqtag version 1.0.1 Index]