subNonStandardNames {Ecfun} | R Documentation
sub for nonstandard names
Description
sub(nonStandardNames[, 1], nonStandardNames[, 2], x)
Accented characters common in non-English languages often get mangled in different ways by different software. For example, the "e" in "Andre" may carry an accent that gets replaced by other characters by different software.
This function first converts "Andr*" to "Andr_" for any character "*" not in standardCharacters. It then looks for "Andr_" in nonStandardNames. By default, it will find that and replace it with "Andre".
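The two-stage idea can be sketched in base R. This is only an illustration under assumptions: mask_nonstandard and the one-row lookup table are hypothetical stand-ins, not the Ecfun implementation and not the contents of Ecdat::nonEnglishNames. The helper masks each nonstandard character individually, which is a simplification.

```r
# Hypothetical helper (not part of Ecfun): replace every character that is
# not in `standard` with `replacement`, one character at a time.
mask_nonstandard <- function(x, standard = c(letters, LETTERS, " "),
                             replacement = "_") {
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    chars[!(chars %in% standard)] <- replacement
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical lookup table: column 1 = masked form, column 2 = replacement
lookup <- matrix(c("Andr_", "Andre"), ncol = 2)

# "~" stands in for a mangled accented character
masked <- mask_nonstandard(c("Andr~", "Andre"))

# Substitute each masked form with its standard spelling
for (i in seq_len(nrow(lookup))) {
  masked <- sub(lookup[i, 1], lookup[i, 2], masked)
}
masked
```

Masking first means one table row can cover every way a given accented character might have been mangled.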
Usage
subNonStandardNames(x,
   standardCharacters=c(letters, LETTERS, ' ',
      '.', '?', '!', ',', 0:9, '/', '*', '$',
      '%', '\"', "\'", '-', '+', '&', '_', ';',
      '(', ')', '[', ']', '\n'),
   replacement='_',
   gsubList=list(list(pattern=
      '\\\\\\\\|\\\\',
      replacement='\"')),
   removeSecondLine=TRUE,
   nonStandardNames=Ecdat::nonEnglishNames,
   namesNotFound="attr.replacement", ...)
Arguments
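x |
character vector or matrix or a data.frame
|
standardCharacters, replacement, gsubList, ... |
arguments passed to subNonStandardCharacters
|
removeSecondLine |
logical: If TRUE, anything following a newline ("\n") in an element of x is removed and returned in a secondLine attribute (see Value)
|
nonStandardNames |
two-column object (default Ecdat::nonEnglishNames): anything in x matching an entry in the first column is replaced by the corresponding entry in the second column
|
namesNotFound |
character vector describing how to treat substitutions not found in nonStandardNames[, 1]
NOTE: x = "_" will be identified by
|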
Details
1. If removeSecondLine is TRUE, remove any second line (the text following "\n") from each element of x, keeping it for the secondLine attribute.
2. x. <- subNonStandardCharacters(x, standardCharacters, replacement, ...)
3. Loop over all rows of nonStandardNames, substituting anything matching nonStandardNames[i, 1] with nonStandardNames[i, 2].
4. Eliminate leading and trailing blanks.
5. if(is.matrix(x)) return a matrix; if(is.data.frame(x)) return a data.frame(..., stringsAsFactors=FALSE).
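The steps above can be sketched in base R as follows. This is a rough illustration under assumptions, not the package's code: sketch_sub_names and its lookup argument are hypothetical, step 2 is approximated by masking each nonstandard character individually instead of calling subNonStandardCharacters, the gsubList and namesNotFound handling are omitted, and matrix dimnames are not preserved.

```r
# Hypothetical sketch of the five steps (not the Ecfun implementation).
sketch_sub_names <- function(x, lookup,
                             standard = c(letters, LETTERS, " ", ",", "."),
                             replacement = "_") {
  if (is.data.frame(x)) {
    # Step 5, data.frame case: process each column, return plain strings
    cols <- lapply(x, function(col) {
      as.vector(sketch_sub_names(as.character(col), lookup,
                                 standard, replacement))
    })
    return(data.frame(cols, stringsAsFactors = FALSE))
  }
  dims <- dim(x)            # NULL for a plain vector
  x <- as.vector(x)

  # Step 1: split off anything after the first newline
  pos <- as.integer(regexpr("\n", x, fixed = TRUE))   # -1 when absent
  has2 <- !is.na(pos) & pos > 0L
  second <- rep(NA_character_, length(x))
  if (any(has2)) {
    second[has2] <- substring(x[has2], pos[has2] + 1L)
    x[has2] <- substring(x[has2], 1L, pos[has2] - 1L)
  }

  # Step 2 (simplified): mask characters outside `standard`
  x <- vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    chars[!(chars %in% standard)] <- replacement
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)

  # Step 3: loop over the rows of the lookup table
  for (i in seq_len(nrow(lookup))) {
    x <- sub(lookup[i, 1], lookup[i, 2], x)
  }

  # Step 4: eliminate leading and trailing blanks
  x <- trimws(x)

  # Step 5: restore the matrix shape; attach second lines if any were found
  if (!is.null(dims)) dim(x) <- dims
  if (any(has2)) attr(x, "secondLine") <- second
  x
}

# Hypothetical one-row lookup table standing in for nonStandardNames
lookup <- matrix(c("Ra_l", "Raul"), ncol = 2)
out <- sketch_sub_names(c("Ra`l", "Torres, Ra`l", "Ed \n --Vacancy", " "),
                        lookup)
out
```

Trimming blanks after the substitutions (step 4) is what turns 'Ed ' into 'Ed' once the second line has been split off.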
NOTE: On 13 May 2013 Jeff Newmiller at the University of California, Davis, wrote, 'I think it is a fools errand to think that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.' (This was a reply on r-help@r-project.org, subject: "Re: [R] Matching names with non-English characters".) Doubtless someone has software to do a better job of this than what this function does, but I've so far been unable to find it in R. If you know of a better solution to this problem, I'd be pleased to hear from you. Spencer Graves
Value
a character vector with all nonStandardCharacters replaced first by replacement and then by the second column of nonStandardNames for any that match the first column. If a secondLine is found on any elements, it is returned as a secondLine attribute.
If any names with nonStandardCharacters are not found in nonStandardNames[, 1], they are identified in an optional attribute per the namesNotFound argument.
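Because both attributes are optional, code that consumes the result may want to check for them before use. The following is a small sketch using a hand-built stand-in for a returned value (res is constructed here for illustration, not produced by subNonStandardNames); the attribute names are taken from the Examples on this page.

```r
# Hand-built stand-in for a returned value (illustration only)
res <- c("Raul", "Ed", "xx_x")
attr(res, "secondLine") <- c(NA, " --Vacancy", NA)
attr(res, "namesNotFound") <- "xx_x"

# attr() returns NULL when an attribute is absent, so test before using it
second <- attr(res, "secondLine")
if (!is.null(second)) {
  with_second <- res[!is.na(second)]   # elements that had a second line
}

# Names that were masked but not matched in the lookup table
unresolved <- attr(res, "namesNotFound")

# Drop all attributes to get a plain character vector
plain <- as.vector(res)
```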
Author(s)
Spencer Graves
See Also
sub
nonEnglishNames
subNonStandardCharacters
stripBlanks
Examples
##
## 1. Example
##
tstSNSN <- c('Raul', 'Ra`l', 'Torres,Raul',
             'Torres, Ra`l', "Robert C. \\Bobby\\\\",
             'Ed \n --Vacancy', '', ' ')
# confusion in character sets can create
# names like tstSNSN[2]
##
## 2. subNonStandardNames(vector)
##
SNS2 <- subNonStandardNames(tstSNSN)
SNS2
# check
SNS2. <- c('Raul', 'Raul', 'Torres,Raul', 'Torres, Raul',
           'Robert C. "Bobby"', 'Ed', '', '')
attr(SNS2., 'secondLine') <- c(rep(NA, 5), ' --Vacancy',
                               NA, NA)
all.equal(SNS2, SNS2.)
##
## 3. subNonStandardNames(matrix)
##
tstmat <- parseName(tstSNSN, surnameFirst=TRUE)
submat <- subNonStandardNames(tstmat)
# check
SNSmat <- parseName(SNS2., surnameFirst=TRUE)
all.equal(submat, SNSmat)
##
## 4. subNonStandardNames(data.frame)
##
tstdf <- as.data.frame(tstmat)
subdf <- subNonStandardNames(tstdf)
# check
SNSdf <- as.data.frame(SNSmat, stringsAsFactors=FALSE)
all.equal(subdf, SNSdf)
##
## 5. namesNotFound
##
noSub <- subNonStandardNames('xx_x')
# check
noSub. <- 'xx_x'
attr(noSub., 'namesNotFound') <- 'xx_x'
all.equal(noSub, noSub.)