parseName {Ecfun} | R Documentation |
Parse surname and given name
Description
Identify the presumed surname in a character string assumed to represent a name and return the result in a character matrix with surname followed by givenName.
If only one name is provided (without punctuation), it is assumed to be the givenName; see Wikipedia, "Given name" and "Surname".
Usage
parseName(x,
surnameFirst=(median(regexpr(',', x))>0),
suffix=c('Jr.', 'I', 'II', 'III', 'IV',
'Sr.', 'Dr.', 'Jr', 'Sr'),
fixNonStandard=subNonStandardNames,
removeSecondLine=TRUE,
namesNotFound="attr.replacement", ...)
Arguments
x |
a character vector |
surnameFirst |
logical: If TRUE, the surname comes first
followed by a comma (","), then the given
name. If FALSE, parse the surname from a
standard Western "John Smith, Jr." format.
If |
suffix |
character vector of strings that are NOT a surname but might appear at the end without a comma that would otherwise identify it as a suffix. |
fixNonStandard |
function to look for and repair
nonstandard names such as names
containing characters with accent marks
that are sometimes mangled
by different software. Use
|
removeSecondLine |
logical: If TRUE, delete anything
following "\n" and return it as
the secondLine attribute |
namesNotFound |
character vector passed to
|
... |
optional arguments
passed to |
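The default for surnameFirst shown under Usage, median(regexpr(',', x)) > 0, is TRUE when the median position of the first comma is positive, which roughly means that most of the names contain a comma. The base-R sketch below evaluates that expression on a few names taken from the Examples. The helper name defaultSurnameFirst is hypothetical and is not part of Ecfun.

```r
# Hypothetical helper (not part of Ecfun) mirroring the default in Usage.
# regexpr() returns the position of the first comma, or -1 if there is none;
# as.integer() only strips the match attributes.
defaultSurnameFirst <- function(x) {
  median(as.integer(regexpr(',', x, fixed = TRUE))) > 0
}

givenFirst  <- c('Joe Smith (AL)', 'John Brown, Jr.', 'Sting')
surnameLead <- c('Brown, John, Jr.', 'Smith-Johnson, Linda Rosa', 'Colette,')
mixed       <- c('Sting', 'Madonna', 'Smith, Al')

defaultSurnameFirst(givenFirst)   # FALSE: only one of three names has a comma
defaultSurnameFirst(surnameLead)  # TRUE: every name has a comma
defaultSurnameFirst(mixed)        # FALSE: comma names are in the minority
```

For a vector like mixed, the default evaluates to FALSE, which is why the Examples pass surnameFirst=TRUE explicitly in that case.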
Details
If surnameFirst is FALSE:
1. If the last character is ")" and the matching "(" is 3 characters earlier, drop the parenthetical. Thus, "John Smith (AL)" becomes "John Smith".
2. Look for commas to identify a suffix like Jr. or III; remove the suffix and call the rest x2.
3. split <- strsplit(x2, " ")
4. Take the last as the surname.
5. If the "surname" found in step 4 is in suffix, save it to append to the givenName and recurse to get the actual surname.
NOTE: This gives the wrong answer for double surnames written without a hyphen in the Spanish tradition. For example, in "Anastasio Somoza Debayle", "Somoza Debayle" gives the (first) surnames of Anastasio's father and mother, respectively. The current algorithm returns "Debayle" as the surname, which is incorrect.
6. Recompose the rest with any suffix as the givenName.
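The steps above can be sketched in base R as follows. This is a simplified illustration for a single name, not the Ecfun implementation: the function name sketchParse is hypothetical, and it ignores fixNonStandard, removeSecondLine and edge cases such as a trailing comma with nothing after it.

```r
# Hypothetical helper, NOT the Ecfun code: a base-R sketch of steps 1-6
# for one name in "John Smith, Jr." format.
sketchParse <- function(x, suffix = c('Jr.', 'I', 'II', 'III', 'IV',
                                      'Sr.', 'Dr.', 'Jr', 'Sr')) {
  # Step 1: drop a trailing 2-character parenthetical such as " (AL)"
  x <- sub("[[:space:]]*\\(..\\)$", "", x)
  # Step 2: anything after a comma is treated as a suffix; the rest is x2
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  x2 <- parts[1]
  sfx <- parts[-1]
  sfx <- sfx[nzchar(sfx)]
  # Step 3: split x2 on spaces
  words <- strsplit(x2, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # Steps 4-5: the last word is the candidate surname; while it is a known
  # suffix, set it aside and look again (a loop in place of recursion)
  while (length(words) > 1 && words[length(words)] %in% suffix) {
    sfx <- c(words[length(words)], sfx)
    words <- words[-length(words)]
  }
  if (length(words) < 2) {
    # A single name is assumed to be the given name
    surname <- ""
    given <- paste(words, collapse = " ")
  } else {
    surname <- words[length(words)]
    given <- paste(words[-length(words)], collapse = " ")
  }
  # Step 6: recompose the given name with any suffix
  if (length(sfx) > 0) given <- paste(c(given, sfx), collapse = ", ")
  c(surname = surname, givenName = given)
}

sketchParse('John Brown Jr.')
sketchParse('Anastasio Somoza Debayle')  # shows the limitation in the NOTE
```

The loop replaces the recursion described in step 5 only to keep the sketch short; the two are equivalent for the suffix handling shown here.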
Value
a character matrix with two columns: surname and givenName.
This matrix also has a namesNotFound attribute if one is returned by subNonStandardNames.
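Section 3 of the Examples also shows a secondLine attribute on the result when removeSecondLine=TRUE. The base-R sketch below illustrates one way such a split at "\n" could be carried out and attached to a matrix. It is an assumption about the general approach, not the Ecfun code, and it treats every first line as a lone given name.

```r
# Illustrative only: split each string at the first "\n" and keep the
# remainder as a 'secondLine' attribute (NA where there is no second line).
x <- c('Adam\n2nd line', 'Ed \n --Vacancy', 'Frank')
nl <- as.integer(regexpr('\n', x, fixed = TRUE))   # -1 if no newline
firstLine  <- ifelse(nl > 0, substring(x, 1, nl - 1), x)
secondLine <- ifelse(nl > 0, substring(x, nl + 1), NA_character_)

# Single names only, so surname is empty, as in the Examples
out <- cbind(surname = '', givenName = trimws(firstLine))
attr(out, 'secondLine') <- secondLine

out
attr(out, 'secondLine')
```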
Author(s)
Spencer Graves
See Also
strsplit
identity
subNonStandardNames
Examples
##
## 1. Parse standard first-last name format
##
tstParse <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
'John Brown, Jr.', 'John Brown Jr.',
'John W. Brown III', 'John Q. Brown,I',
'Linda Rosa Smith-Johnson', 'Anastasio Somoza Debayle',
'Ra_l Vel_zquez', 'Sting', 'Colette, ')
parsed <- parseName(tstParse)
tstParse2 <- matrix(c('Smith', 'Joe', 'Gomez', 'Teresa Angelica Sanchez de',
'Brown', 'John, Jr.', 'Brown', 'John, Jr.',
'Brown', 'John W., III', 'Brown', 'John Q., I',
'Smith-Johnson', 'Linda Rosa', 'Debayle', 'Anastasio Somoza',
'Velazquez', 'Raul', '', 'Sting', 'Colette', ''),
ncol=2, byrow=TRUE)
# NOTE: The name 'Anastasio Somoza Debayle' follows the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer would be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tstParse2) <- c("surname", 'givenName')
all.equal(parsed, tstParse2)
##
## 2. Parse "surname, given name" format
##
tst3 <- c('Smith (AL),Joe', 'Sanchez de Gomez, Teresa Angelica',
'Brown, John, Jr.', 'Brown, John W., III', 'Brown, John Q., I',
'Smith-Johnson, Linda Rosa', 'Somoza Debayle, Anastasio',
'Vel_zquez, Ra_l', ', Sting', 'Colette,')
tst4 <- parseName(tst3)
tst5 <- matrix(c('Smith', 'Joe', 'Sanchez de Gomez', 'Teresa Angelica',
'Brown', 'John, Jr.', 'Brown', 'John W., III', 'Brown', 'John Q., I',
'Smith-Johnson', 'Linda Rosa', 'Somoza Debayle', 'Anastasio',
'Velazquez', 'Raul', '','Sting', 'Colette',''),
ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')
all.equal(tst4, tst5)
##
## 3. secondLine
##
L2 <- parseName(c('Adam\n2nd line', 'Ed \n --Vacancy', 'Frank'))
# check
L2. <- matrix(c('', 'Adam', '', 'Ed', '', 'Frank'),
ncol=2, byrow=TRUE)
colnames(L2.) <- c('surname', 'givenName')
attr(L2., 'secondLine') <- c('2nd line', ' --Vacancy', NA)
all.equal(L2, L2.)
##
## 4. Force surnameFirst when in a minority
##
snf <- c('Sting', 'Madonna', 'Smith, Al')
SNF <- parseName(snf, surnameFirst=TRUE)
# check
SNF2 <- matrix(c('', 'Sting', '', 'Madonna', 'Smith', 'Al'),
ncol=2, byrow=TRUE)
colnames(SNF2) <- c('surname', 'givenName')
all.equal(SNF, SNF2)
##
## 5. namesNotFound
##
noSub <- parseName('xx_x')
# check
noSub. <- matrix(c('', 'xx_x'), 1)
colnames(noSub.) <- c('surname', 'givenName')
attr(noSub., 'namesNotFound') <- 'xx_x'
all.equal(noSub, noSub.)