R: Parse surname and given name

parseName {Ecfun}

R Documentation

Parse surname and given name

Description

Identify the presumed surname in a character string assumed to represent a name and return the result in a character matrix with surname followed by givenName. If only one name is provided (without punctuation), it is assumed to be the givenName; see Wikipedia, "Given name" and "Surname".

Usage

parseName(x, 
    surnameFirst=(median(regexpr(',', x))>0),
    suffix=c('Jr.', 'I', 'II', 'III', 'IV', 
              'Sr.', 'Dr.', 'Jr', 'Sr'),
    fixNonStandard=subNonStandardNames, 
    removeSecondLine=TRUE, 
    namesNotFound="attr.replacement", ...)

Arguments

`x`	a character vector
`surnameFirst`	logical: If TRUE, the surname comes first followed by a comma (","), then the given name. If FALSE, parse the surname from a standard Western "John Smith, Jr." format. If `missing(surnameFirst)`, use TRUE if half of the elements of `x` contain a comma.
`suffix`	character vector of strings that are NOT a surname but might appear at the end without a comma that would otherwise identify it as a suffix.
`fixNonStandard`	function to look for and repair nonstandard names such as names containing characters with accent marks that are sometimes mangled by different software. Use `identity` if this is not desired.
`removeSecondLine`	logical: If TRUE, delete anything following "\n" and return it as an attribute `secondLine`.
`namesNotFound`	character vector passed to `subNonStandardNames` and used to compute any `namesNotFound` attribute of the object returned by `parseName`.
`...`	optional arguments passed to `fixNonStandard`

Details

If surnameFirst is FALSE:

1. If the last character is ")" and the matching "(" is 3 characters earlier, drop all that stuff. Thus, "John Smith (AL)" becomes "John Smith".

2. Look for commas to identify a suffix like Jr. or III; remove and call the rest x2.

3. split <- strsplit(x2, " ")

4. Take the last as the surname.

5. If the "surname" found per 3 is in suffix, save to append it to the givenName and recurse to get the actual surname.

NOTE: This gives the wrong answer with double surnames written without a hyphen in the Spanish tradition, in which, e.g., "Anastasio Somoza Debayle", "Somoza Debayle" give the (first) surnames of Anastasio's father and mother, respectively: The current algorithm would return "Debayle" as the surname, which is incorrect.

6. Recompose the rest with any suffix as the givenName.

Value

a character matrix with two columns: surname and givenName.

This matrix also has a namesNotFound attribute if one is returned by subNonStandardNames.

Author(s)

Spencer Graves

Examples

##
## 1.  Parse standard first-last name format
##
tstParse <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
         'John Brown, Jr.', 'John Brown Jr.',
         'John W. Brown III', 'John Q. Brown,I',
         'Linda Rosa Smith-Johnson', 'Anastasio Somoza Debayle',
         'Ra_l Vel_zquez', 'Sting', 'Colette, ')
parsed <- parseName(tstParse)

tstParse2 <- matrix(c('Smith', 'Joe', 'Gomez', 'Teresa Angelica Sanchez de',
  'Brown', 'John, Jr.', 'Brown', 'John, Jr.',
  'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Debayle', 'Anastasio Somoza',
  'Velazquez', 'Raul', '', 'Sting', 'Colette', ''),
  ncol=2, byrow=TRUE)
# NOTE:  The 'Anastasio Somoza Debayle' is in the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer should be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tstParse2) <- c("surname", 'givenName')


all.equal(parsed, tstParse2)


##
## 2.  Parse "surname, given name" format
##
tst3 <- c('Smith (AL),Joe', 'Sanchez de Gomez, Teresa Angelica',
     'Brown, John, Jr.', 'Brown, John W., III', 'Brown, John Q., I',
     'Smith-Johnson, Linda Rosa', 'Somoza Debayle, Anastasio',
     'Vel_zquez, Ra_l', ', Sting', 'Colette,')
tst4 <- parseName(tst3)

tst5 <- matrix(c('Smith', 'Joe', 'Sanchez de Gomez', 'Teresa Angelica',
  'Brown', 'John, Jr.', 'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Somoza Debayle', 'Anastasio',
  'Velazquez', 'Raul', '','Sting', 'Colette',''),
  ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')


all.equal(tst4, tst5)


##
## 3.  secondLine 
##
L2 <- parseName(c('Adam\n2nd line', 'Ed  \n --Vacancy', 'Frank'))

# check 
L2. <- matrix(c('', 'Adam', '', 'Ed', '', 'Frank'), 
              ncol=2, byrow=TRUE)
colnames(L2.) <- c('surname', 'givenName')
attr(L2., 'secondLine') <- c('2nd line', ' --Vacancy', NA)

all.equal(L2, L2.)


##
## 4.  Force surnameFirst when in a minority 
##
snf <- c('Sting', 'Madonna', 'Smith, Al')
SNF <- parseName(snf, surnameFirst=TRUE)

# check 
SNF2 <- matrix(c('', 'Sting', '', 'Madonna', 'Smith', 'Al'), 
               ncol=2, byrow=TRUE)
colnames(SNF2) <- c('surname', 'givenName')               

all.equal(SNF, SNF2)


##
## 5.  nameNotFound
##
noSub <- parseName('xx_x')

# check 
noSub. <- matrix(c('', 'xx_x'), 1)
colnames(noSub.) <- c('surname', 'givenName')               
attr(noSub., 'namesNotFound') <- 'xx_x'

all.equal(noSub, noSub.)

[Package Ecfun version 0.3-2 Index]