R: Identify the row of 'y' best matching each row of 'x'

match.data.frame {Ecfun}

R Documentation

Identify the row of `y` best matching each row of `x`

Description

For each row of x[, by.x], find the best matching row of y[, by.y], with the best match defined by grep. and split.

grep. and split must either be missing or have the same length as by.x and by.y. If grep.[i] and split[i] are NA, do a complete match of x[, by.x[i]] and y[, by.y[i]]. Otherwise, for each row j, look for a match for strsplit(x[j, by.x[i]], split[i])[[1]][1] among strsplit(y[, by.y[i]], split[i]). See details.
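The segment comparison described above can be illustrated with base R alone. The following is a minimal sketch, not the Ecfun implementation: the values, the names `x_val`, `y_vals`, `x_first`, `y_segs` and `hits`, and the choice of a space as the split character are all assumptions made for illustration.

```r
# Hypothetical values; the split character " " is an assumption.
x_val  <- "Mike R."
y_vals <- c("John", "Mike", "T. Albert")

# First segment of the x value after splitting.
x_first <- strsplit(x_val, split = " ", fixed = TRUE)[[1]][1]

# All segments of each y value.
y_segs <- strsplit(y_vals, split = " ", fixed = TRUE)

# TRUE where the first x segment occurs (as a fixed substring)
# in at least one segment of the corresponding y value.
hits <- vapply(y_segs,
               function(segs) length(grep(x_first, segs, fixed = TRUE)) > 0,
               logical(1))
which(hits)
```

Here only the second element of `y_vals` contains the first segment of `x_val`, so the candidate is unique.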

Usage

match.data.frame(x, y, by, by.x=by, by.y=by, 
        grep., split, sep=':')

Arguments

`x`, `y`	data.frames
`by`, `by.x`, `by.y`	names of columns of `x` and `y` to match.
`grep.`	a character vector of the type of match for each element of `by.x` and `by.y`. If `NA`, require a perfect match. Alternatives are `grep` and `agrep` to find a match for the first segment in `strsplit(x, split=split[i])` among any of the segments of `strsplit(y, split=split[i])`. Use `fixed=TRUE` with the calls to these functions. NOTE: These alternatives are not examined if a unique match is found between `x[, by.x[is.na(grep.) & is.na(split)]]` and the corresponding columns of `y`.
`split`	A character vector of `split` characters to pass to `strsplit`; `strsplit` is not called if `is.na(split)`.
`sep`	a `sep` argument to use with `paste` to produce a matching key for the columns of `x` and `y` for which perfect matches are required. If `sep` is missing and `grep.` is supplied, `sep` is set to `' '`, except where `grep.` is `NA`.

Details

1. Check by.x, by.y, grep. and split. If by is missing and either by.x or by.y is missing, set by <- names(x).

2. fullMatch <- (is.na(grep.) & is.na(split)). Create keyfx and keyfy by pasting columns of x[, by.x[fullMatch]] and y[, by.y[fullMatch]]. Also create x. and y. by applying strsplit to x[, by.x[!fullMatch]] and y[, by.y[!fullMatch]].

3. Iterate over rows of x looking for the best match. This includes an inner loop over columns of x[, by.x[!fullMatch]], stopping on the first unique match. Return (-1) if no unique match is found.
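The full-match part of step 2 can be sketched in base R as follows. This is an illustration under stated assumptions, not the package source: the data frames are hypothetical, every column in `by` is treated as a full-match column, and a missing or non-unique match is coded as `NA`, following the Value section.

```r
# Hypothetical data; sep = ":" mirrors the default shown under Usage.
x <- data.frame(state = c("AL", "MI"), surname = c("Rogers", "Smith"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("MI", "AL", "AL"),
                surname = c("Smith", "Rogers", "Jones"),
                stringsAsFactors = FALSE)
by <- c("state", "surname")

# Paste the full-match columns into one key per row.
keyfx <- do.call(paste, c(x[by], sep = ":"))
keyfy <- do.call(paste, c(y[by], sep = ":"))

# For each x key, find the y rows with an identical key;
# keep the index only when exactly one row matches (assumption: NA otherwise).
idx <- vapply(keyfx, function(k) {
  hit <- which(keyfy == k)
  if (length(hit) == 1L) hit else NA_integer_
}, integer(1), USE.NAMES = FALSE)
idx
```

Rows of `x` left without a unique key match would then be the ones passed on to the segment-based comparison in step 3.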

Value

an integer vector of length nrow(x) containing the index of the best matching row of y or NA if no adequate match was found.
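Because the result is an index into the rows of y, it can be used directly for subsetting. A minimal sketch with a hypothetical lookup table and a hypothetical result vector `idx` (the names `matched` and `y_for_x` are also illustrative):

```r
# Hypothetical lookup table and a hypothetical result vector
# of the kind described above (one entry per row of x).
y <- data.frame(id = c("a", "b", "c"), value = c(10, 20, 30),
                stringsAsFactors = FALSE)
idx <- c(3L, NA, 1L)

# Which rows of x found a match.
matched <- !is.na(idx)

# Rows of y aligned with the rows of x; an NA index yields an all-NA row.
y_for_x <- y[idx, ]
y_for_x$value
```

Checking `matched` before using `y_for_x` avoids silently carrying the all-NA rows into later computations.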

Author(s)

Spencer Graves

Examples

newdata <- data.frame(state=c("AL", "MI","NY"),
                      surname=c("Rogers", "Rogers", "Smith"),
                      givenName=c("Mike R.", "Mike K.", "Al"),
                      stringsAsFactors=FALSE)
reference <- data.frame(state=c("NY", "NY", "MI", "AL", "NY", "MI"),
                      surname=c("Smith", "Rogers", "Rogers (MI)",
                                "Rogers (AL)", "Smith", 'Jones'),
                      givenName=c("John", "Mike", "Mike", "Mike",
                                "T. Albert", 'Al Thomas'),
                      stringsAsFactors=FALSE)
newInRef <- match.data.frame(newdata, reference,
       grep.=c(NA, 'agrep', 'agrep'))


all.equal(newInRef, c(4, 3, 5))

[Package Ecfun version 0.3-2 Index]
