R: Value Matching

matches {grr}

R Documentation

Value Matching

Description

Returns a lookup table or list of the positions of all matches of its first argument in its second, and vice versa. Similar to `match`, except that `match` returns only the first match.
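To make the contrast with `match` concrete, the following base R sketch compares first-match positions with all match positions. It does not call `matches` itself: the vectors `x` and `y` are invented for illustration, and the `lapply`/`which` idiom is only an assumed imitation of the list-style result.

```r
# Illustrative data (an assumption, not taken from the package)
x <- c("a", "b", "c")
y <- c("b", "a", "b", "d")

# match() reports only the first position of each value of x in y
first_only <- match(x, y)

# Base R imitation of "all matches": every position in y equal to each value of x
all_positions <- lapply(x, function(v) which(y == v))

print(first_only)
print(all_positions)
```

Here `"b"` occurs twice in `y`, so `match` reports one position for it while the imitation reports both.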

Usage

matches(x, y, all.x = TRUE, all.y = TRUE, list = FALSE, indexes = TRUE,
  nomatch = NA)

Arguments

`x`	vector. The values to be matched against `y`. Long vectors are not currently supported.
`y`	vector. The values to be matched against `x`. Long vectors are not currently supported.
`all.x`	logical; if `TRUE`, then each value in `x` will be included even if it has no matching values in `y`
`all.y`	logical; if `TRUE`, then each value in `y` will be included even if it has no matching values in `x`
`list`	logical. If `TRUE`, the result will be returned as a list of vectors, each vector being the matching values in `y`. If `FALSE`, the result is returned as a data frame with repeated values for each match.
`indexes`	logical. Whether to return the indices of the matches or the actual values.
`nomatch`	the value to be returned in the case when no match is found. If not provided and `indexes=TRUE`, items with no match will be represented as `NA`. If set to `NULL`, items with no match will be set to an index value of `length+1`. If `indexes=FALSE`, they will default to `NA`.

Details

This behavior can be imitated by using joins to create lookup tables, but `matches` is simpler and faster: usually faster than the best joins in other packages, and thousands of times faster than the built-in `merge`.

`all.x` and `all.y` correspond to the four types of database joins in the following way:

left: all.x=TRUE, all.y=FALSE
right: all.x=FALSE, all.y=TRUE
inner: all.x=FALSE, all.y=FALSE
full: all.x=TRUE, all.y=TRUE
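The join correspondence above can be sketched with base R `merge`, which takes `all.x` and `all.y` arguments of the same names. This is an imitation using joins, not a call to `matches`; the key vectors and the `x_index`/`y_index` column names are hypothetical, chosen only to show which rows survive each join type.

```r
# Illustrative keys (an assumption, not taken from the package)
x <- c(1, 2, 3)
y <- c(2, 3, 3, 4)

# Lookup tables pairing each key with its position in the original vector
x_tab <- data.frame(key = x, x_index = seq_along(x))
y_tab <- data.frame(key = y, y_index = seq_along(y))

# The four join types expressed through all.x / all.y
left  <- merge(x_tab, y_tab, by = "key", all.x = TRUE,  all.y = FALSE)
right <- merge(x_tab, y_tab, by = "key", all.x = FALSE, all.y = TRUE)
inner <- merge(x_tab, y_tab, by = "key", all.x = FALSE, all.y = FALSE)
full  <- merge(x_tab, y_tab, by = "key", all.x = TRUE,  all.y = TRUE)

# Row counts differ according to which unmatched keys are retained
print(c(left = nrow(left), right = nrow(right),
        inner = nrow(inner), full = nrow(full)))
```

Unmatched keys appear with `NA` in the index column of the other table, so the left join keeps key 1 and the right join keeps key 4.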

Note that NA values will match other NA values.
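Base R's `match` and `%in%` treat `NA` the same way, whereas `==` does not. The sketch below uses base R only, with invented vectors, to show what it means for `NA` to match `NA`; that `matches` follows the same convention is taken from the note above and is not verified here.

```r
# Illustrative data containing missing values (an assumption)
x <- c(1, NA, 3)
y <- c(NA, 3, NA)

# match() pairs the NA in x with the first NA in y
first_pos <- match(x, y)

# %in% also treats NA as equal to NA, so all NA positions in y are found
all_pos <- lapply(x, function(v) which(y %in% v))

# By contrast, comparing NA with == yields NA rather than TRUE
na_equal <- NA == NA

print(first_pos)
print(all_pos)
print(na_equal)
```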

Examples

one<-as.integer(1:10000)
two<-as.integer(sample(1:10000,1e3,TRUE))
system.time(a<-lapply(one, function (x) which(two %in% x)))
system.time(b<-matches(one,two,all.y=FALSE,list=TRUE))

#Only retain items from one with a match in two
b<-matches(one,two,all.x=FALSE,all.y=FALSE,list=TRUE)
length(b)==length(unique(two))

one<-round(runif(1e3),3)
two<-round(runif(1e3),3)
system.time(a<-lapply(one, function (x) which(two %in% x)))
system.time(b<-matches(one,two,all.y=FALSE,list=TRUE))
 
one<-as.character(1:1e5)
two<-as.character(sample(1:1e5,1e5,TRUE))
system.time(b<-matches(one,two,list=FALSE))
system.time(c<-merge(data.frame(key=one),data.frame(key=two),all=TRUE))

## Not run: 
one<-as.integer(1:1000000)
two<-as.integer(sample(1:1000000,1e5,TRUE))
system.time(b<-matches(one,two,indexes=FALSE))
if(requireNamespace("dplyr",quietly=TRUE))
 system.time(c<-dplyr::full_join(data.frame(key=one),data.frame(key=two)))
if(require(data.table,quietly=TRUE))
 system.time(d<-merge(data.table(data.frame(key=one))
             ,data.table(data.frame(key=two))
             ,by='key',all=TRUE,allow.cartesian=TRUE))

one<-as.character(1:1000000)
two<-as.character(sample(1:1000000,1e5,TRUE))
system.time(a<-merge(one,two)) #Times out
system.time(b<-matches(one,two,indexes=FALSE))
if(requireNamespace("dplyr",quietly=TRUE))
 system.time(c<-dplyr::full_join(data.frame(key=one),data.frame(key=two)))
if(require(data.table,quietly=TRUE))
{
 system.time(d<-merge(data.table(data.frame(key=one))
             ,data.table(data.frame(key=two))
             ,by='key',all=TRUE,allow.cartesian=TRUE))
 identical(b[,1],as.character(d$key))
}

## End(Not run)

[Package grr version 0.9.5 Index]