R: Match Two Data Sets by Location

matchDatasets {shiftR}

R Documentation

Match Two Data Sets by Location

Description

The goal of this function is to match records between two data sets for subsequent enrichment analysis.

For each record in the primary data set (data1), it finds the records in the auxiliary data set (data2) that overlap with it or lie within the flanking distance (flank). If multiple such auxiliary records are found, the one whose center is closest to the center of the primary record is selected. If no such record is available, the primary record is left unmatched.
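The matching rule described above can be sketched in plain R. This is an illustration only, not the shiftR implementation: the function name `naiveMatch` is hypothetical, intervals are assumed to be closed (so touching records count as overlapping), and ties between equally close centers are broken by taking the first candidate.

```r
# Illustrative sketch of the matching rule; NOT the shiftR source code.
# Assumptions: closed intervals, first candidate wins on ties.
naiveMatch = function(data1, data2, flank = 0){
    # Row of the best auxiliary record for each primary record (NA if none)
    best = rep(NA_integer_, nrow(data1))
    chr2 = as.character(data2[[1]])
    center2 = (data2[[2]] + data2[[3]]) / 2
    for( i in seq_len(nrow(data1)) ){
        # Gap between the two intervals; zero or negative means overlap
        gap = pmax(data2[[2]] - data1[[3]][i], data1[[2]][i] - data2[[3]])
        ok = which((chr2 == as.character(data1[[1]][i])) & (gap <= flank))
        if( length(ok) > 0 ){
            center1 = (data1[[2]][i] + data1[[3]][i]) / 2
            best[i] = ok[which.min(abs(center2[ok] - center1))]
        }
    }
    keep = !is.na(best)
    list(data1 = data1[keep, , drop = FALSE],
         data2 = data2[best[keep], , drop = FALSE])
}

data1 = read.csv(text =
"chr,start,end,stat
chr1,100,200,1
chr1,150,250,2
chr1,200,300,3
chr1,300,400,4
chr1,997,997,5
chr1,998,998,6
chr1,999,999,7")

data2 = read.csv(text =
"chr,start,end,stat
chr1,130,130,1
chr1,140,140,2
chr1,165,165,3
chr1,200,200,4
chr1,240,240,5
chr1,340,340,6
chr1,350,350,7
chr1,360,360,8
chr1,900,900,9")

m0 = naiveMatch(data1, data2, 0)      # 4 primary records matched
m100 = naiveMatch(data1, data2, 100)  # all 7 primary records matched
```

In this sketch the loop is quadratic in the number of records, which is fine for illustration but would be slow on large data sets.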

Usage

matchDatasets(data1, data2, flank = 0)

Arguments

data1

A data frame with the primary data set. Must have at least 4 columns:

Chromosome name.
Start position.
End position.
P-value or test statistic.
Optional additional columns.

data2

A data frame with the auxiliary data set.
Must satisfy the same format criteria as the primary data set.

flank

Maximum allowed distance between matched records.
Set to zero to require overlap.

Value

Returns a list with matched data sets.

`data1`	The primary data set without unmatched records.
`data2`	The auxiliary data set records matching those in `data1` above. Note that some auxiliary records can get duplicated if they are the best match for multiple records in the primary data.
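Because auxiliary records can be duplicated, it may be useful to count the repeats before running the enrichment analysis. The snippet below is a sketch under assumptions: `matched` is a hand-built list with the same two components as the return value, standing in for real output, and the rows of the two components are assumed to be aligned pairwise.

```r
# Hand-built stand-in for a returned list (not actual matchDatasets output):
# three primary records that all share one auxiliary record.
matched = list(
    data1 = data.frame(chr = "chr1", start = c(997, 998, 999),
                       end = c(997, 998, 999), stat = 5:7),
    data2 = data.frame(chr = "chr1", start = 900,
                       end = 900, stat = 9)[c(1, 1, 1), ])

# Both components should have one row per matched pair
samePairs = nrow(matched$data1) == nrow(matched$data2)

# Flag auxiliary rows that repeat an earlier row (same chr, start, end)
isRepeat = duplicated(matched$data2[, 1:3])
nRepeats = sum(isRepeat)  # 2 repeated rows in this stand-in example
```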

Note

For technical reasons, chromosome positions are assumed to be no greater than 1e9.
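A quick check before matching can confirm that a data set respects this limit. The helper `positionsOK` below is hypothetical (it is not part of shiftR) and assumes the start and end positions are in columns 2 and 3, as described under Arguments.

```r
# Hypothetical helper: TRUE if all start/end positions are present
# and do not exceed the limit stated in the Note.
positionsOK = function(data, limit = 1e9){
    pos = c(data[[2]], data[[3]])
    all(!is.na(pos) & pos <= limit)
}

good = data.frame(chr = "chr1", start = 100, end = 200, stat = 1)
bad  = data.frame(chr = "chr1", start = 100, end = 2e9, stat = 1)

positionsOK(good)  # TRUE
positionsOK(bad)   # FALSE
```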

Author(s)

Andrey A Shabalin andrey.shabalin@gmail.com

Examples


data1 = read.csv(text =
"chr,start,end,stat
chr1,100,200,1
chr1,150,250,2
chr1,200,300,3
chr1,300,400,4
chr1,997,997,5
chr1,998,998,6
chr1,999,999,7")

data2 = read.csv(text =
"chr,start,end,stat
chr1,130,130,1
chr1,140,140,2
chr1,165,165,3
chr1,200,200,4
chr1,240,240,5
chr1,340,340,6
chr1,350,350,7
chr1,360,360,8
chr1,900,900,9")

# Match data sets with no flank (overlap required).
matchDatasets(data1, data2, 0)

# Match data sets with a flank of 100.
# The last three primary records are now matched.
matchDatasets(data1, data2, 100)

[Package shiftR version 1.5 Index]