splitData {RecordLinkage} | R Documentation |
Split Data
Description
Splits a data set into two sets with desired proportions.
Usage
splitData(dataset, prop, keep.mprop = FALSE, num.non = 0, des.mprop = 0,
use.pred = FALSE)
Arguments
dataset |
Object of class RecLinkData. |
prop |
Real number between 0 and 1. Proportion of data pairs to form the training set. |
keep.mprop |
Logical. Whether the ratio of matches should be retained. |
num.non |
Positive integer. Desired number of non-matches in the training set. |
des.mprop |
Real number between 0 and 1. Desired proportion of matches to non-matches in the training set. |
use.pred |
Logical. Whether to apply match ratio to previous classification results instead of true matching status. |
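To make the effect of keep.mprop more concrete, the following base R sketch splits a toy table of pairs so that both parts keep roughly the same share of matches. It is an illustration only and is not taken from the package source: the helper stratified_idx and the toy data are hypothetical, and the column name is_match is used here merely to mirror the matching-status column of the pairs data frame.

```r
# Toy data: 1000 record pairs, of which 50 (5%) are matches.
set.seed(1)
toy <- data.frame(id = 1:1000, is_match = rep(c(1, 0), times = c(50, 950)))

# Hypothetical helper: within each matching status, draw a share `prop`
# of the row indices, so the match ratio carries over to the sample.
stratified_idx <- function(is_match, prop) {
  groups <- split(seq_along(is_match), is_match)
  unlist(lapply(groups, function(idx) {
    idx[sample.int(length(idx), round(prop * length(idx)))]
  }))
}

train_idx <- stratified_idx(toy$is_match, prop = 1/3)
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

mean(toy$is_match)    # overall match rate
mean(train$is_match)  # stays close to the overall rate
mean(valid$is_match)
```

Sampling within each group rather than over all rows is what keeps the ratio stable; a plain random split can drift noticeably when matches are rare.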
Value
A list of RecLinkData objects.
train |
The sampled training data. |
valid |
All other record pairs |
The sampled data are stored in the pairs attributes of train and valid. If present, the attributes prediction and Wdata are split and the corresponding values saved. All other attributes are copied to both data sets.
If the number of desired matches or non-matches is higher than the number actually present in the data, the maximum possible number is chosen and a warning is issued.
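The capping rule can be sketched in a few lines of base R. This is not the package's implementation: the helper desired_counts and its arguments are hypothetical, the warning texts are invented, and the sketch assumes that the desired number of matches is the rounded product of num.non and des.mprop, which is consistent with the example on this page that requests 100 non-matches and 10 matches via num.non=100 and des.mprop=0.1.

```r
# Hypothetical helper, not part of RecordLinkage: derive the requested
# training-set sizes and cap them at the number of pairs available.
desired_counts <- function(num.non, des.mprop, avail.match, avail.non) {
  n.non <- num.non
  # Assumption: matches = ratio of matches to non-matches * non-matches.
  n.match <- round(num.non * des.mprop)
  if (n.non > avail.non) {
    warning("fewer non-matches available than requested; using all of them")
    n.non <- avail.non
  }
  if (n.match > avail.match) {
    warning("fewer matches available than requested; using all of them")
    n.match <- avail.match
  }
  c(matches = n.match, non.matches = n.non)
}

# Request that can be satisfied in full.
desired_counts(num.non = 100, des.mprop = 0.1, avail.match = 50, avail.non = 1000)
```

Calling the helper with, say, avail.match = 5 would return 5 matches and raise a warning, which is the behaviour the paragraph above describes for the real function.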
Author(s)
Andreas Borg, Murat Sariyar
See Also
genSamples for generating training data based on unsupervised classification.
Examples
data(RLdata500)
pairs=compare.dedup(RLdata500, identity=identity.RLdata500,
blockfld=list(1,3,5,6,7))
# split into halves, do not enforce match ratio
l=splitData(pairs, prop=0.5)
summary(l$train)
summary(l$valid)
# split into 1/3 and 2/3, retain match ratio
l=splitData(pairs, prop=1/3, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)
# generate a training set with 100 non-matches and 10 matches
l=splitData(pairs, num.non=100, des.mprop=0.1, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)