| splitData {RecordLinkage} | R Documentation |
Split Data
Description
Splits a data set into two sets with desired proportions.
Usage
splitData(dataset, prop, keep.mprop = FALSE, num.non = 0, des.mprop = 0,
use.pred = FALSE)
Arguments
dataset |
Object of class |
prop |
Real number between 0 and 1. Proportion of data pairs to form the training set. |
keep.mprop |
Logical. Whether the ratio of matches should be retained. |
num.non |
Positive integer. Desired number of non-matches in the training set. |
des.mprop |
Real number between 0 and 1. Desired proportion of matches to non-matches in the training set. |
use.pred |
Logical. Whether to apply the match ratio to previous classification results instead of the true matching status. |
Value
A list of RecLinkData objects.
train |
The sampled training data. |
valid |
All other record pairs. |
The sampled data are stored in the pairs attributes of train
and valid. If present, the attributes prediction and Wdata
are split and the corresponding values saved. All other attributes are
copied to both data sets.
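As a rough illustration of the return value, the following sketch checks that train and valid together account for every record pair of the input. It assumes the RecordLinkage package is installed and that the pairs data frame carries an is_match column, as it does for objects produced by compare.dedup. The seed is arbitrary and only makes the random split repeatable.

```r
library(RecordLinkage)

data(RLdata500)
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(1, 3, 5, 6, 7))

set.seed(1)  # arbitrary seed so the random split can be repeated
l <- splitData(pairs, prop = 0.5)

# Number of record pairs in the input and in each part
n_all   <- nrow(pairs$pairs)
n_train <- nrow(l$train$pairs)
n_valid <- nrow(l$valid$pairs)

# The two parts should add up to the input
n_train + n_valid == n_all

# Matches in each part (is_match is 1 for a match)
sum(l$train$pairs$is_match == 1, na.rm = TRUE)
sum(l$valid$pairs$is_match == 1, na.rm = TRUE)
```

Comparing the two match counts is a quick way to see whether a split made without keep.mprop happens to be unbalanced.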
If the number of desired matches or non-matches is higher than the number actually present in the data, the maximum possible number is chosen and a warning is issued.
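The capping behaviour can be observed with a sketch like the one below, which requests far more non-matches than can exist: 500 records yield at most 124750 pairs, so num.non = 1e6 cannot be satisfied. The variable msgs and the handler are illustrative helpers, not part of the package, and the exact warning text is not assumed here.

```r
library(RecordLinkage)

data(RLdata500)
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(1, 3, 5, 6, 7))

# Collect warnings instead of printing them (msgs is a hypothetical helper)
msgs <- character(0)
l <- withCallingHandlers(
  splitData(pairs, num.non = 1e6, des.mprop = 0.1, keep.mprop = TRUE),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))  # record the warning text
    invokeRestart("muffleWarning")         # and continue with the split
  }
)

msgs                 # warning text collected during the call, if any
nrow(l$train$pairs)  # size of the training set actually obtained
```

Capturing the warning this way keeps the split result available for inspection, whereas promoting warnings to errors would discard it.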
Author(s)
Andreas Borg, Murat Sariyar
See Also
genSamples for generating training data based on
unsupervised classification.
Examples
data(RLdata500)
pairs=compare.dedup(RLdata500, identity=identity.RLdata500,
blockfld=list(1,3,5,6,7))
# split into halves, do not enforce match ratio
l=splitData(pairs, prop=0.5)
summary(l$train)
summary(l$valid)
# split into 1/3 and 2/3, retain match ratio
l=splitData(pairs, prop=1/3, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)
# generate a training set with 100 non-matches and 10 matches
l=splitData(pairs, num.non=100, des.mprop=0.1, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)