trainSupv {RecordLinkage}    R Documentation
Train a Classifier
Description
Trains a classifier for supervised classification of record pairs.
Usage
trainSupv(rpairs, method, use.pred = FALSE, omit.possible = TRUE,
convert.na = TRUE, include.data = FALSE, ...)
Arguments
rpairs
Object of class RecLinkData. The training data.
method
A character vector. The classification method to use.
use.pred
Logical. Whether to use the results of an unsupervised classification instead of the true matching status.
omit.possible
Logical. Whether to remove pairs labeled as possible links or with unknown status.
convert.na
Logical. Whether to convert NAs in the comparison patterns to 0.
include.data
Logical. Whether to include the training data in the result object.
...
Further arguments passed to the underlying training method.
Details
The given data set is used as training data for a supervised classification. Either the true matching status has to be known for a sufficient number of data pairs, or the data must have been classified previously, e.g. by using emClassify or classifyUnsup. In the latter case, argument use.pred has to be set to TRUE.
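For illustration, a minimal sketch of the second workflow: the pairs are first labeled by classifyUnsup and then used for training with use.pred = TRUE. The blocking fields and the k-means method are arbitrary choices for this sketch, not requirements of trainSupv.

library(RecordLinkage)
data(RLdata500)
# Comparison patterns built without using the known matching status
pairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7))
# Label the pairs with an unsupervised classifier first ...
unsup <- classifyUnsup(pairs, method = "kmeans")
# ... then train a supervised model on those predicted labels
model <- trainSupv(unsup, method = "rpart", use.pred = TRUE)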
A classification method has to be provided as a character string (factors are converted to character) through argument method.
The supported classifiers are:

"svm"      Support vector machine, see svm.
"rpart"    Recursive partitioning tree, see rpart.
"ada"      Stochastic boosting model, see ada.
"bagging"  Bagging with classification trees, see bagging.
"nnet"     Single-hidden-layer neural network, see nnet.
"bumping"  A bootstrap-based method using classification trees, see below.
Arguments in ... are passed to the corresponding training function.
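For example, tuning parameters of the underlying classifier can be supplied directly in the call. In the sketch below, kernel and cost are arguments of svm from package e1071, the chosen values are purely illustrative, and pairs denotes training data with known matching status:

# Pass svm-specific arguments through '...'
model_svm <- trainSupv(pairs, method = "svm", kernel = "radial", cost = 10)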
Most classifiers cannot handle NAs in the data, so by default these are converted to 0 before training.
With omit.possible = TRUE, possible links and pairs with unknown status are excluded from the training set. Setting this argument to FALSE allows three-class classification (links, non-links, and possible links), but the results tend to be poor.
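A brief sketch of the three-class case; rpairs_lab is a hypothetical data set whose pairs have already been labeled as links, non-links, or possible links (e.g. by emClassify), so use.pred = TRUE is set as well:

# Keep possible links in the training set (three-class classification)
model3 <- trainSupv(rpairs_lab, method = "rpart",
                    use.pred = TRUE, omit.possible = FALSE)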
Leaving include.data = FALSE saves memory; setting it to TRUE can be useful for saving the classifier while keeping track of the underlying training data.
Bumping (an acronym for “Bootstrap umbrella of model parameters”) is an ensemble method described by Tibshirani and Knight, 1999. As in bagging, multiple classifiers are trained on bootstrap samples of the training set. The key difference is that new data are classified not by the aggregated decision of all classifiers (e.g. by majority vote) but by the single model that performs best on the whole training set. In combination with classification trees as underlying classifiers, this approach allows good interpretability of the trained model while being more stable against outliers than traditionally induced decision trees. The number of bootstrap samples can be controlled by supplying the argument n.bootstrap, which defaults to 25.
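For instance, assuming training pairs as in the Examples section below, the number of bootstrap samples might be raised from its default; the value 50 is arbitrary:

# Bumping with 50 bootstrap samples instead of the default 25
model_bump <- trainSupv(pairs, method = "bumping", n.bootstrap = 50)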
Value
An object of class RecLinkClassif with the following components:

train
If include.data is TRUE, a copy of the training data rpairs.
model
The model returned by the underlying training function.
method
A copy of the argument method.
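A short sketch of inspecting these components, assuming a model fitted as in the Examples section with include.data = TRUE and the usual list-style access:

model <- trainSupv(pairs, method = "rpart", include.data = TRUE)
model$method          # the method string, here "rpart"
class(model$model)    # the underlying rpart fit
str(model$train)      # training data, kept because include.data = TRUE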
Author(s)
Andreas Borg, Murat Sariyar
References
Tibshirani R, Knight K: Model search by bootstrap “bumping”. Journal of Computational and Graphical Statistics 8(1999):671–686.
See Also
classifySupv for classifying with the trained model, classifyUnsup for unsupervised classification.
Examples
# Train an rpart decision tree with additional parameter minsplit
library(RecordLinkage)
data(RLdata500)
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(1, 3, 5, 6, 7))
model <- trainSupv(pairs, method = "rpart", minsplit = 5)
summary(model)
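The fitted model would then typically be passed to classifySupv; a brief continuation of the example, re-using the training pairs here only for illustration (a held-out set would be preferable in practice):

# Apply the trained tree to record pairs
result <- classifySupv(model, newdata = pairs)
summary(result)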