setred {ssc}    R Documentation

SETRED method

Description

SETRED (SElf-TRaining with EDiting) is a variant of the self-training classification method (as implemented in the function selfTraining) with a different addition mechanism. The SETRED classifier is initially trained with a reduced set of labeled examples. Then, it is iteratively retrained with its own most confident predictions over the unlabeled examples. SETRED uses an amending scheme to avoid the introduction of noisy examples into the enlarged labeled set. For each iteration, the mislabeled examples are identified using the local information provided by the neighborhood graph.

Usage

setred(x, y, x.inst = TRUE, dist = "Euclidean", learner,
  learner.pars = NULL, pred = "predict", pred.pars = NULL,
  theta = 0.1, max.iter = 50, perc.full = 0.7)

Arguments

x

An object that can be coerced to a matrix. This object has two possible interpretations, depending on the value of the x.inst argument: a matrix of training instances in which each row represents a single instance, or a precomputed (distance or kernel) matrix between the training examples.
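As a rough illustration of the two interpretations, the sketch below builds a tiny instance matrix and the corresponding square distance matrix. The object names are hypothetical, and stats::dist is used only to keep the snippet self-contained; it is not necessarily how the package computes distances.

```r
# Three toy instances (rows) described by two attributes (columns).
x.inst.matrix <- matrix(c(0, 0,
                          3, 4,
                          6, 8), ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances as a full square matrix:
# one row and one column per training instance.
x.dist.matrix <- as.matrix(stats::dist(x.inst.matrix, method = "euclidean"))

dim(x.inst.matrix)  # instances by attributes
dim(x.dist.matrix)  # instances by instances
```

The first object is the kind of input expected when x.inst is TRUE; the second is the kind expected when x.inst is FALSE.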

y

A vector with the labels of the training instances. In this vector the unlabeled instances are specified with the value NA.
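A minimal sketch of how such a label vector might be prepared, using made-up labels and positions (all names here are illustrative only):

```r
# Toy labels for six instances.
y <- factor(c("a", "b", "a", "b", "a", "b"))

# Positions to treat as unlabeled (chosen arbitrarily for illustration).
unlabeled <- c(2, 5)

# Copy the labels and hide the chosen ones by setting them to NA.
y.ssl <- y
y.ssl[unlabeled] <- NA

# Indexes of the instances that remain labeled.
labeled.idx <- which(!is.na(y.ssl))
```

The factor levels are preserved after assigning NA, so the class set is still recoverable from y.ssl.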

x.inst

A boolean value that indicates whether x is an instance matrix. Default is TRUE.

dist

A distance function, or the name of a distance available in the proxy package, used to compute the distance matrix when x.inst is TRUE. Default is "Euclidean".

learner

Either a function or a string naming the function for training a supervised base classifier, using a set of instances (or optionally a distance matrix) and its corresponding classes.

learner.pars

A list with additional parameters for the learner function if necessary. Default is NULL.

pred

Either a function or a string naming the function for predicting the probabilities per class, using the base classifier trained with the learner function. Default is "predict".
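To make the expected output shape concrete, the sketch below assumes the prediction function returns a matrix with one row per instance and one column per class, and shows how confidence and predicted class could be read from it. The matrix values are invented, and how many examples the package actually selects per iteration is not covered here.

```r
# Hypothetical per-class probabilities for three unlabeled instances.
prob <- matrix(c(0.90, 0.10,
                 0.55, 0.45,
                 0.20, 0.80),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("a", "b")))

# Confidence of each prediction: probability of the most likely class.
conf <- apply(prob, 1, max)

# Predicted class: name of the column holding the row maximum.
label <- colnames(prob)[max.col(prob, ties.method = "first")]

# Instances ordered from most to least confident prediction.
ranking <- order(conf, decreasing = TRUE)
```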

pred.pars

A list with additional parameters for the pred function if necessary. Default is NULL.

theta

Rejection threshold to test the critical region. Default is 0.1.

max.iter

Maximum number of iterations to execute the self-labeling process. Default is 50.

perc.full

A number between 0 and 1. If the proportion of newly labeled examples reaches this value, the self-training process is stopped. Default is 0.7.

Details

SETRED initiates the self-labeling process by training a model from the original labeled set. In each iteration, the model trained with the learner function is applied to the unlabeled examples through the pred function, and the examples it predicts with the highest confidence are labeled with the predicted class.

The identification of mislabeled examples is performed using a neighborhood graph created from the distance matrix. When x.inst is TRUE, this distance matrix is computed using the dist function. When x.inst is FALSE, the matrix provided in x is used both to train the classifier and to create the neighborhood graph.

Most examples in a neighborhood share the same label, so an example located in a neighborhood with too many neighbors from different classes should be considered problematic. The theta argument controls the confidence required of the candidates selected to enlarge the labeled set: the lower its value, the more restrictive the selection of examples considered good. For more information about the self-labeling process and the rest of the parameters, see selfTraining.
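The neighborhood-based check can be sketched in base R as follows. This is only an approximation based on the cut-edge statistic described by Li and Zhou, not the code used inside ssc: the toy data, the function name cut.edge.pvalue, the edge weights 1 / (1 + distance), and the normal approximation are assumptions made for illustration, and the test is applied to every point rather than only to newly labeled ones.

```r
# Toy data: two well-separated clusters, with one point deliberately mislabeled.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
y <- factor(rep(c("a", "b"), each = 10))
y[3] <- "b"  # point 3 lies in cluster "a" but carries label "b"

D <- as.matrix(stats::dist(x))
n <- nrow(D)

# Relative neighborhood graph: i and j are linked unless some third point
# is closer to both of them than they are to each other.
adj <- matrix(FALSE, n, n)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    others <- setdiff(seq_len(n), c(i, j))
    blocked <- any(pmax(D[i, others], D[j, others]) < D[i, j])
    adj[i, j] <- adj[j, i] <- !blocked
  }
}

# Left-tail probability of the observed cut-edge weight for example i,
# under the null hypothesis that neighbor labels are drawn independently
# according to the overall class proportions.
cut.edge.pvalue <- function(i) {
  nb <- which(adj[i, ])
  w <- 1 / (1 + D[i, nb])                        # closer neighbors weigh more
  differs <- as.character(y[nb]) != as.character(y[i])
  J <- sum(w[differs])                           # observed cut-edge weight
  p.diff <- mean(as.character(y) != as.character(y[i]))
  mu <- p.diff * sum(w)
  sigma <- sqrt(p.diff * (1 - p.diff) * sum(w^2))
  pnorm(J, mean = mu, sd = sigma)
}

p <- sapply(seq_len(n), cut.edge.pvalue)

theta <- 0.1
suspect <- which(p > theta)  # not significantly few cut edges: treated as doubtful
```

An example is kept as good only when it has significantly fewer cut edges than expected by chance, so lowering theta makes acceptance harder, which matches the description of theta above.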

Value

A list object of class "setred" containing:

model

The final base classifier trained using the enlarged labeled set.

instances.index

The indexes of the training instances used to train the model. These indexes include the initial labeled instances and the newly labeled instances, and they are relative to the x argument.

classes

The levels of the y factor.

pred

The function provided in the pred argument.

pred.pars

The list provided in the pred.pars argument.

References

Ming Li and Zhi-Hua Zhou.
Setred: Self-training with editing.
In Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in Computer Science, pages 611-621. Springer Berlin Heidelberg, 2005. ISBN 978-3-540-26076-9. doi: 10.1007/11430919_71.

Examples


library(ssc)

## Load Wine data set
data(wine)

cls <- which(colnames(wine) == "Wine")
x <- wine[, -cls] # instances without classes
y <- wine[, cls] # the classes
x <- scale(x) # scale the attributes

## Prepare data
set.seed(20)
# Use 50% of instances for training
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,] # training instances
ytrain <- y[tra.idx]  # classes of training instances
# Use 70% of train instances as unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA # remove class information of unlabeled instances

# Use the other 50% of instances for inductive testing
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,] # testing instances
yitest <- y[tst.idx] # classes of testing instances

## Example: Training from a set of instances with 1-NN as base classifier.
m1 <- setred(x = xtrain, y = ytrain, dist = "euclidean", 
            learner = caret::knn3, 
            learner.pars = list(k = 1),
            pred = "predict")
pred1 <- predict(m1, xitest)
table(pred1, yitest)

## Example: Training from a distance matrix with 1-NN as base classifier.
# Compute distances between training instances
library(proxy)
D <- dist(x = xtrain, method = "euclidean", by_rows = TRUE)

m2 <- setred(x = D, y = ytrain, x.inst = FALSE,
            learner = ssc::oneNN, 
            pred = "predict",
            pred.pars = list(distance.weighting = "none"))
ditest <- proxy::dist(x = xitest, y = xtrain[m2$instances.index,],
                      method = "euclidean", by_rows = TRUE)
pred2 <- predict(m2, ditest)
table(pred2, yitest)

## Example: Training from a set of instances with SVM as base classifier.
learner <- e1071::svm
learner.pars <- list(type = "C-classification", kernel="radial", 
                     probability = TRUE, scale = TRUE)
pred <- function(m, x){
  r <- predict(m, x, probability = TRUE)
  prob <- attr(r, "probabilities")
  prob
}
m3 <- setred(x = xtrain, y = ytrain, dist = "euclidean", 
             learner = learner, 
             learner.pars = learner.pars, 
             pred = pred)
pred3 <- predict(m3, xitest)
table(pred3, yitest)

## Example: Training from a set of instances with Naive-Bayes as base classifier.
m4 <- setred(x = xtrain, y = ytrain, dist = "euclidean",
             learner = function(x, y) e1071::naiveBayes(x, y), 
             pred = "predict",
             pred.pars = list(type = "raw"))
pred4 <- predict(m4, xitest)
table(pred4, yitest)

## Example: Training from a set of instances with C5.0 as base classifier.
m5 <- setred(x = xtrain, y = ytrain, dist = "euclidean",
             learner = C50::C5.0, 
             pred = "predict",
             pred.pars = list(type = "prob"))
pred5 <- predict(m5, xitest)
table(pred5, yitest)



[Package ssc version 2.1-0 Index]