setred {SSLR} | R Documentation |
General Interface for SETRED model
Description
SETRED (SElf-TRaining with EDiting) is a variant of the self-training
classification method (as implemented in the function selfTraining)
with a different addition mechanism.
The SETRED classifier is initially trained with a
reduced set of labeled examples. Then, it is iteratively retrained with its own most
confident predictions over the unlabeled examples. SETRED uses an amending scheme
to avoid the introduction of noisy examples into the enlarged labeled set. For each
iteration, the mislabeled examples are identified using the local information provided
by the neighborhood graph.
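The plain self-training loop that SETRED builds on can be sketched in base R as follows. This is only an illustration, not the SSLR implementation: the nearest-mean learner `predict_prob`, the helper `self_train`, the `per.iter` argument, and the reading of `perc.full` as "fraction of the initially unlabeled instances that have been labeled" are all assumptions made for the example. The editing step that distinguishes SETRED is deliberately left out.

```r
# Toy 1-D data; NA marks an unlabeled instance.
x <- c(0.0, 0.2, 0.4, 0.6, 4.0, 4.2, 4.4, 4.6)
y <- c("a", NA, NA, NA, "b", NA, NA, NA)

# Hypothetical base learner: class probabilities from distances to class means.
predict_prob <- function(x_lab, y_lab, x_new) {
  centers <- tapply(x_lab, y_lab, mean)
  scores <- exp(-abs(outer(x_new, as.vector(centers), "-")))
  prob <- scores / rowSums(scores)
  colnames(prob) <- names(centers)
  prob
}

# Hypothetical self-training loop (no editing step).
self_train <- function(x, y, max.iter = 50, perc.full = 0.7, per.iter = 1) {
  n_start <- sum(is.na(y))  # number of unlabeled instances at the start
  for (iter in seq_len(max.iter)) {
    unl <- which(is.na(y))
    added <- n_start - length(unl)
    # Stop when nothing is left or enough instances have been labeled.
    if (length(unl) == 0 || added / n_start >= perc.full) break
    lab <- which(!is.na(y))
    prob <- predict_prob(x[lab], y[lab], x[unl])
    conf <- apply(prob, 1, max)
    # Take the most confident predictions and add them to the labeled set.
    best <- order(conf, decreasing = TRUE)[seq_len(min(per.iter, length(unl)))]
    y[unl[best]] <- colnames(prob)[max.col(prob[best, , drop = FALSE],
                                           ties.method = "first")]
  }
  y
}

y_full  <- self_train(x, y)                # stops through perc.full
y_short <- self_train(x, y, max.iter = 2)  # stops through max.iter
```

In this sketch `max.iter` bounds the number of retraining rounds, while `perc.full` ends the loop early once enough of the unlabeled pool has been absorbed.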
Usage
setred(
dist = "Euclidean",
learner,
theta = 0.1,
max.iter = 50,
perc.full = 0.7,
D = NULL
)
Arguments
dist |
A distance function or the name of a distance available
in the |
learner |
A model from the parsnip package for training a supervised base classifier using a set of instances. This model needs to provide probability predictions (or optionally a distance matrix) and their corresponding classes. |
theta |
Rejection threshold to test the critical region. Default is 0.1. |
max.iter |
Maximum number of iterations to execute the self-labeling process. Default is 50. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-training process is stopped. Default is 0.7. |
D |
A distance matrix between all the training instances. This matrix is used to construct the neighborhood graph. Default is NULL, which means the method creates the matrix using the dist argument.
Details
SETRED initiates the self-labeling process by training a model from the original
labeled set. In each iteration, the learner function detects unlabeled
examples for which it makes the most confident prediction and labels those examples
according to the pred function. The identification of mislabeled examples is
performed using a neighborhood graph created from the distance matrix.
Most examples share the same label as their neighbors. So if an example is located
in a neighborhood with too many neighbors from different classes, this example should
be considered problematic. The value of the theta argument controls the confidence
of the candidates selected to enlarge the labeled set. The lower this value is, the more
restrictive is the selection of the examples that are considered good.
For more information about the self-labeled process and the rest of the parameters, please
see selfTraining.
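The neighborhood test described above can be illustrated with a small base R sketch. It is not the SSLR code: it assumes a relative neighborhood graph, edge weights of 1 / (1 + distance), and a normal approximation for the weighted count of differently labeled neighbors, which is one reading of the scheme in the referenced paper. The helpers `rng_adjacency` and `left_tail_p` are hypothetical names introduced here.

```r
# Toy 1-D data: the 4th point lies in cluster "a" but carries label "b".
x <- c(0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3)
y <- c("a", "a", "a", "b", "b", "b", "b", "b")
D <- as.matrix(dist(x))  # Euclidean distance matrix

# Relative neighborhood graph (assumed graph type): i and j are linked when
# no third point k is closer to both of them than they are to each other.
rng_adjacency <- function(D) {
  n <- nrow(D)
  A <- matrix(FALSE, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      others <- setdiff(seq_len(n), c(i, j))
      blocked <- any(pmax(D[i, others], D[j, others]) < D[i, j])
      A[i, j] <- A[j, i] <- !blocked
    }
  }
  A
}

# Left-tail p-value of the weighted number of neighbors with a different label.
# Small values mean "fewer disagreeing neighbors than chance would produce".
left_tail_p <- function(i, D, A, y) {
  nb <- which(A[i, ])
  w <- 1 / (1 + D[i, nb])                 # assumed edge weights
  J <- sum(w[y[nb] != y[i]])              # weight of edges to other classes
  p_same <- mean(y == y[i])               # class proportion of i's label
  mu <- (1 - p_same) * sum(w)             # expectation under random labels
  sigma <- sqrt(p_same * (1 - p_same) * sum(w^2))
  pnorm(J, mean = mu, sd = sigma)
}

A <- rng_adjacency(D)
p_vals <- sapply(seq_along(x), left_tail_p, D = D, A = A, y = y)
theta <- 0.1
good <- p_vals < theta  # candidates that would be kept under this sketch
```

Under these assumptions the suspicious 4th point receives the largest p-value of all points, so it is the last one to be accepted as theta grows. A smaller theta shrinks the acceptance region, which matches the statement that lower values make the selection more restrictive.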
Value
(When the model is fitted) A list object of class "setred" containing:
- model
The final base classifier trained using the enlarged labeled set.
- instances.index
The indexes of the training instances used to train the model. These indexes include the initial labeled instances and the newly labeled instances. Those indexes are relative to the x argument.
- classes
The levels of the y factor.
- pred
The function provided in the pred argument.
- pred.pars
The list provided in the pred.pars argument.
References
Ming Li and Zhi-Hua Zhou.
Setred: Self-training with editing.
In Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in
Computer Science, pages 611-621. Springer Berlin Heidelberg, 2005.
ISBN 978-3-540-26076-9. doi: 10.1007/11430919_71.
Examples
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)
data(wine)
set.seed(1)
train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test <- wine[-train.index,]
cls <- which(colnames(wine) == "Wine")
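# Keep labels for 20% of the training instances; mark the rest as unlabeled (NA)
# The partition is drawn from train so the indexes match the rows of train
labeled.index <- createDataPartition(train$Wine, p = .2, list = FALSE)
train[-labeled.index,cls] <- NA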
# We need a model with probability predictions from parsnip
# https://tidymodels.github.io/parsnip/articles/articles/Models.html
# It should be with mode = "classification"
# For example, with Random Forest
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")
m <- setred(learner = rf,
            theta = 0.1,
            max.iter = 2,
            perc.full = 0.7) %>% fit(Wine ~ ., data = train)
# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
# Another example, with dist matrix
distance <- as.matrix(proxy::dist(train[,-cls], method = "Euclidean",
                                  by_rows = TRUE, diag = TRUE, upper = TRUE))
m <- setred(learner = rf,
            theta = 0.1,
            max.iter = 2,
            perc.full = 0.7,
            D = distance) %>% fit(Wine ~ ., data = train)
# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)