CrossValidationSSL {RSSL}
Cross-validation in semi-supervised setting
Description
Cross-validation for semi-supervised learning, in which the dataset is split into three parts: labeled training objects, unlabeled training objects and validation objects. This can be used to evaluate different approaches to semi-supervised classification under the assumption that labels are missing at random. Different cross-validation schemes are implemented; see Details.
Usage
CrossValidationSSL(X, y, ...)
## S3 method for class 'list'
CrossValidationSSL(X, y, ..., verbose = FALSE, mc.cores = 1)
## S3 method for class 'matrix'
CrossValidationSSL(X, y, classifiers, measures = list(Error = measure_error),
  k = 10, repeats = 1, verbose = FALSE, leaveout = "test", n_labeled = 10,
  prop_unlabeled = 0.5, time = TRUE, pre_scale = FALSE, pre_pca = FALSE,
  n_min = 1, low_level_cores = 1, ...)
Arguments
X: design matrix of the labeled objects
y: vector with labels
...: arguments passed to underlying functions
verbose: logical; controls the verbosity of the output
mc.cores: integer; number of cores to be used
classifiers: list; classifiers to cross-validate
measures: named list of functions giving the measures to be used
k: integer; number of folds in the cross-validation
repeats: integer; number of repeated assignments to folds
leaveout: either "labeled" or "test"; see Details
n_labeled: integer; number of labeled examples, used in both leaveout modes
prop_unlabeled: numeric; proportion of unlabeled objects
time: logical; whether execution time should be saved
pre_scale: logical; whether the features should be scaled before the dataset is used
pre_pca: logical; whether the features should be preprocessed using a PCA step
n_min: integer; minimum number of labeled objects per class
low_level_cores: integer; number of cores to use to compute repeats of the learning curve
Details
The input to this function can be either: a dataset in the form of a feature matrix and a factor containing the labels; a dataset in the form of a formula and a data.frame; or a named list of either of these two options.
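A minimal sketch of the named-list option, assuming X is a named list of formulas (or design matrices) and y is a matching named list of data.frames (or label factors); this pairing is an assumption, not stated on this page. classifiers and measures are as defined in the Examples below.

datasets  <- list("Iris" = Species ~ .)   # assumed: one formula per dataset
responses <- list("Iris" = iris)          # assumed: the matching data.frame
# CrossValidationSSL(datasets, responses,
#                    classifiers = classifiers, measures = measures)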
There are two main modes in which the cross-validation can be carried out, controlled by the leaveout parameter.
When leaveout is "labeled", the folds are formed by non-overlapping labeled training sets of a user-specified size. Each of these folds is used as a labeled set, while the rest of the objects are split into an unlabeled set and a test set, controlled by the prop_unlabeled parameter. Note that objects can be used multiple times for testing, when training on a different fold, while other objects may never be used for testing.
The "test" option of leaveout
, on the other hand, uses the folds as the test sets. This means every object will be used as a test object exactly once. The remaining objects in each training iteration are split randomly into a labeled and an unlabeled part, where the number of the labeled objects is controlled by the user through the n_labeled parameter.
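The corresponding sketch for the "test" mode, under the same assumptions (150 objects, k = 10 non-overlapping folds):

n <- 150; k <- 10; n_labeled <- 10
n_test <- n / k                      # 15 test objects per fold
n_train <- n - n_test                # 135 remaining training objects
n_unlabeled <- n_train - n_labeled   # 125 of these stay unlabeled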
Examples
library(RSSL)

# Feature matrix (without intercept) and label vector from iris
X <- model.matrix(Species ~ . - 1, data = iris)
y <- iris$Species

# Classifiers to cross-validate: a supervised baseline and a
# semi-supervised self-learning wrapper around it
classifiers <- list(
  "LS" = function(X, y, X_u, y_u) {
    LeastSquaresClassifier(X, y, lambda = 0) },
  "EM" = function(X, y, X_u, y_u) {
    SelfLearning(X, y, X_u, method = LeastSquaresClassifier) }
)

# Performance measures recorded for each fold
measures <- list(
  "Accuracy"       = measure_accuracy,
  "Loss"           = measure_losstest,
  "Loss labeled"   = measure_losslab,
  "Loss Lab+Unlab" = measure_losstrain
)
# Cross-validation making sure test folds are non-overlapping
cvresults1 <- CrossValidationSSL(X, y,
                                 classifiers = classifiers,
                                 measures = measures,
                                 leaveout = "test", k = 10,
                                 repeats = 2, n_labeled = 10)
print(cvresults1)
plot(cvresults1)
# Cross-validation making sure labeled sets are non-overlapping
cvresults2 <- CrossValidationSSL(X, y,
                                 classifiers = classifiers,
                                 measures = measures,
                                 leaveout = "labeled", k = 10,
                                 repeats = 2, n_labeled = 10,
                                 prop_unlabeled = 0.5)
print(cvresults2)
plot(cvresults2)