adaSvmBenchmark {AdaSampling} | R Documentation |
Benchmarking AdaSampling efficacy on noisy labelled data.
Description
adaSvmBenchmark()
compares the performance of an AdaSampling-enhanced SVM
(support vector machine) classifier with that of an SVM classifier
on its own. It requires a matrix of features (extracted from a labelled dataset)
and two vectors: one of true labels and one of labels with noise added as desired.
It runs the SVM classifiers and returns a matrix that reports the specificity
(Sp), sensitivity (Se) and F1 score for each of four conditions:
"Original" (classifying with true labels), "Baseline" (classifying with
noisy labels), "AdaSingle" (classifying using AdaSampling) and
"AdaEnsemble" (classifying using AdaSampling in conjunction with
an ensemble of models).
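As a rough illustration of the three reported measures, the sketch below computes Sp, Se and F1 from a vector of true labels and a vector of predicted labels. The helper classMetrics() is hypothetical and not part of the AdaSampling package. It assumes labels are coded 0/1 with 1 as the positive class, and it does not claim to reproduce how adaSvmBenchmark() computes its output internally.

```r
# Hypothetical helper (not part of AdaSampling): Sp, Se and F1 from 0/1 labels.
# Assumption: 1 marks the positive class, 0 the negative class.
classMetrics <- function(truth, pred) {
  tp <- sum(truth == 1 & pred == 1)  # true positives
  tn <- sum(truth == 0 & pred == 0)  # true negatives
  fp <- sum(truth == 0 & pred == 1)  # false positives
  fn <- sum(truth == 1 & pred == 0)  # false negatives
  c(Sp = tn / (tn + fp),               # specificity
    Se = tp / (tp + fn),               # sensitivity
    F1 = 2 * tp / (2 * tp + fp + fn))  # F1 score
}

# Toy labels, made up for illustration only
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred  <- c(1, 0, 1, 0, 0, 1, 0, 1)
metrics <- classMetrics(truth, pred)
print(metrics)
```

With these toy labels there are three true positives, three true negatives, one false positive and one false negative, so all three measures equal 0.75.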
Usage
adaSvmBenchmark(data.mat, data.cls, data.cls.truth, cvSeed, C = 50,
sampleFactor = 1)
Arguments
data.mat |
a rectangular matrix or data frame that can be coerced to a matrix, containing the features of the dataset, without class labels. Rownames (possibly containing unique identifiers) will be ignored. |
data.cls |
a numeric vector containing class labels for the dataset
with added noise.
Must be in the same order and of the same length as |
data.cls.truth |
a numeric vector of true class labels for
the dataset. Must be in the same order and of the same length as |
cvSeed |
sets the seed for cross-validation. |
C |
sets how many times to run the classifier for the AdaEnsemble condition. See Description above. |
sampleFactor |
provides a control on the sample size for resampling. |
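The role of a cross-validation seed can be illustrated with a small sketch. The helper assignFolds() and the choice of 5 folds are assumptions made for this example only. How adaSvmBenchmark() partitions the data internally is not stated on this page, so the code only shows why fixing a seed makes a random fold assignment repeatable.

```r
# Hypothetical helper (not part of AdaSampling): randomly assign n samples
# to k folds, reproducibly for a given seed.
assignFolds <- function(n, k, seed) {
  set.seed(seed)
  # Build a balanced vector of fold ids, then shuffle it
  sample(rep(seq_len(k), length.out = n))
}

foldsA <- assignFolds(n = 20, k = 5, seed = 1)
foldsB <- assignFolds(n = 20, k = 5, seed = 1)

# The same seed yields the same partition
print(identical(foldsA, foldsB))
# Each of the 5 folds receives 4 of the 20 samples
print(table(foldsA))
```

Passing the same cvSeed on repeated runs should therefore make benchmark results comparable across runs, to the extent that the seed governs all random steps.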
Details
AdaSampling is an adaptive sampling-based noise reduction method
for data with noisy class labels. It acts as a wrapper for
traditional classifiers, such as support vector machines,
k-nearest neighbours, logistic regression, and linear discriminant
analysis. For more details see ?adaSample().
This function evaluates the AdaSampling procedure on a labelled dataset
to which noise has been added, by running support vector machines on
the original and the noisy dataset. Note that this function is for
benchmarking AdaSampling performance using what is assumed to be
a well-labelled dataset. To run AdaSampling on a noisy dataset,
please see adaSample().
Value
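A performance matrix giving the specificity (Sp), sensitivity (Se) and F1 score for each of the four conditions listed in the Description.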
References
Yang, P., Liu, W., Yang, J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279
Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J. (2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984
Examples
# Load the example dataset
data(brca)
head(brca)
# First, clean up the dataset to transform into the required format.
brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric)
brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)})
rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_")
# Introduce 40% noise to positive class and 30% noise to the negative class
set.seed(1)
pos <- which(brca.cls == 1)
neg <- which(brca.cls == 0)
brca.cls.noisy <- brca.cls
brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1
# benchmark classification performance with different approaches
adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)
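Before benchmarking, it can help to confirm that the injected noise matches the intended rates. The sketch below repeats the label-flipping step from the example above on a synthetic label vector (50 positives and 100 negatives, made up for illustration) so that it runs without the brca data. The variable names flipTable, posFlipRate and negFlipRate are introduced here and are not part of the package.

```r
# Synthetic 0/1 labels standing in for a real dataset
# (assumption: 1 = positive class, 0 = negative class).
set.seed(1)
truth <- rep(c(1, 0), times = c(50, 100))
pos <- which(truth == 1)
neg <- which(truth == 0)

# Flip 40% of the positives and 30% of the negatives, as in the example above
noisy <- truth
noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
noisy[sample(neg, floor(length(neg) * 0.3))] <- 1

# Cross-tabulate true against noisy labels to see how many were flipped
flipTable <- table(truth = truth, noisy = noisy)
print(flipTable)

# Realised flip rate within each true class
posFlipRate <- mean(noisy[pos] != truth[pos])
negFlipRate <- mean(noisy[neg] != truth[neg])
print(c(positive = posFlipRate, negative = negFlipRate))
```

The same check can be applied to brca.cls and brca.cls.noisy before they are passed to adaSvmBenchmark().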