R: Benchmarking AdaSampling efficacy on noisy labelled data.

adaSvmBenchmark {AdaSampling}

R Documentation

Benchmarking AdaSampling efficacy on noisy labelled data.

Description

adaSvmBenchmark() allows a comparison between the performance of an AdaSampling-enhanced SVM (support vector machine)- classifier against the SVM-classifier on its own. It requires a matrix of features (extracted from a labelled dataset), and two vectors of true labels and labels with noise added as desired. It runs an SVM classifier and returns a matrix which displays the specificity (Sp), sensitivity (Se) and F1 score for each of four conditions: "Original" (classifying with true labels), "Baseline" (classifying with noisy labels), "AdaSingle" (classifying using AdaSampling) and "AdaEnsemble" (classifying using AdaSampling in conjunction with an ensemble of models).

Usage

adaSvmBenchmark(data.mat, data.cls, data.cls.truth, cvSeed, C = 50,
  sampleFactor = 1)

Arguments

`data.mat`	a rectangular matrix or data frame that can be coerced to a matrix, containing the features of the dataset, without class labels. Rownames (possibly containing unique identifiers) will be ignored.
`data.cls`	a numeric vector containing class labels for the dataset with added noise. Must be in the same order and of the same length as `data.mat`. Labels must be 1 for positive observations, and 0 for negative observations.
`data.cls.truth`	a numeric vector of true class labels for the dataset. Must be the same order and of the same length as `data.mat`. Labels must be 1, for positive observations, and 0 for negative observations.
`cvSeed`	sets the seed for cross-validation.
`C`	sets how many times to run the classifier, for the AdaEnsemble condition. See Description above.
`sampleFactor`	provides a control on the sample size for resampling.

Details

AdaSampling is an adaptive sampling-based noise reduction method to deal with noisy class labelled data, which acts as a wrapper for traditional classifiers, such as support vector machines, k-nearest neighbours, logistic regression, and linear discriminant analysis. For more details see ?adaSample().

This function runs evaluates the AdaSampling procedure by adding noise to a labelled dataset, and then running support vector machines on the original and the noisy dataset. Note that this function is for benchmarking AdaSampling performance using what is assumed to be a well-labelled dataset. In order to run AdaSampling on a noisy dataset, please see adaSample().

Value

performance matrix

References

Yang, P., Liu, W., Yang. J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279

Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J.(2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984

Examples

# Load the example dataset
data(brca)
head(brca)

# First, clean up the dataset to transform into the required format.
brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric)
brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)})
rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_")

# Introduce 40% noise to positive class and 30% noise to the negative class
set.seed(1)
pos <- which(brca.cls == 1)
neg <- which(brca.cls == 0)
brca.cls.noisy <- brca.cls
brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1

# benchmark classification performance with different approaches

adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)

[Package AdaSampling version 1.3 Index]