R: Implementation of AdaSampling for positive unlabelled and...

adaSample {AdaSampling}

R Documentation

Implementation of AdaSampling for positive unlabelled and label noise learning.

Description

adaSample() applies the AdaSampling procedure to reduce noise in the training set, and subsequently trains a classifier from the new training set. For each row (observation) in the test set, it returns the probabilities of it being a positive ("P) or negative ("N") instance, as a two column data frame.

Usage

adaSample(Ps, Ns, train.mat, test.mat, classifier = "svm", s = 1,
  C = 1, sampleFactor = 1, weights = NULL)

Arguments

`Ps`	names (each instance in the data has to be named) of positive examples
`Ns`	names (each instance in the data has to be named) of negative examples
`train.mat`	training data matrix, without class labels.
`test.mat`	test data matrix, without class labels.
`classifier`	classification algorithm to be used for learning. Current options are support vector machine, `"svm"`, k-nearest neighbour, `"knn"`, logistic regression `"logit"`, linear discriminant analysis `"lda"`, and feature weighted knn, `"wKNN"`.
`s`	sets the seed.
`C`	sets how many times to run the classifier, C>1 induces an ensemble learning model.
`sampleFactor`	provides a control on the sample size for resampling.
`weights`	feature weights, required when using weighted knn.

Details

adaSample() is an adaptive sampling-based noise reduction method to deal with noisy class labelled data, which acts as a wrapper for traditional classifiers, such as support vector machines, k-nearest neighbours, logistic regression, and linear discriminant analysis.

This process is used to build up a noise-minimized training set that is derived by iteratively resampling the training set, (train) based on probabilities derived after its classification.

This sampled training set is then used to train a classifier, which is then executed on the test set. adaSample() returns a series of predictions for each row of the test set.

Note that this function does not evaluate the quality of the model and thus does not compare its output to true values of the test set. To assess please see adaSvmBenchmark().

Value

a two column matrix providing classification probabilities of each sample with respect to positive and negative classes

References

Yang, P., Liu, W., Yang. J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279

Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J.(2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984

Examples

# Load the example dataset
data(brca)
head(brca)

# First, clean up the dataset to transform into the required format.
brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric)
brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)})
rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_")

# Introduce 40% noise to positive class and 30% noise to the negative class
set.seed(1)
pos <- which(brca.cls == 1)
neg <- which(brca.cls == 0)
brca.cls.noisy <- brca.cls
brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1

# Identify positive and negative examples from the noisy dataset
Ps <- rownames(brca.mat)[which(brca.cls.noisy == 1)]
Ns <- rownames(brca.mat)[which(brca.cls.noisy == 0)]

# Apply AdaSampling method on the noisy data
brca.preds <- adaSample(Ps, Ns, train.mat=brca.mat, test.mat=brca.mat, classifier = "knn")
head(brca.preds)

# Orignal accuracy from the labels
accuracy <- sum(brca.cls.noisy == brca.cls) / length(brca.cls)
accuracy

# Accuracy after applying AdaSampling method
accuracyWithAdaSample <- sum(ifelse(brca.preds[,"P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls)
accuracyWithAdaSample

[Package AdaSampling version 1.3 Index]