gaum_bor_ln {noisemodel}

R Documentation

Gaussian-mixture borderline label noise

Description

Introduction of Gaussian-mixture borderline label noise into a classification dataset.

Usage

## Default S3 method:
gaum_bor_ln(
  x,
  y,
  level,
  mean = c(0, 2),
  sd = c(sqrt(0.5), sqrt(0.5)),
  w = c(0.5, 0.5),
  k = 1,
  sortid = TRUE,
  ...
)

## S3 method for class 'formula'
gaum_bor_ln(formula, data, ...)

Arguments

`x`	a data frame of input attributes.
`y`	a factor vector with the output class of each sample.
`level`	a double in [0,1] with the noise level to be introduced.
`mean`	a double vector with the mean of each Gaussian distribution (default: `c(0, 2)`).
`sd`	a double vector with the standard deviation of each Gaussian distribution (default: `c(sqrt(0.5), sqrt(0.5))`).
`w`	a double vector with the weight of each Gaussian distribution (default: `c(0.5, 0.5)`).
`k`	an integer with the number of nearest neighbors to be used (default: 1).
`sortid`	a logical indicating if the indices must be sorted at the output (default: `TRUE`).
`...`	other options to pass to the function.
`formula`	a formula with the output class and, at least, one input attribute.
`data`	a data frame in which to interpret the variables in the formula.

Details

Gaussian-mixture borderline label noise uses an SVM to induce the decision border in the dataset. For each sample, the distance to this decision border is computed. A Gaussian mixture distribution with parameters (mean, sd) and weights w is then used to evaluate the probability density function at each distance. Finally, (level·100)% of the samples in the dataset are randomly selected to be mislabeled according to their probability density values. For each noisy sample, the new label is the majority class among its k nearest neighbors that belong to a different class.
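
The density and selection steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helpers `mixture_density` and `pick_noisy` are hypothetical names, the distances are simulated rather than obtained from an SVM, and drawing samples with probability proportional to the density is only one plausible reading of "according to their probability density values".

```r
# Hypothetical helper: Gaussian-mixture density evaluated at distances d.
# Defaults mirror the documented defaults of mean, sd and w.
mixture_density <- function(d, mean = c(0, 2),
                            sd = c(sqrt(0.5), sqrt(0.5)),
                            w = c(0.5, 0.5)) {
  stopifnot(length(mean) == length(sd), length(sd) == length(w))
  dens <- numeric(length(d))
  for (j in seq_along(w)) {
    # weighted contribution of the j-th Gaussian component
    dens <- dens + w[j] * dnorm(d, mean = mean[j], sd = sd[j])
  }
  dens
}

# Hypothetical helper: choose round(level * n) sample indices without
# replacement, weighting each sample by its mixture density (assumption).
pick_noisy <- function(d, level, ...) {
  n_noisy <- round(level * length(d))
  p <- mixture_density(d, ...)
  sort(sample(seq_along(d), size = n_noisy, prob = p))
}

set.seed(1)
# Simulated distances to the decision border (stand-in for SVM output)
dist_to_border <- abs(rnorm(100, mean = 1.5))
idx <- pick_noisy(dist_to_border, level = 0.1)
length(idx)  # 10 indices, i.e. (level * 100)% of 100 samples
```

With the default parameters, the mixture places density peaks at distances 0 and 2, so samples lying on the border or about two units away from it are the most likely to be selected in this sketch.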

Value

An object of class ndmodel with elements:

`xnoise`	a data frame with the noisy input attributes.
`ynoise`	a factor vector with the noisy output class.
`numnoise`	an integer vector with the number of noisy samples per class.
`idnoise`	a list of integer vectors with the indices of noisy samples.
`numclean`	an integer vector with the number of clean samples per class.
`idclean`	a list of integer vectors with the indices of clean samples.
`distr`	an integer vector with the number of samples per class in the original data.
`model`	the full name of the noise introduction model used.
`param`	a list of the argument values.
`call`	the function call.

Note

Noise model adapted from the paper in References, using an SVM with a linear kernel as the classifier, a mislabeling process based on the neighborhood of noisy samples, and a noise level to control the number of errors in the data.

References

J. Bootkrajang and J. Chaijaruwanich. Towards instance-dependent label noise-tolerant classification: a probabilistic approach. Pattern Analysis and Applications, 23(1):95-111, 2020. doi:10.1007/s10044-018-0750-z.

Examples

# load the dataset
data(iris2D)

# usage of the default method
set.seed(9)
outdef <- gaum_bor_ln(x = iris2D[,-ncol(iris2D)], y = iris2D[,ncol(iris2D)], level = 0.1)

# show results
summary(outdef, showid = TRUE)
plot(outdef)

# usage of the method for class formula
set.seed(9)
outfrm <- gaum_bor_ln(formula = Species ~ ., data = iris2D, level = 0.1)

# check the match of noisy indices
identical(outdef$idnoise, outfrm$idnoise)

[Package noisemodel version 1.0.2 Index]